
Author Topic: AVX and SSE support question  (Read 89736 times)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #135 on: December 06, 2017, 08:10:49 am »

Quote
One thing I'm not understanding well is your trick with "movhlps xmm1,xmm0". Is it an issue with the stack? Something escapes me. Can you explain it again?


OK, this is all to do with the return conventions in Linux 64 (the System V x86_64 ABI, to be exact), just as Win64 has its own convention of four register arguments with the rest on the stack.

The spec was kindly sourced by CuriousKit in this post:

That's a little confusing with Linux, because the way it's behaving implies that it's splitting the 128-bit into two, classing the lower half as SSE and the upper half as SSEUP (see pages 15-17 here: http://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf ), but then converting SSEUP to SSE because it thinks it isn't preceded by an SSE argument (which it does... the lower two floats).  Maybe my interpretation is wrong, but it shouldn't need to split it across 2 registers like that.  Can someone with more experience of the Linux ABI shed some light on that?

There are two type identifiers for SSE values passed as parameters or results:
X86_64_SSE_CLASS signifies the first 64 bits (the low eightbyte) of a 128-bit SSE value.
X86_64_SSEUP_CLASS signifies the next 64 bits (the high eightbyte) of a 128-bit SSE value.
There can be more than one SSEUP entry for a 256-bit value.

One thing you have to keep in the back of your mind in any Unix environment when writing code at this level is that you have to take endianness into account and not write code that assumes a single architecture. This seems to be how System V deals with that, and thus how gcc does, and therefore how everyone else does (you would want your libs to link, wouldn't you?).

Anyway as seen from the fpc compiler code:
Code: Pascal
s128real:
  begin
    classes[0].typ:=X86_64_SSE_CLASS;
    classes[0].def:=carraydef.getreusable_no_free(s32floattype,2);
    classes[1].typ:=X86_64_SSEUP_CLASS;
    classes[1].def:=carraydef.getreusable_no_free(s32floattype,2);
    result:=2;
  end;

This is exactly how the compiler sees a 128-bit real. So, in general terms, if we did not use nostackframe then at times the result was placed (and expected) on the stack by the FPC compiler. Unlike on other platforms, the stack did not contain a pointer: the compiler had allocated 128 bits on the stack for the contents of the xmm register to be copied to.

After the return from assembler, the compiler then did a movq on each of the two qwords and placed the X86_64_SSE_CLASS part in low xmm0 and the X86_64_SSEUP_CLASS part in low xmm1. It does this for routines it generates itself. Here is the postamble of the native Pascal version of operator +, which does not even use xmm registers for its calculations.

Code: Pascal
# Var $result located at rsp+0, size=OS_128
.........
# [158] End;
        movq    (%rsp),%xmm0
        # Register xmm1 allocated
        movq    8(%rsp),%xmm1
        leaq    24(%rsp),%rsp
        # Register rsp released
        ret
        # Register xmm0,xmm1 released
.Lc12:

When we use nostackframe, the above postamble does not occur. Therefore we got good values for x and y [low xmm0] but garbage for z and w, because the calling convention was taking whatever was in low xmm1.

So using movhlps xmm1,xmm0 as the last instruction, after whatever you would do if you coded to leave the result in xmm0, ensures that the Unix ABI is conformed to, and we get the right values back ;)
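
A minimal Python sketch of the convention described above, for illustration only. Registers are modelled as lists of four lanes, and the helper names `movhlps` and `caller_reads_result` are hypothetical. It assumes, as this post describes, that the caller rebuilds the vector from the low halves of xmm0 and xmm1.

```python
# Model an xmm register as a list of four 32-bit lanes, lane 0 first.

def movhlps(dst, src):
    """movhlps dst, src: copy the high two lanes of src into the low two lanes of dst."""
    return [src[2], src[3], dst[2], dst[3]]

def caller_reads_result(xmm0, xmm1):
    """Rebuild the 128-bit result the way the caller does in this thread:
    x, y from low xmm0 (SSE class) and z, w from low xmm1 (SSEUP class)."""
    return [xmm0[0], xmm0[1], xmm1[0], xmm1[1]]

result = [1.0, 2.0, 3.0, 4.0]        # x, y, z, w left in xmm0 by the asm routine
stale_xmm1 = [9.0, 9.0, 9.0, 9.0]    # whatever happened to be in xmm1

without_fix = caller_reads_result(result, stale_xmm1)
with_fix = caller_reads_result(result, movhlps(stale_xmm1, result))

print(without_fix)  # z and w come from the stale xmm1
print(with_fix)     # z and w come from the high half of xmm0
```

With the extra movhlps the caller sees all four components; without it, only x and y survive.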

Phew.. long post, I hope this makes sense to you Jerome.

Peter
« Last Edit: December 06, 2017, 09:04:05 am by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

« Reply #136 on: December 06, 2017, 10:54:21 am »
Quote
EDIT: I also tried to do the compare with the SSE4 PTEST instruction, but I don't see how without a jump.

Jerome, I briefly looked at this when coding the AVX unit. Ignoring all the finer points of that post, it would seem you would have to load some sort of mask into an xmm register, do a suitable binary compare on the result of ptest (similar to the xor eax, $f) and set the result based on one of the flags.

For code as simple as we have at the moment, I decided that, as we were 'getting out' of the SSE pipeline anyway, copying the flags to eax and doing an immediate xor (where the mask is carried in the instruction and does not lead to a memory access) would be cheaper than a potentially far access to 128 bits somewhere in memory.

On the other hand, I doubt we need to be as pedantic as that post. Our numbers should in the end represent a point or vector in simple 3D space, where NaNs are errors in logic and the 0 versus -0 case should never arise, because if we were doing 'real' math on points in space we would always use some form of epsilon.

This may be a case for simple and quick routines vs. pedantic routines: allow the choice. TBH, personally I would always go for simple and quick, and test for edge cases before the main calcs where it is needed.

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
« Reply #137 on: December 06, 2017, 02:48:02 pm »
Quote
...
Whether you need a jump or not depends on the code. If you just need to set a result based on the zero flag, then you can use SETZ or SETNZ. There's no straight answer.

Quote
...
This may be a case for simple and quick routines vs. pedantic routines: allow the choice. TBH, personally I would always go for simple and quick, and test for edge cases before the main calcs where it is needed.

Thanks Curiosity, it's a little bit clearer in my mind. And I agree with you, Peter.


Quote
OK, this is all to do with the return conventions in Linux 64 (the System V x86_64 ABI, to be exact), just as Win64 has its own convention of four register arguments with the rest on the stack.
...
When we use nostackframe, the above postamble does not occur. Therefore we got good values for x and y [low xmm0] but garbage for z and w, because the calling convention was taking whatever was in low xmm1.

So using movhlps xmm1,xmm0 as the last instruction, after whatever you would do if you coded to leave the result in xmm0, ensures that the Unix ABI is conformed to, and we get the right values back ;)

Phew.. long post, I hope this makes sense to you Jerome.


Thanks Peter. I asked you this because under Win32 the same behaviour appeared in the MulAdd, MullDiv and Lerp functions, and also under Win64 with Combine2/3. It seems to depend on how the arguments are passed to the function and how the stack is managed. It's as if the compiler "pushes the result onto the stack". Anyway, all are tested and pass the tests. I only have 1h30 free this afternoon. I'll post the code of the unit tests tonight.

Thanks

BeanzMaster

« Reply #138 on: December 06, 2017, 04:03:13 pm »
OK, I'm back and I've finished all the test units with Win32/64 SSE/SSE3/SSE4 and AVX, all with success on my PC. Now we are only missing tests for Linux32.


dicepd

« Reply #139 on: December 06, 2017, 09:16:41 pm »
Given my comment on using epsilon for equality, I thought I would come up with this as a possibility.

Code: Pascal
function TGLZVector4f.IsEqual(constref Other: TGLZVector4f; const Epsilon: single): boolean; assembler; nostackframe; register;
asm
  movaps xmm0,[RDI]
  movaps xmm1, [Other]
  movss xmm2, [Epsilon]
  shufps xmm2,xmm2, $0                            // broadcast Epsilon to all four lanes
  subps xmm0,xmm1
  andps xmm0, [RIP+cSSE_MASK_ABS]                 // |Self - Other| per lane
  cmpps  xmm0, xmm2, cSSE_OPERATOR_LESS_OR_EQUAL  // all-ones in each lane that is within Epsilon
  movmskps eax, xmm0                              // one bit per lane
  xor eax, $f                                     // zero only if all four lanes passed
  setz al
end;
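
The mask logic in the last three instructions can be modelled in Python. This is only a sketch: lanes are plain floats, and `movmskps` and `is_equal` below are hypothetical helper names, not part of the unit.

```python
def movmskps(mask_lanes):
    """Pack one bit per lane (lane 0 -> bit 0), as movmskps does with the lane sign bits."""
    return sum(1 << i for i, passed in enumerate(mask_lanes) if passed)

def is_equal(a, b, epsilon):
    diffs = [abs(x - y) for x, y in zip(a, b)]   # subps, then andps with the abs mask
    within = [d <= epsilon for d in diffs]       # cmpps ..., less-or-equal
    bits = movmskps(within)                      # movmskps eax, xmm0
    return (bits ^ 0xF) == 0                     # xor eax, $f ; setz al

print(is_equal([1.0, 2.0, 3.0, 4.0], [1.0005, 2.0, 3.0, 4.0], 1e-3))
print(is_equal([1.0, 2.0, 3.0, 4.0], [1.01, 2.0, 3.0, 4.0], 1e-3))
```

The xor flips the four mask bits, so the zero flag is set only when every lane, W included, is within epsilon.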
« Last Edit: December 06, 2017, 09:23:36 pm by dicepd »

dicepd

« Reply #140 on: December 07, 2017, 12:40:50 am »
Jerome,
Here is the inc file for unix32 SSE, tested with a 100% pass rate for SSE, SSE3 and SSE4.

One thing I noticed is that I can't use movaps reliably in 32 bit.

I will be out all day tomorrow, so I won't get a chance to finish the AVX until Friday or Saturday.

It looks like the Win32 AVX code works just fine for unix32 as well. I just copied and renamed the file in preparation and gave it a blast through the tests: 100%, with no work to do.

Peter
« Last Edit: December 07, 2017, 12:57:33 am by dicepd »

SonnyBoyXXl

  • Jr. Member
  • **
  • Posts: 56
« Reply #141 on: December 07, 2017, 03:40:34 pm »
WOW, I'm impressed. I've not looked at this thread because I was away on a business trip, but it looks like I got a ball rolling :)

But I found some time to work on the translation of the DirectX Math lib. I can use much of your input :)
THX.

So after I have finished it, I will put the files online on GitHub to be available for everyone!

Best regards.

CuriousKit

  • Jr. Member
  • **
  • Posts: 78
« Reply #142 on: December 07, 2017, 04:35:23 pm »
Just one word of warning... OpenGL and DirectX handle their vectors and matrices differently.  Vectors are row-vectors in DirectX and column-vectors in OpenGL, matrices are row-major in DirectX and column-major in OpenGL, and transformations of vector arrays are performed by post-multiplying in DirectX, and pre-multiplying in OpenGL.

Ultimately, one is the complete transpose of the other.  If you're just passing the resultant vector array into a shader, you can get away with just using one set of functions - otherwise you have to be careful with the ordering in question and not blindly use the same set of functions for both APIs.
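
A small Python sketch of that transpose relationship, using plain nested lists. The helper names are hypothetical and the translation matrix is just an example. A row vector times a matrix gives the same numbers as the transposed matrix times a column vector.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def row_times_matrix(v, m):
    """Row-vector convention: v * M."""
    return [sum(v[k] * m[k][j] for k in range(len(v))) for j in range(len(m[0]))]

def matrix_times_col(m, v):
    """Column-vector convention: M * v."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

# Translation by (5, 6, 7) written for row vectors: the offsets sit in the last row.
m_row = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [5, 6, 7, 1],
]
v = [1, 2, 3, 1]

a = row_times_matrix(v, m_row)
b = matrix_times_col(transpose(m_row), v)
print(a, b)
```

Feeding the untransposed matrix to the other convention silently drops the translation, which is the kind of mistake the warning above is about.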

dicepd

« Reply #143 on: December 08, 2017, 06:42:12 pm »
Here is a first shot at a timing test framework, using FPCUnit again.

I have not finished all the tests yet, but have a look and see if there are other features wanted.

It outputs CSV for the spreadsheet-oriented people, GitHub markdown, and a Lazarus forum table (mislabelled as html at the moment).
You need to drop these files in the test dir, as it uses the TNativeGLZVector4f from the unit tests.

Here is example output, filtered down to four tests, for the forum table:
Compiler Flags: -CfSSE3, -Sv, -dUSE_ASM, -dCONFIG_1
| Test                 | Native   | Assembler
| Vector Op Add Vector | 0.239001 | 0.066999
| Vector Op Add Single | 0.553000 | 0.070000
| Add Vector To Self   | 0.105000 | 0.101000
| Add Single To Self   | 0.101000 | 0.099000

Peter


Edit: added the correct lpr; re-download the new zip.
« Last Edit: December 08, 2017, 07:40:44 pm by dicepd »

dicepd

« Reply #144 on: December 09, 2017, 09:07:35 am »
OK, getting on with the tests; I should have the 'one test to rule them all' finished sometime this weekend.

But as a brain teaser, it would seem the compiler is beating us hands down on certain functions, especially Length.

A few results.

Compiler Flags: -CfSSE3, -Sv, -dUSE_ASM, -dSSE_CONFIG_1
| Test          | Native   | Assembler
| Vector Length | 0.086000 | 0.233000
Compiler Flags: -CfSSE42, -Sv, -dUSE_ASM_SSE_4, -dSSE4_CONFIG_1
| Test          | Native   | Assembler
| Vector Length | 0.086000 | 0.101000
Compiler Flags: -CfAVX, -Sv, -dUSE_ASM_AVX, -dAVX_CONFIG_1
| Test          | Native   | Assembler
| Vector Length | 0.081000 | 0.095000

It would seem it has a trick up its sleeve where the code 'looks' worse but is more efficient.

Taking the nearest we got, which was the AVX code:

Ours
Code: Pascal
    vmovaps xmm0,[RDI]
    vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
    vmulps  xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    vsqrtss xmm0, xmm0, xmm0

The compiler's
Code: Pascal
        # Register rsp allocated
# Var $self located in register rax
# Var $result located in register xmm0
        # Register rdi,rax allocated
# [136] begin
        movq    %rdi,%rax
        # Register rdi released
        # Register xmm0 allocated
# [137] Result := Sqrt((Self.X * Self.X) +(Self.Y * Self.Y) +(Self.Z * Self.Z));
        vmovss  (%rax),%xmm0
        vmulss  %xmm0,%xmm0,%xmm1
        vmovss  4(%rax),%xmm0
        vmulss  %xmm0,%xmm0,%xmm0
        vaddss  %xmm1,%xmm0,%xmm1
        vmovss  8(%rax),%xmm0
        vmulss  %xmm0,%xmm0,%xmm0
        vaddss  %xmm1,%xmm0,%xmm0
        vsqrtss %xmm0,%xmm0,%xmm0
# Var $result located in register xmm0
        # Register rsp released
# [141] end;
        ret
        # Register xmm0 released

It would seem that the three [v]movss loads (each of which clears all the other bits in the register) are more efficient than two long fetches from memory.
In this function both native and asm leave the result in xmm0, so I can see no way that the compiler is optimising the loop differently.
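
As a cross-check that the two listings compute the same value, here is a small Python model of both. It is a sketch only: `haddps` is a hypothetical helper modelling the horizontal add, and it assumes cSSE_MASK_NO_W simply zeroes the W lane.

```python
import math

def haddps(a, b):
    """Horizontal add: pairwise sums of a in the low lanes, of b in the high lanes."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def length_hadd(v):
    masked = [v[0], v[1], v[2], 0.0]   # vandps with the no-W mask (assumed to clear W)
    sq = [x * x for x in masked]       # vmulps
    s = haddps(sq, sq)                 # first vhaddps
    s = haddps(s, s)                   # second vhaddps: every lane now holds x^2+y^2+z^2
    return math.sqrt(s[0])             # vsqrtss on lane 0

def length_scalar(v):
    # The compiler's version: three scalar loads, multiplies and adds.
    return math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])

v = [3.0, 4.0, 12.0, 99.0]
print(length_hadd(v), length_scalar(v))
```

Both routes give the same length, so the timing difference is about how the work is issued, not about what is computed.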

Edit:

I just modified the AVX inc to use the compiler's code, and now we get
Compiler Flags: -CfAVX, -Sv, -dUSE_ASM_AVX, -dAVX_CONFIG_1
| Test          | Native   | Assembler
| Vector Length | 0.083000 | 0.081000
which is probably the saving from the removal of 20 million movq %rdi,%rax instructions.


« Last Edit: December 09, 2017, 10:52:01 am by dicepd »

dicepd

« Reply #145 on: December 09, 2017, 01:12:56 pm »
So I looked at this a little more with Distance, where the asm was only just beating the native. I tried this as a test of my method of doing test coding inside the unit testing framework, as a proof of concept.

Code: Pascal
  {$ifdef TEST}
    vmovq xmm0, [rdi]            // load x, y only; upper lanes are zeroed
    vmovq xmm1, [A]
    vsubps xmm0, xmm0, xmm1
    vmulps xmm0, xmm0, xmm0
    vmovss xmm1, [rdi]8          // load z (byte offset 8) separately
    vmovss xmm2, [A]8
    vsubps xmm1, xmm1, xmm2
    vmulps xmm1, xmm1, xmm1
    vaddps xmm0, xmm0, xmm1
    vhaddps xmm0, xmm0, xmm0
    vsqrtss xmm0, xmm0, xmm0
  {$else}
    vmovaps xmm0,[RDI]
    vmovaps xmm1, [A]
    vsubps  xmm0, xmm0, xmm1
    vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
    vmulps  xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    vhaddps xmm0, xmm0, xmm0
    vsqrtss xmm0, xmm0, xmm0
  {$endif}

Code: Pascal
Compiler Flags: -CfAVX, -Sv, -O3 ,-dUSE_ASM_AVX, -dAVX_CONFIG_1
Test,            Native,   Assembler
Vector Distance, 0.104000, 0.096000
Vector Distance, 0.106000, 0.098000
Vector Distance, 0.104000, 0.096000
Vector Distance, 0.103001, 0.102000

Compiler Flags: -CfAVX, -Sv, -O3 ,-dUSE_ASM_AVX, -dTEST -dAVX_CONFIG_1_TEST
Vector Distance, 0.099999, 0.088001
Vector Distance, 0.104000, 0.090000
Vector Distance, 0.102000, 0.088000
Vector Distance, 0.104000, 0.088001
Vector Distance, 0.101000, 0.087000
Vector Distance, 0.102000, 0.088001

And we see a speed-up in code like this too. I did a few runs to verify that the new code was quicker.
Surprising results, TBH.
« Last Edit: December 09, 2017, 01:18:30 pm by dicepd »

BeanzMaster

« Reply #146 on: December 09, 2017, 02:38:20 pm »
Hi Peter, very cool timing test. I've only run some tests with SSE so far. One of the optimizations possible in 64 bit comes from our vectors being aligned, so for example we can write:

Code: Pascal
class operator TGLZVector4f.*(constref A, B: TGLZVector4f): TGLZVector4f; assembler; nostackframe; register;
asm
  movaps xmm0,[A]
  mulps  xmm0,[B]        // memory operand replaces the separate movaps xmm1,[B]
  movaps [RESULT], xmm0
end;

class operator TGLZVector4f.*(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler; nostackframe; register;
asm
  movaps xmm0,[A]
  movss  xmm1,[B]        // load the scalar first: shufps takes its two low lanes from the destination
  shufps xmm1,xmm1,$00   // broadcast B to all four lanes
  mulps  xmm0,xmm1
  movaps [RESULT], xmm0
end;

function TGLZVector4f.Negate:TGLZVector4f; assembler; nostackframe; register;
asm
  movaps xmm0,[RCX]
  xorps  xmm0,[RIP+cSSE_MASK_NEGATE]  // memory operand replaces the separate movaps xmm1,[...]
  movaps [RESULT],xmm0
end;

We can also optimize CrossProduct:

Code: Pascal
function TGLZVector4f.CrossProduct(constref A: TGLZVector4f): TGLZVector4f;assembler; nostackframe; register;
asm
  movaps xmm0,[RCX]                // xmm0 = v1
  movaps xmm1, [A]                 // xmm1 = v2 (needed in a register: shufps reads its low lanes from the destination)
  movaps xmm2, xmm0                // xmm2 = v1
  movaps xmm3, xmm1                // xmm3 = v2

  shufps xmm2, xmm0, $d2           // these two shuffles are moved up
  shufps xmm3, xmm1, $c9

  shufps xmm0, xmm0, $c9
  shufps xmm1, xmm1, $d2

  mulps  xmm0, xmm1
  mulps  xmm2, xmm3
  subps  xmm0, xmm2
  addps  xmm0, [rip+cWOnevector4f] // would be better replaced by a logical operator
  //movhlps xmm1,xmm0
  movaps [RESULT], xmm0            // return result
end;
  23.  

This code works with Win64, but perhaps not with Linux64.

I've also noticed that, with the functions that return Single, deleting the "movaps" can decrease performance (as with min/max and the compare functions).

For AVX, these are the right functions for Distance and Length. Your code is based on SSE3 instructions, so it is slower.

Code: Pascal
function TGLZVector4f.Distance(constref A: TGLZVector4f):Single;assembler; nostackframe; register;
// Result = xmm0
Asm
  vmovaps xmm0,[RCX]
  //vmovaps xmm1, [A]
  //vsubps  xmm0, xmm0, xmm1
  vsubps  xmm0, xmm0, [A]
  vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  vdpps xmm0, xmm0, xmm0, $FF
  vsqrtss xmm0, xmm0 , xmm0
  //  movss [RESULT], {%H-}xmm0
end;

function TGLZVector4f.Length:Single;assembler; nostackframe; register;
Asm
  vmovaps xmm0,[RCX]
  vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  vdpps xmm0, xmm0, xmm0, $FF
  vsqrtss xmm0, xmm0, xmm0
//  movss [RESULT], {%H-}xmm0
end;

One thing I've noticed is that the "operator to self" procedures and the min/max procedures are sometimes a little better and sometimes not (between -2% and +3% gain in speed), except for Dot, Cross, DivideBy2, Normalize...

| Test                      | Native   | Assembler | Gain in %
| Vector Op Subtract Vector | 0.114000 | 0.048000  | 57.895 %
| Vector Op Add Vector      | 0.118000 | 0.050000  | 57.627 %
| Vector Op Multiply Vector | 0.116000 | 0.049000  | 57.758 %
| Vector Op Divide Vector   | 0.136000 | 0.055000  | 59.559 %
| Vector Op Add Single      | 0.118000 | 0.050000  | 57.627 %
| Vector Op Subtract Single | 0.114000 | 0.051000  | 55.263 %
| Vector Op Multiply Single | 0.118000 | 0.051000  | 56.780 %
| Vector Op Divide Single   | 0.136000 | 0.055000  | 59.559 %
| Vector Op Negative        | 0.119000 | 0.048000  | 59.664 %
| Vector Op Equal           | 0.047000 | 0.042000  | 10.637 %
| Vector Op GT or Equal     | 0.049000 | 0.050000  | -2.042 %
| Vector Op LT or Equal     | 0.047000 | 0.043000  | 8.511 %
| Vector Op Greater         | 0.051000 | 0.050000  | 1.960 %
| Vector Op Less            | 0.048000 | 0.042000  | 12.501 %
| Vector Op Not Equal       | 0.120000 | 0.050000  | 58.334 %
| Add Vector To Self        | 0.090000 | 0.088000  | 2.222 %
| Sub Vector from Self      | 0.088000 | 0.088000  | 0.000 %
| Multiply Vector with Self | 0.088000 | 0.090000  | -2.273 %
| Divide Self by Vector     | 0.105000 | 0.107000  | -1.905 %
| Add Single To Self        | 0.091000 | 0.090000  | 1.098 %
| Sub Single from Self      | 0.088000 | 0.088000  | 0.000 %
| Multiply Self with single | 0.088000 | 0.089000  | -1.137 %
| Divide Self by single     | 0.105000 | 0.105000  | 0.001 %
| Invert Self               | 0.068000 | 0.066999  | 1.472 %
| Negate Self               | 0.068000 | 0.066999  | 1.472 %
| Self Abs                  | 0.067000 | 0.068000  | -1.493 %
| Self Normalize            | 0.410000 | 0.339000  | 17.317 %
| Self Divideby2            | 0.113000 | 0.093000  | 17.699 %
| Self CrossProduct Vector  | 0.275000 | 0.188000  | 31.636 %
| Self Min Vector           | 0.078000 | 0.068000  | 12.821 %
| Self Min Single           | 0.069000 | 0.068000  | 1.450 %
| Self Max Vector           | 0.080000 | 0.069000  | 13.749 %
| Self Max Single           | 0.067000 | 0.069000  | -2.985 %
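
The "Gain in %" column appears to be the time saved relative to the native time. A minimal Python sketch under that assumption (the function name `gain_percent` is hypothetical), checked against a few rows of the table:

```python
def gain_percent(native, assembler):
    """Time saved by the assembler version, as a percentage of the native time."""
    return (native - assembler) / native * 100.0

rows = [
    ("Vector Op Subtract Vector", 0.114, 0.048),
    ("Vector Op Add Vector", 0.118, 0.050),
    ("Sub Vector from Self", 0.088, 0.088),
]
for name, native, assembler in rows:
    print(f"{name}: {gain_percent(native, assembler):.3f} %")
```

A negative value means the assembler version was slower than the native one in that run.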


dicepd

« Reply #147 on: December 09, 2017, 02:47:34 pm »
Quote
For AVX, these are the right functions for Distance and Length. Your code is based on SSE3 instructions, so it is slower.

Not according to the tests I have done. Distance is ~10% quicker using the code above in tests. Whether that holds for other platforms I will have to wait and see. I suspect the speed-up comes from the memory access. All memory access in Linux64 already uses the aps variants; I removed all the movups variants. It crashes if you try to use non-aligned memory.
« Last Edit: December 09, 2017, 02:51:16 pm by dicepd »

dicepd

« Reply #148 on: December 09, 2017, 03:00:46 pm »
Anyway, my priority is to finish the test. I see you have added one of the features I had planned (Gain in %). The other I want to add is a report of the accuracy, to how many decimal places. That is probably more important with larger routines than the ones we are doing now.

dicepd

« Reply #149 on: December 09, 2017, 03:15:53 pm »
So I cut down the AVX Distance to just these five instructions:

Code: Pascal
    vmovaps xmm0,[RDI]
    vsubps  xmm0, xmm0, [A]
    vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
    vdpps xmm0, xmm0, xmm0, $FF
    vsqrtss xmm0, xmm0, xmm0

The code passes the functional test, but it is the slowest version so far?

Vector Distance, 0.101000, 0.115000

Native now beats it.
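
For reference, the way the vdpps immediate selects lanes can be modelled in Python. In this sketch the `dpps` helper is hypothetical and says nothing about speed. The high nibble of the immediate chooses which lane products are summed, and the low nibble chooses which result lanes receive the sum. Under the assumption that cSSE_MASK_NO_W simply zeroes the W lane, an immediate of $7F would give the same sum without the separate mask; that variant is not something tested in this thread.

```python
def dpps(a, b, imm8):
    """Model of (v)dpps on four lanes.
    Bits 4..7 of imm8 select which lane products enter the sum;
    bits 0..3 select which result lanes receive it (others become 0.0)."""
    total = sum(a[i] * b[i] for i in range(4) if imm8 & (0x10 << i))
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]

d = [3.0, 4.0, 12.0, 99.0]             # x, y, z differences plus an unwanted W
masked = [d[0], d[1], d[2], 0.0]       # after vandps with the no-W mask (assumed)

with_mask = dpps(masked, masked, 0xFF)  # what the listing above does
without_mask = dpps(d, d, 0x7F)         # W excluded by the immediate instead
print(with_mask, without_mask)
```

Both calls leave 169.0 in every lane, so the square root of lane 0 is the same in either case.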


 
