Lazarus

Free Pascal => FPC development => Topic started by: dzjorrit on May 25, 2016, 10:22:03 am

Title: AVX and SSE support question
Post by: dzjorrit on May 25, 2016, 10:22:03 am
Hi,
I have the following code:

const
  vectorsize = 4;
type
  tVector=array[0..vectorsize-1] of single;

function vectoradd(a,b:tVector):tVector;
begin
  result:=a+b;
end;
   
This compiles fine when SSE and vector processing are enabled.

But when I increase vectorsize to 8 and enable the AVX compiler options I get this error:
Compile Project, Target: project1.exe: Exit code 1, Errors: 1
unit1.pas(60,12) Error: Internal error 200610072

Is AVX not properly supported yet by fpc or is this a bug?

I'm using lazarus 64bit version 1.6 with fpc 3.0.0 on Windows 10 x64 with AMD A10 AVX enabled processor.

Thanks!
Jorrit




Title: Re: AVX and SSE support question
Post by: Thaddy on May 25, 2016, 10:51:56 am
Which FPC version are you using? That's really important, because AVX is only properly supported from 3.0 and higher.

Ah, I see, 3.0. In that case: how did you compile? An internal error should never happen and should be reported on bugs.freepascal.org. If you ever see an internal error it is a bug by definition.

When you file your bug report give as much information as possible and preferably a complete code example that reproduces the bug.
Title: Re: AVX and SSE support question
Post by: dzjorrit on May 25, 2016, 11:49:38 am
Ok, thanks, I will file a bug report soon. I used a new Lazarus project, adding only the code from my post, with these compiler options specified:
-O4
-CfAVX
-CpCOREAVX
-OpCOREAVX
-Sv
-XX
-CX
Title: Re: AVX and SSE support question
Post by: Pascal Fan on June 11, 2016, 05:46:05 am
I noticed something else related to the vector processing. I was playing around with this code this evening, and when I used the code posted in this thread with a vector size of 4, with SSE and vector processing enabled in FPC 3.0, I got the same internal error the original poster got when using AVX with a size of 8, if I did an "a xor b" operation instead of an "a + b" operation. The "a + b" operation shown in this thread works with SSE, but an xor operation triggers the internal error. I suspect this isn't correct: it seems like an xor operation should be possible, and in any event an internal error isn't the correct response. Anyhow, I thought I should mention it because it looks like there might be some issues with the SSE vector processing as well.
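To be concrete, this is a minimal sketch of the failing case (same declarations as the original post; the `vectorxor` name is just mine):

```pascal
const
  vectorsize = 4;
type
  tVector = array[0..vectorsize-1] of single;

// With -Sv and SSE enabled this triggers the same
// "Internal error 200610072" for me, where "a + b" compiles fine.
function vectorxor(a, b: tVector): tVector;
begin
  result := a xor b;
end;
```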
Title: Re: AVX and SSE support question
Post by: shobits1 on June 11, 2016, 06:13:26 am
maybe you should refrain from using -O4, since the compiler help screen contains the following:
Code:
  -O<x>  Optimizations:
      -O-        Disable optimizations
      -O1        Level 1 optimizations (quick and debugger friendly)
      -O2        Level 2 optimizations (-O1 + quick optimizations)
      -O3        Level 3 optimizations (-O2 + slow optimizations)
      -O4        Level 4 optimizations (-O3 + optimizations which might have unexpected side effects)

maybe I'm wrong.
Title: Re: AVX and SSE support question
Post by: Pascal Fan on June 11, 2016, 06:54:07 am
That's a totally valid point, but what I forgot to mention in my post is that my compiles were done with -O3, not the -O4 the original poster used. So I do think there might be a legitimate bug here. But you are absolutely correct that -O4 is probably not a good idea!
Title: Re: AVX and SSE support question
Post by: schuler on September 19, 2017, 05:02:14 pm
 :) Hello :)

I have exactly the same problem on FPC 3.0.2 32 bits/windows. Tried with -Cp and -Op COREAVX/COREAVX2 and PENTIUMM.

  :( I can't figure out my own login info at bugs.freepascal  :(

Title: Re: AVX and SSE support question
Post by: marcov on September 19, 2017, 05:07:29 pm
Did you check if something already existed ? :-)

https://bugs.freepascal.org/view.php?id=31612
Title: Re: AVX and SSE support question
Post by: Thaddy on September 19, 2017, 05:08:23 pm
Quote
:) Hello :)

I have exactly the same problem on FPC 3.0.2 32 bits/windows. Tried with -Cp and -Op COREAVX/COREAVX2 and PENTIUMM.

  :( I can't figure out my own login info at bugs.freepascal  :(

You have to specify -Sv, otherwise the compiler does not do anything interesting. Your vector also needs to be a two-dimensional vector atm, and you have to specify the alignment.
Title: Re: AVX and SSE support question
Post by: schuler on September 19, 2017, 08:58:52 pm
 :) Hello Thaddy,  :)
Thank you for paying attention. Yes, I do have -Sv.

This is a known bug:
https://bugs.freepascal.org/view.php?id=31612

https://bugs.freepascal.org/view.php?id=30186

In case anyone is interested, I've just tried this with success:

Code: Pascal
{$ASMMODE intel}

type
  Single8 = record a, b, c, d, x, y, z, w: Single end;

procedure testAsm2();
var
  A: Single8;
  AA: array[0..2] of Single8;
const
  ElSize = SizeOf(Single8);
begin
  A.x := 1;
  A.y := 10;
  A.z := 100;
  A.w := 1000;
  A.a := 1000;
  A.b := 1000;
  A.c := 1000;
  A.d := 1000;

  asm
    vmovups ymm0,A
    vmovups ymm1,A
    vaddps ymm0, ymm0, ymm1
    vmovups A,ymm0
    vmovups AA[1*ElSize],ymm0
    vmovups AA[2*ElSize],ymm0
  end;
  WriteLn(A.x:6:4,' ',A.y:6:4,' ', A.z:6:4,' ', A.w:6:4);

  AA[1].y := 12;
  AA[2].z := 14;

  WriteLn(AA[1].x:6:4,' ',AA[1].y:6:4,' ', AA[1].z:6:4,' ', AA[1].w:6:4);
  WriteLn(AA[2].x:6:4,' ',AA[2].y:6:4,' ', AA[2].z:6:4,' ', AA[2].w:6:4);

  asm
    vmovups ymm0,AA[1*ElSize]
    vmovups ymm1,AA[2*ElSize]
    vaddps ymm0, ymm0, ymm1
    vmovups AA[0*ElSize],ymm0
  end;
  WriteLn(AA[0].x:6:4,' ',AA[0].y:6:4,' ', AA[0].z:6:4,' ', AA[0].w:6:4);
end;
Title: Re: AVX and SSE support question
Post by: SonnyBoyXXl on November 02, 2017, 01:30:28 pm
This is an interesting topic.
I'm currently working on the translation of the DirectXMath units since
Quote
"The math functions of the D3DX utility library are deprecated for Windows 8. We recommend that you use DirectXMath instead."

I now tried the following code with these compiler settings:
-al
-CfAVX2
-CpCOREAVX2
-O3
-Sv
-OpCOREAVX2
-OoFASTMATH


Code: Pascal
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
var
  r: TXMFLOAT4;
begin
  result.x := a.x + b.x;
  result.y := a.y + b.y;
  result.z := a.z + b.z;
  result.w := a.w + b.w;
end;

This produces the following assembler code, which looks quite inefficient:
Code: Pascal
# [292] begin
        pushl   %ebx
        pushl   %esi
        pushl   %edi
        leal    -56(%esp),%esp
.Lc41:
# Var a located at esp+0, size=OS_32
# Var b located at esp+4, size=OS_32
# Var r located at esp+8, size=OS_NO
        movl    %eax,(%esp)
        movl    %edx,4(%esp)
        movl    %ecx,%ebx
# Var $result located in register ebx
        movl    (%esp),%esi
        leal    24(%esp),%edi
        movl    $4,%ecx
        rep
        movsl
        movl    4(%esp),%esi
        leal    40(%esp),%edi
        movl    $4,%ecx
        rep
        movsl
.Ll61:
        movl    %ebx,%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
        leal    8(%esp),%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
.Ll62:
# [301] result.x:=a.x+b.x;
        vmovss  24(%esp),%xmm0
        vaddss  40(%esp),%xmm0,%xmm0
        vmovss  %xmm0,(%ebx)
.Ll63:
# [302] result.y:=a.y+b.y;
        vmovss  28(%esp),%xmm0
        vaddss  44(%esp),%xmm0,%xmm0
        vmovss  %xmm0,4(%ebx)
.Ll64:
# [303] result.z:=a.z+b.z;
        vmovss  32(%esp),%xmm0
        vaddss  48(%esp),%xmm0,%xmm0
        vmovss  %xmm0,8(%ebx)
.Ll65:
# [304] result.w:=a.w+b.w;
        vmovss  36(%esp),%xmm0
        vaddss  52(%esp),%xmm0,%xmm0
        vmovss  %xmm0,12(%ebx)
.Ll66:
# [305] end;
        leal    56(%esp),%esp
        popl    %edi
        popl    %esi
        popl    %ebx
        ret


I now tried to adapt it as follows:

Code: Pascal
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
var
  r: TXMFLOAT4;
begin
  asm
    vmovups xmm0,a
    vmovups xmm1,b
    vaddps xmm1, xmm0, xmm1
    vmovups r,xmm1
  end;
  result := r;
end;

giving this assembler code

Code: Pascal
# [292] begin
        pushl   %ebp
.Lc41:
.Lc42:
        movl    %esp,%ebp
.Lc43:
        leal    -60(%esp),%esp
        pushl   %esi
        pushl   %edi
# Var a located at ebp-4, size=OS_32
# Var b located at ebp-8, size=OS_32
# Var $result located at ebp-12, size=OS_32
# Var r located at ebp-28, size=OS_NO
        movl    %eax,-4(%ebp)
        movl    %edx,-8(%ebp)
        movl    %ecx,-12(%ebp)
        movl    -4(%ebp),%eax
        leal    -44(%ebp),%edi
        movl    %eax,%esi
        movl    $4,%ecx
        rep
        movsl
        movl    -8(%ebp),%esi
        leal    -60(%ebp),%edi
        movl    $4,%ecx
        rep
        movsl
.Ll61:
        movl    -12(%ebp),%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
        leal    -28(%ebp),%eax
        movb    $85,%cl
        movl    $16,%edx
        call    fpc_fillmem
#  CPU COREAVX2
.Ll62:
# [294] vmovups xmm0,a
        vmovups -44(%ebp),%xmm0
.Ll63:
# [295] vmovups xmm1,b
        vmovups -60(%ebp),%xmm1
.Ll64:
# [296] vaddps xmm1, xmm0, xmm1
        vaddps  %xmm1,%xmm0,%xmm1
.Ll65:
# [297] vmovups r,xmm1
        vmovups %xmm1,-28(%ebp)
#  CPU COREAVX2
.Ll66:
# [299] result:=r;
        movl    -12(%ebp),%edi
        leal    -28(%ebp),%esi
        movl    $4,%ecx
        rep
        movsl
.Ll67:
# [304] end;
        popl    %edi
        popl    %esi
        movl    %ebp,%esp
        popl    %ebp
        ret

which seems quite a bit more efficient, since the math part is done with one call.
Is any more optimization possible?

And most of all: is assembling manually really necessary, or are there any possibilities with FPC itself?

Thanks!
Title: Re: AVX and SSE support question
Post by: schuler on November 09, 2017, 10:30:22 pm
Hello Sonny.

You may find (or not - not sure) some inspiration with AVX + FPC here:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas

Plus some details here:
https://www.youtube.com/watch?v=qGnfwpKUTIQ
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 18, 2017, 10:44:35 am
Code: Pascal
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
var
  r: TXMFLOAT4;
begin
  result.x := a.x + b.x;
  result.y := a.y + b.y;
  result.z := a.z + b.z;
  result.w := a.w + b.w;
end;

I'm assuming your translation of XMFLOAT4 is a record type? Try declaring the function like this instead:

Code: Pascal
class operator TXMFLOAT4.Add(constref A, B: TXMFLOAT4): TXMFLOAT4;
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 18, 2017, 02:22:49 pm
Quote
Hello Sonny.

You may find (or not - not sure) some inspiration with AVX + FPC here:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas

Plus some details here:
https://www.youtube.com/watch?v=qGnfwpKUTIQ

Hi to all,
Very interesting subject. As I supposed, using asm with arrays of values increases speed.
But if the operation consists of just one operation (e.g. C = A + B), the FPC compiler optimizes the code in a better way, according to my tests.
With arrays that does not seem to be the case  8-)

I really need to do more research on SSE and AVX in this area to improve GLScene's VectorMaths units.
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 18, 2017, 04:10:45 pm
Ok, I just tested out my suggestion for Sonny with code that looks like this:

Code: Pascal
program DXMathTest;

{$modeswitch AdvancedRecords}

type
  TXMFloat4 = record
    X, Y, Z, W: Single;
    class operator Add(constref A, B: TXMFloat4): TXMFloat4; inline;
  end;

  class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
  begin
    with Result do
    begin
      X := A.X + B.X;
      Y := A.Y + B.Y;
      Z := A.Z + B.Z;
      W := A.W + B.W;
    end;
  end;

begin
end.

After building it with the same compiler flags he said he was using, as I expected, the added "constref" makes the assembler output much more reasonable:

Code: Pascal
# [10] begin
        movq    %rcx,%rax
# Var $result located in register rax
# Var A located in register rdx
# Var B located in register r8
# [13] X := A.X + B.X;
        vmovss  (%rdx),%xmm0
        vaddss  (%r8),%xmm0,%xmm0
        vmovss  %xmm0,(%rax)
# [14] Y := A.Y + B.Y;
        vmovss  4(%rdx),%xmm0
        vaddss  4(%r8),%xmm0,%xmm0
        vmovss  %xmm0,4(%rax)
# [15] Z := A.Z + B.Z;
        vmovss  8(%rdx),%xmm0
        vaddss  8(%r8),%xmm0,%xmm0
        vmovss  %xmm0,8(%rax)
# [16] W := A.W + B.W;
        vmovss  12(%rdx),%xmm0
        vaddss  12(%r8),%xmm0,%xmm0
        vmovss  %xmm0,12(%rax)
# [18] end;
        ret

Moral of the story: always pass record types as "constref" (or at the very least "const") anywhere it's possible to do so, in order to avoid making copies of them every time the method is called.
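In declaration form the difference is just this (a sketch; the AddByValue/AddByRef names are hypothetical):

```pascal
type
  TXMFloat4 = record
    X, Y, Z, W: Single;
  end;

// By value: every call copies both 16-byte records onto the stack
// (the rep movsl / fpc_fillmem sequences seen earlier in the thread).
function AddByValue(A, B: TXMFloat4): TXMFloat4;

// constref: each argument is passed as a single pointer; no copies made.
function AddByRef(constref A, B: TXMFloat4): TXMFloat4;
```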
Title: Re: AVX and SSE support question
Post by: marcov on November 18, 2017, 08:35:55 pm
FYI, my earlier non-working attempts:

Code: Pascal
program DXMathTest;

{$mode delphi}

type
  TXMFloat4 = record
    sX: array[0..3] of single;
    class operator Add(constref A, B: TXMFloat4): TXMFloat4; inline;
    property x : single read sX[0] write sX[0];
    property y : single read sX[1] write sX[1];
    property z : single read sX[2] write sX[2];
    property w : single read sX[3] write sX[3];
  end;

  class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
  begin
    result.sX := a.sX + b.sX;
  end;

var
  x, y, z : TXMFloat4;
begin
  x.x:=1; x.y:=2; x.z:=3; x.w:=4;
  y.x:=5; y.y:=6; y.z:=7; y.w:=8;
  z:=x+y;
  writeln(z.x);
end.

and compile with

fpc -CfAVX -CpCOREAVX  -O3 -Sv -OpCOREAVX -OoFASTMATH avstest4 -Si -al

It doesn't work because no code generation is done for storing the value in result: a+b goes fine, but assigning that to result (z) goes wrong, both inlined and not:

Code: Pascal
# [22] z:=x+y;
        movdqa  U_$P$DXMATHTEST_$$_X(%rip),%xmm0
        addps   U_$P$DXMATHTEST_$$_Y(%rip),%xmm0
        vmovups 40(%rsp),%xmm0
        vmovups %xmm0,U_$P$DXMATHTEST_$$_Z(%rip)
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 18, 2017, 09:53:23 pm
Interesting! Definitely seems to work fine with free-standing values in the record (not properties), though. I also just tested the following version which makes it a variant record with both free values and a static array:

Code: Pascal
program DXMathTest;

{$mode Delphi}

uses
  SysUtils;

type
  TXMFloat4 = record
    class operator Add(constref A, B: TXMFloat4): TXMFloat4; inline;
    case Byte of
      0: (X, Y, Z, W: Single);
      1: (V4: array[0..3] of Single);
  end;

  class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
  begin
    with Result do
    begin
      X := A.X + B.X;
      Y := A.Y + B.Y;
      Z := A.Z + B.Z;
      W := A.W + B.W;
    end;
  end;

const
  A: TXMFloat4 = (X: 5.0; Y: 5.0; Z: 5.0; W: 0.5);
  B: TXMFloat4 = (X: 2.5; Y: 2.5; Z: 2.5; W: 0.5);

var
  C: TXMFloat4;

begin
  C := A + B;
  with C do
  begin
    WriteLn(X.ToString + #32 + Y.ToString
            + #32 + Z.ToString + #32 + W.ToString + #13);
    WriteLn(V4[0].ToString + #32 + V4[1].ToString
            + #32 + V4[2].ToString + #32 + V4[3].ToString);
  end;
  ReadLn;
end.


Used the same compiler settings as before and had no issues there, either. Both WriteLns print out ('7.5, 7.5, 7.5, 1') as expected.
Title: Re: AVX and SSE support question
Post by: marcov on November 18, 2017, 09:58:19 pm
You have to add the array form to make -Sv work, so that the add is one instruction for all 4 singles.
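For Sonny's case that means something like this (a sketch, assuming the variant-record layout with the V4 array member from Akira's example; note that per my earlier test the store into result may still be miscompiled):

```pascal
class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
begin
  // Adding the whole arrays lets -Sv turn this into one packed
  // add covering all 4 singles, instead of four scalar adds.
  Result.V4 := A.V4 + B.V4;
end;
```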
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 18, 2017, 10:11:31 pm
Ah, didn't realize you were focusing specifically on the -Sv functionality there. Also didn't know -Sv was limited to arrays... I was under the impression that it was more of a general "hint" for the compiler to attempt to auto-vectorize where possible.

Still though, even without -Sv, for Sonny's purposes I think simply adding constref to his existing method will get him a lot closer performance-wise to where he wants (while still actually working), as my test showed. From 50-something lines of ASM down to 23 isn't bad at all IMO.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 19, 2017, 11:07:25 pm
Hi, I've played a little bit with SSE and AVX.

This is the code I've used:

Code: Pascal
type
  TBZVector4f = packed record
    case integer of
      0: (V: array[0..3] of Single);
      1: (X, Y, Z, W: single);
  end;
  TBZVector = TBZVector4f;

function nc_VectorAdd(constref AVector, AVector2: TBZVector): TBZVector;
begin
  result.x := AVector.x + AVector2.x;
  result.y := AVector.y + AVector2.y;
  result.z := AVector.z + AVector2.z;
  result.w := AVector.w + AVector2.w;
end;

function asm_sse_VectorAdd(constref V1, V2: TBZVector): TBZVector; assembler; nostackframe; register;
asm
  movups xmm0,[V1]
  movups xmm1,[V2]
  addps xmm0,xmm1
  movups [RESULT], XMM0
end;

function asm_avx_VectorAdd(constref V1, V2: TBZVector): TBZVector; assembler; nostackframe; register;
asm
  vmovups xmm0,[V1]
  vmovups xmm1,[V2]
  vaddps xmm0,xmm1, xmm0
  vmovups [RESULT], XMM0
end;

The code for the test is a simple loop, with the two vectors initialized beforehand of course, like this:

Code: Pascal
v1 := VectorMake(1.198,1.264,1.387);
v2 := VectorMake(2.542,2.289,2.311);

For i := 0 to 9999999 do
begin
  V := asm_avx_VectorAdd(V1,V2);
end;

and with these compiler options:

Quote
   
    -al
    -O3
    -Sv
    -OoFASTMATH
    -CfAVX
    -CpCOREAVX
    -OpCOREAVX
    -CPPACKRECORD=8

Finally, the results:

    - AVX     : 14318.3959415555 µs
    - SSE     : 15241.6712796688 µs
    - NATIVE  : 23578.505629003 µs

In conclusion, without these options:

Quote

 -CfAVX
 -CpCOREAVX
 -OpCOREAVX

1st - The SSE performance falls off, and native code is better.
2nd - Without "nostackframe" and "register" the performance decreases (both with SSE and AVX).
3rd - Using movaps/vmovaps instead of movups/vmovups does not make a big difference.

I haven't checked the assembly output on Windows yet, so... next time.
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 20, 2017, 12:38:18 am
You could even use the assembler approach within an operator to replace the apparently buggy -Sv functionality, by the way. I tested your asm (with VMOVAPS instead of VMOVUPS, as the alignment is already known/correct by default in FPC) with my example from yesterday, and the results were still valid/exactly as expected:

Code: Pascal
program DXMathTest;

{$modeswitch AdvancedRecords}

uses
  SysUtils;

type
  TXMFloat4 = record
    class operator +(constref A, B: TXMFloat4): TXMFloat4; assembler;
    case Byte of
      0: (X, Y, Z, W: Single);
      1: (V4: array[0..3] of Single);
  end;

  class operator TXMFloat4.+(constref A, B: TXMFloat4): TXMFloat4; assembler;
  asm
    VMOVAPS XMM0,[A]
    VMOVAPS XMM1,[B]
    VADDPS XMM0,XMM1, XMM0
    VMOVAPS [RESULT], XMM0
  end;

const
  A: TXMFloat4 = (X: 5.0; Y: 5.0; Z: 5.0; W: 0.5);
  B: TXMFloat4 = (X: 2.5; Y: 2.5; Z: 2.5; W: 0.5);

var
  C: TXMFloat4;

begin
  C := A + B;
  with C do
  begin
    WriteLn(X.ToString, #32, Y.ToString,
            #32, Z.ToString, #32, W.ToString);
    WriteLn(V4[0].ToString, #32, V4[1].ToString,
            #32, V4[2].ToString, #32, V4[3].ToString);
  end;
  ReadLn;
end.

Note that I didn't declare the record as packed, as doing so (at least in this case) doesn't actually seem to do anything at all. The assembler output with it packed and unpacked was completely identical.

I used slightly different, simpler compiler options this time, as well:
Code: Pascal
-al -CfAVX2 -CpCOREAVX2 -O4 -OpCOREAVX2

That being said though, it certainly seems like -Sv is very very close to working properly and probably isn't missing a huge amount of code. Anyone know exactly where in the compiler codebase it's implemented?
Title: Re: AVX and SSE support question
Post by: marcov on November 20, 2017, 07:04:33 am
And -Sv can be inlined, so sequences of such operations would be more optimal.

I don't know much about compiler internals, but I usually look up the command-line parsing in options.pas to see what the switch (-Sv) sets, and then grep for that.

Title: Re: AVX and SSE support question
Post by: Nitorami on November 20, 2017, 10:11:26 am
As to the -Sv switch, I learned that it was introduced a few years ago, but there is nobody to maintain it, and there are no test cases, so it is probably broken. At least this is what Jonas said somewhere in the bug tracker on this topic. All -Sv seems to do is give you a false impression of working, because it makes the compiler accept operations with vectors of floats, but the result is garbage. If anyone can produce a working example using the -Sv switch, I would be very interested.
Title: Re: AVX and SSE support question
Post by: Thaddy on November 20, 2017, 10:26:18 am
-Sv *does* work, but I have to test whether it works with anything other than a fixed array with a size that is a power of two.
I don't have any code for that yet, because I assumed ...it had to be. I only use it for audio.
If you examine the assembler output (-s), at least on X64 and armhf, you will see it definitely works.
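The fixed power-of-two case I mean is essentially the code from the first post:

```pascal
const
  vectorsize = 4;
type
  tVector = array[0..vectorsize-1] of single;  // fixed size, power of two

// With -Sv and SSE enabled, the assembler output should show the
// add emitted as a packed SSE instruction rather than four scalar adds.
function vectoradd(a, b: tVector): tVector;
begin
  result := a + b;
end;
```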
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 20, 2017, 04:42:26 pm
Hi, I've made another test based on the code by Akira:

Code: Pascal  [Select][+][-]
  1.  
  2. Unit Unit1;
  3.  
  4. {$mode objfpc}{$H+}
  5. {$DEFINE USE_ASM}
  6. {.$DEFINE USE_SSE_ASM}
  7. {$DEFINE USE_AVX_ASM}
  8. {$MODESWITCH ADVANCEDRECORDS}
  9.  
  10. Interface
  11.  
  12. Uses
  13.   Classes, Sysutils, Fileutil, Forms, Controls, Graphics, Dialogs, StdCtrls;
  14.  
  15.  
  16.  
  17. Type
  18.   { Tform1 }
  19.   Tform1 = Class(Tform)
  20.     Button1 : Tbutton;
  21.     Memo1 : Tmemo;
  22.     Procedure Button1click(Sender : Tobject);
  23.   Private
  24.  
  25.   Public
  26.  
  27.   End;
  28.  
  29. type
  30.   TGLZVector3fType = array[0..2] of Single;
  31.   TGLZVector4fType = array[0..3] of Single;
  32.  
  33.   TGLZVector3f = record
  34.     case Byte of
  35.       0: (X, Y, Z: Single);
  36.       1: (V: TGLZVector3fType);
  37.   End;
  38.  
  39.   TGLZVector4f = record
  40.     public
  41.       procedure Create(Const aX,aY,aZ,aW : Single);
  42.       function ToString : String;
  43.  
  44.       class operator +(constref A, B: TGLZVector4f): TGLZVector4f; overload;
  45.       class operator -(constref A, B: TGLZVector4f): TGLZVector4f; overload;
  46.       class operator *(constref A, B: TGLZVector4f): TGLZVector4f; overload;
  47.       class operator /(constref A, B: TGLZVector4f): TGLZVector4f; overload;
  48.  
  49.       class operator +(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
  50.       class operator -(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
  51.       class operator *(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
  52.       class operator /(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
  53.  
  54.       case Byte of
  55.         0: (X, Y, Z, W: Single);
  56.         1: (V: TGLZVector4fType);
  57.         2: (AsVector3f : TGLZVector3f);
  58.   end;
  59.  
  60.  
  61. Var
  62.   Form1 : Tform1;
  63.  
  64. Implementation
  65.  
  66. {$R *.lfm}
  67.  
  68. { Tform1 }
  69.  
  70. Procedure Tform1.Button1click(Sender : Tobject);
  71. Var
  72.   V1, V2, V3 : TGLZVector4f;
  73.   Float : Single;
  74. Begin
  75.   Float := 1.5;
  76.   v1.Create(5.0,5.0,5.0,0.5);
  77.   v2.Create(2.5,2.5,2.5,0.5);
  78.  
  79.   Memo1.Lines.Add('V1 = '+v1.ToString);
  80.   Memo1.Lines.Add('V2 = '+v2.ToString);
  81.   Memo1.Lines.Add('Float = 1.5');
  82.   Memo1.Lines.Add('');
  83.   Memo1.Lines.Add('Operations : ');
  84.   Memo1.Lines.Add('----------------------------------------');
  85.   V3 := V1 + V2;
  86.   Memo1.Lines.Add('V3 = V1 + V2 = '+v3.ToString);
  87.   V3 := V1 - V2;
  88.   Memo1.Lines.Add('V3 = V1 - V2 = '+v3.ToString);
  89.   V3 := V1 * V2;
  90.   Memo1.Lines.Add('V3 = V1 * V2 = '+v3.ToString);
  91.   V3 := V1 / V2;
  92.   Memo1.Lines.Add('V3 = V1 / V2 = '+v3.ToString);
  93.   Memo1.Lines.Add('----------------------------------------');
  94.   V3 := V1 + Float;
  95.   Memo1.Lines.Add('V3 = V1 + Float = '+v3.ToString);
  96.   V3 := V1 - Float;
  97.   Memo1.Lines.Add('V3 = V1 - Float = '+v3.ToString);
  98.   V3 := V1 * Float;
  99.   Memo1.Lines.Add('V3 = V1 * Float = '+v3.ToString);
  100.   V3 := V1 / Float;
  101.   Memo1.Lines.Add('V3 = V1 / Float = '+v3.ToString);
  102. End;
  103.  
  104.  
  105. procedure TGLZVector4f.Create(Const aX,aY,aZ,aW : Single);
  106. begin
  107.    Self.X := AX;
  108.    Self.Y := AY;
  109.    Self.Z := AZ;
  110.    Self.W := AW;
  111. end;
  112.  
  113. function TGLZVector4f.ToString : String;
  114. begin
  115.    Result := '(X: '+FloattoStrF(Self.X,fffixed,5,5)+
  116.             ' ,Y: '+FloattoStrF(Self.Y,fffixed,5,5)+
  117.             ' ,Z: '+FloattoStrF(Self.Z,fffixed,5,5)+
  118.             ' ,W: '+FloattoStrF(Self.W,fffixed,5,5)+')';
  119. End;
  120.  
  121. {$IFDEF USE_ASM}
  122. {$IFDEF USE_AVX_ASM}
  123. class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  124. asm
  125.   VMOVUPS XMM0,[A]
  126.   VMOVUPS XMM1,[B]
  127.   VADDPS  XMM0,XMM1, XMM0
  128.   VMOVUPS [RESULT], XMM0
  129. end;
  130.  
  131. class operator TGLZVector4f.-(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  132. asm
  133.   VMOVUPS XMM0,[A]
  134.   VMOVUPS XMM1,[B]
  135.   VSUBPS  XMM0,XMM1, XMM0
  136.   VMOVUPS [RESULT], XMM0
  137. end;
  138.  
  139. class operator TGLZVector4f.*(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  140. asm
  141.   VMOVUPS XMM0,[A]
  142.   VMOVUPS XMM1,[B]
  143.   VMULPS  XMM0,XMM1, XMM0
  144.   VMOVUPS [RESULT], XMM0
  145. end;
  146.  
  147. class operator TGLZVector4f./(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  148. asm
  149.   VMOVUPS XMM0,[A]
  150.   VMOVUPS XMM1,[B]
  151.   VDIVPS  XMM0,XMM1, XMM0
  152.   VMOVUPS [RESULT], XMM0
  153. end;
  154.  
  155. class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  156. asm
  157.   VMOVUPS XMM0,[A]
  158.   VMOVSS  XMM1,[B]
  159.   VSHUFPS XMM1, XMM1, XMM1,0
  160.   VADDPS  XMM0,XMM1, XMM0
  161.   VMOVUPS [RESULT], XMM0
  162. end;
  163.  
  164. class operator TGLZVector4f.-(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  165. asm
  166.   VMOVUPS XMM0,[A]
  167.   VMOVSS  XMM1,[B]
  168.   VSHUFPS XMM1, XMM1, XMM1,0
  169.   VSUBPS  XMM0,XMM1, XMM0
  170.   VMOVUPS [RESULT], XMM0
  171. end;
  172.  
  173. class operator TGLZVector4f.*(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  174. asm
  175.   VMOVUPS XMM0,[A]
  176.   VMOVSS  XMM1,[B]
  177.   VSHUFPS XMM1, XMM1, XMM1,0
  178.   VMULPS  XMM0,XMM1, XMM0
  179.   VMOVUPS [RESULT], XMM0
  180. end;
  181.  
  182. class operator TGLZVector4f./(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  183. asm
  184.   VMOVUPS XMM0,[A]
  185.   VMOVSS  XMM1,[B]
  186.   VSHUFPS XMM1, XMM1, XMM1,0
  187.   VDIVPS  XMM0,XMM1, XMM0
  188.   VMOVUPS [RESULT], XMM0
  189. end;
  190.  
  191. {$ENDIF}
  192.  
  193. {$IFDEF USE_SSE_ASM}
  194. class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  195. asm
  196.   MOVUPS XMM0,[A]
  197.   MOVUPS XMM1,[B]
  198.   ADDPS  XMM0,XMM1
  199.   MOVUPS [RESULT], XMM0
  200. end;
  201.  
  202. class operator TGLZVector4f.-(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  203. asm
  204.   MOVUPS XMM0,[A]
  205.   MOVUPS XMM1,[B]
  206.   SUBPS  XMM0,XMM1
  207.   MOVUPS [RESULT], XMM0
  208. end;
  209.  
  210. class operator TGLZVector4f.*(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  211. asm
  212.   MOVUPS XMM0,[A]
  213.   MOVUPS XMM1,[B]
  214.   MULPS  XMM0,XMM1
  215.   MOVUPS [RESULT], XMM0
  216. end;
  217.  
  218. class operator TGLZVector4f./(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
  219. asm
  220.   MOVUPS XMM0,[A]
  221.   MOVUPS XMM1,[B]
  222.   DIVPS  XMM0,XMM1
  223.   MOVUPS [RESULT], XMM0
  224. end;
  225.  
  226. class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  227. asm
  228.   MOVUPS XMM0,[A]
  229.   MOVSS  XMM1,[B]
  230.   SHUFPS XMM1, XMM1,0
  231.   ADDPS  XMM0,XMM1
  232.   MOVUPS [RESULT], XMM0
  233. end;
  234.  
  235. class operator TGLZVector4f.-(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  236. asm
  237.   MOVUPS XMM0,[A]
  238.   MOVSS  XMM1,[B]
  239.   SHUFPS XMM1, XMM1,0
  240.   SUBPS  XMM0,XMM1
  241.   MOVUPS [RESULT], XMM0
  242. end;
  243.  
  244. class operator TGLZVector4f.*(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  245. asm
  246.   MOVUPS XMM0,[A]
  247.   MOVSS  XMM1,[B]
  248.   SHUFPS XMM1, XMM1,0
  249.   MULPS  XMM0,XMM1
  250.   MOVUPS [RESULT], XMM0
  251. end;
  252.  
  253. class operator TGLZVector4f./(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
  254. asm
  255.   MOVUPS XMM0,[A]
  256.   MOVSS  XMM1,[B]
  257.   SHUFPS XMM1, XMM1,0
  258.   DIVPS  XMM0,XMM1
  259.   MOVUPS [RESULT], XMM0
  260. end;
  261.  
  262.  
  263. {$ENDIF}
  264. {$ELSE}
  265. class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f;
  266. begin
  267.   Result.X := A.X + B.X;
  268.   Result.Y := A.Y + B.Y;
  269.   Result.Z := A.Z + B.Z;
  270.   Result.W := A.W + B.W;
  271. end;
  272. {$ENDIF}            
  273.  
Results with :

V1 = (X: 5.00000 ,Y: 5.00000 ,Z: 5.00000 ,W: 0.50000)
V2 = (X: 2.50000 ,Y: 2.50000 ,Z: 2.50000 ,W: 0.50000)
Float = 1.5

SSE Operations :
----------------------------------------
V3 = V1 + V2 = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 1.00000) OK
V3 = V1 - V2 = (X: 2.50000 ,Y: 2.50000 ,Z: 2.50000 ,W: 0.00000) OK
V3 = V1 * V2 = (X: 12.50000 ,Y: 12.50000 ,Z: 12.50000 ,W: 0.25000) OK
V3 = V1 / V2 = (X: 2.00000 ,Y: 2.00000 ,Z: 2.00000 ,W: 1.00000) OK
----------------------------------------
V3 = V1 + Float = (X: 6.50000 ,Y: 6.50000 ,Z: 6.50000 ,W: 2.00000) OK
V3 = V1 - Float = (X: 3.50000 ,Y: 3.50000 ,Z: 3.50000 ,W: -1.00000) OK
V3 = V1 * Float = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 0.75000) OK
V3 = V1 / Float = (X: 3.33333 ,Y: 3.33333 ,Z: 3.33333 ,W: 0.33333) OK
AVX Operations :
----------------------------------------
V3 = V1 + V2 = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 1.00000) OK
V3 = V1 - V2 = (X: -2.50000 ,Y: -2.50000 ,Z: -2.50000 ,W: 0.00000) NOK
V3 = V1 * V2 = (X: 12.50000 ,Y: 12.50000 ,Z: 12.50000 ,W: 0.25000) OK
V3 = V1 / V2 = (X: 0.50000 ,Y: 0.50000 ,Z: 0.50000 ,W: 1.00000) NOK
----------------------------------------
V3 = V1 + Float = (X: 6.50000 ,Y: 6.50000 ,Z: 6.50000 ,W: 2.00000) OK
V3 = V1 - Float = (X: -3.50000 ,Y: -3.50000 ,Z: -3.50000 ,W: 1.00000) NOK
V3 = V1 * Float = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 0.75000) OK
V3 = V1 / Float = (X: 0.30000 ,Y: 0.30000 ,Z: 0.30000 ,W: 3.00000) NOK

As you can see, subtraction and division give wrong results with AVX; with sub it gives negated values  %)
I have probably forgotten something that I do not understand yet  :-[

If someone can explain where my error is, that would be cool.

Thanks

PS: Tested with  : -al -CfAVX2 -CpCOREAVX2 -O4 -OpCOREAVX2
and  -al -CfAVX -CpCOREAVX -O4 -OpCOREAVX
Title: Re: AVX and SSE support question
Post by: Thaddy on November 20, 2017, 04:53:29 pm
Well, apart from the inconsequential command line.....  >:D
Your problem is with V3.
It would be really helpful if you examine the actual assembler output. So: -a option and examine .s

Also note you need to pack both arrays and records...
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 20, 2017, 05:20:25 pm
You just need to invert A and B in the assembler for subtraction and division. Basically it should look like this:

Code: Pascal  [Select][+][-]
  1.   class operator TXMFloat4.+(constref A, B: TXMFloat4): TXMFloat4; assembler;
  2.   asm
  3.     VMOVAPS XMM0,[A]
  4.     VMOVAPS XMM1,[B]
  5.     VADDPS XMM0,XMM1, XMM0
  6.     VMOVAPS [RESULT], XMM0
  7.   end;
  8.  
  9.   class operator TXMFloat4.-(constref A, B: TXMFloat4): TXMFloat4; assembler;
  10.   asm
  11.     VMOVAPS XMM0,[B]
  12.     VMOVAPS XMM1,[A]
  13.     VSUBPS XMM0,XMM1, XMM0
  14.     VMOVAPS [RESULT], XMM0
  15.   end;
  16.  
  17.   class operator TXMFloat4./(constref A, B: TXMFloat4): TXMFloat4; assembler;
  18.   asm
  19.     VMOVAPS XMM0,[B]
  20.     VMOVAPS XMM1,[A]
  21.     VDIVPS XMM0,XMM1, XMM0
  22.     VMOVAPS [RESULT], XMM0
  23.   end;

Note how B goes first in the second two operators. Tested both of these and again, they work fine. Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.
Title: Re: AVX and SSE support question
Post by: Thaddy on November 20, 2017, 05:50:08 pm
Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.
As far as you can tell?... Always use packed when you want to defeat the compiler by using inline assembler.... Otherwise you are in trouble before you know it.
Anyway, I will check pure pascal against its assembler output myself, just curious....
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 20, 2017, 10:32:41 pm
Well, apart from the inconsequential command line.....  >:D
Your problem is with V3.
It would be really helpful if you examine the actual assembler output. So: -a option and examine .s

Also note you need to pack both arrays and records...

Thanks for your advice, I packed the arrays and record. I also checked the .s file; the generated code is the same  :D

You just need to invert A and B in the assembler for subtraction and division. Basically it should look like this:

Code: Pascal  [Select][+][-]
  1.   class operator TXMFloat4.-(constref A, B: TXMFloat4): TXMFloat4; assembler;
  2.   asm
  3.     VMOVAPS XMM0,[B]
  4.     VMOVAPS XMM1,[A]
  5.     VSUBPS XMM0,XMM1, XMM0
  6.     VMOVAPS [RESULT], XMM0
  7.   end;
  8.  

Note how B goes first in the second two operators. Tested both of these and again, they work fine. Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.
Yes, it was the problem, the inversion, but your trick is not the real solution (it doesn't work with the 2nd overloaded operators (V:TheVector; F:Single)).
I've done some research on the instructions and tested, so the correct form is:
VSUBPS XMM0,XMM0,XMM1 - where the 1st operand is the destination, not the 3rd as I thought.
Now the results are OK.

Thanks
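To make the operand-order pitfall concrete, here is a lane-by-lane scalar model (sketched in Python purely for illustration, using the V1/V2 values from the output above) of the three-operand AVX form VSUBPS dst, src1, src2, which computes dst = src1 - src2. Swapping src1 and src2 simply negates the result, which matches the NOK pattern above.

```python
# Scalar model of AVX three-operand subtraction:
# VSUBPS dst, src1, src2  ->  dst[i] = src1[i] - src2[i]
def vsubps(src1, src2):
    return [a - b for a, b in zip(src1, src2)]

V1 = [5.0, 5.0, 5.0, 0.5]
V2 = [2.5, 2.5, 2.5, 0.5]

print(vsubps(V1, V2))  # [2.5, 2.5, 2.5, 0.0]    <- intended V1 - V2
print(vsubps(V2, V1))  # [-2.5, -2.5, -2.5, 0.0] <- operands swapped: negated
```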


Title: Re: AVX and SSE support question
Post by: Nitorami on November 20, 2017, 11:12:42 pm
I took this assembler code and tried to make a small benchmark. The results seemed OK at first, although performance was not significantly better than with native FPC code... but when I start modifying code parts outside the assembler routines, I get access violations which are reproducible but appear random. E.g. the below works if calling Subtract() from within a loop, but otherwise it crashes... it also crashes when printing the results after the subtraction. This is rather dubious, and I guess there may be a bit more to consider when using assembler.


Code: Pascal  [Select][+][-]
  1. {$ASMMODE INTEL}
  2. uses sysutils;
  3. type float4 = packed array [0..3] of single;
  4.  
  5.  
  6. function Subtract (constref A, B: float4): float4; assembler; inline;
  7. asm
  8.   VMOVAPS XMM0,[B]
  9.   VMOVAPS XMM1,[A]
  10.   VSUBPS XMM0,XMM1, XMM0
  11.   VMOVAPS [RESULT], XMM0
  12. end;
  13.  
  14.  
  15. var c: float4;
  16.     n: integer;
  17. const a : float4 = (1,2,3,4);
  18. const b : float4 = (5,6,7,8);
  19.  
  20. begin
  21. //  c := subtract (a,b);  //outside the loop -> crash
  22. // for n := 1 to 1 do  c := Subtract (a,b); //in the loop it does not crash
  23. end.
  24.  
  25.  
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 20, 2017, 11:31:07 pm
Yes, it was the problem, the inversion, but your trick is not the real solution (it doesn't work with the 2nd overloaded operators (V:TheVector; F:Single)).
I've done some research on the instructions and tested, so the correct form is:
VSUBPS XMM0,XMM0,XMM1 - where the 1st operand is the destination, not the 3rd as I thought.
Now the results are OK.
Thanks

Yeah, that was just a quick suggestion and certainly wouldn't work if the second parameter was a single floating-point value instead of another vector. Inverting it the way you're doing it above is definitely a better all-around solution.

Also, @Nitorami: your code works fine for me, both in and outside of the loop.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 21, 2017, 12:12:03 am
Continuing my research, I've added this code for SSE:

Code: Pascal  [Select][+][-]
  1. Const
  2.   NullVector4f : TGLZVector4f =(v:(0,0,0,0)); // or =(x:0;y:0;z:0;w:0);
  3.  
  4.  
  5. function TGLZVector4f.Negate :TGLZVector4f;assembler;
  6. asm
  7.     movups xmm1,[RCX]    // RCX = Self
  8.     movups xmm0,[NullVector4f]
  9.     subps xmm0,xmm1
  10.     movups [Result],xmm0 //RDX = Result
  11. End;


Sometimes it compiles but gives wrong results; sometimes it's good, but not very often, and most of the time it just stops with a SIGSEGV on the
line: movups xmm0,[NullVector4f]  >:D

And I also have this message that appears sometimes:
project1.lpr(22,0) Warning: Object file "unit1.o" contains 32-bit absolute relocation to symbol ".data.n_tc_$unit1_$$_nullvector4f". I don't understand it.

For info, I'm using Lazarus 1.8rc4 64-bit on Windows 10.
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 21, 2017, 01:54:23 am
Negation is also an overloadable operator that uses the same symbol as subtraction, by the way.
 
You don't have to worry about it conflicting with the subtraction overload either, as the compiler recognizes that they're not the same thing since they have different numbers of parameters. Here are the SSE and AVX versions of it:

Code: Pascal  [Select][+][-]
  1.   class operator TGLZVector4F.-(constref A: TGLZVector4F): TGLZVector4F; assembler; //SSE
  2.   asm
  3.     MOVAPS XMM1,[A]
  4.     MOVAPS XMM0,[NullVector4F]
  5.     SUBPS XMM0,XMM1
  6.     MOVAPS [Result],XMM0
  7.   end;
  8.  
  9.   class operator TGLZVector4F.-(constref A: TGLZVector4F): TGLZVector4F; assembler; //AVX
  10.   asm
  11.     VMOVAPS XMM1,[A]
  12.     VMOVAPS XMM0,[NullVector4F]
  13.     VSUBPS XMM0,XMM0,XMM1
  14.     VMOVAPS [Result],XMM0
  15.   end;
  16.  
  17.   //So to use these you would obviously just do B := -A, or V2 := -V1 or whatever

Both of those work fine for me, with no compiler warnings or any crashes/invalid output after running them a whole bunch of times. Again, I'm using the aligned versions of the "MOV" functions as opposed to the unaligned ones.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 21, 2017, 02:26:41 pm
Hi, thanks Akira. I did not think about it, but it does not resolve the problem; on my PC I always get a SIGSEGV and this message:
project1.lpr(22,0) Warning: Object file "unit1.o" contains 32-bit absolute relocation to symbol ".data.n_tc_$unit1_$$_nullvector4f".

Something is wrong with my configuration.
Title: Re: AVX and SSE support question
Post by: marcov on November 21, 2017, 03:16:07 pm
Hi, thanks Akira. I did not think about it, but it does not resolve the problem; on my PC I always get a SIGSEGV and this message:
project1.lpr(22,0) Warning: Object file "unit1.o" contains 32-bit absolute relocation to symbol ".data.n_tc_$unit1_$$_nullvector4f".

Something is wrong with my configuration.

Try

Code: Pascal  [Select][+][-]
  1.   MOVAPS XMM0,[RIP+NullVector4F]
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 21, 2017, 03:44:07 pm

Try

Code: Pascal  [Select][+][-]
  1.   MOVAPS XMM0,[RIP+NullVector4F]

Thanks Marcov, RIP-relative addressing solves the problem. But why does it seem to work for Akira without it?
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 21, 2017, 08:19:20 pm
Not sure why they both consistently work here (or why I don't get any warnings... enabled all the verbosity settings and still nothing.) Could be any number of things (CPU differences/compiler version differences/e.t.c.)

Also the "rip-relative" addressing style didn't cross my mind before, but it definitely is a safer way overall so I'd probably just stick with that for stuff like the negation operator.
Title: Re: AVX and SSE support question
Post by: Nitorami on November 21, 2017, 08:56:19 pm
On my environments (Win7 and Win10, 32-bit) I sometimes (but then consistently) get access violations with these routines. It seems to depend, probably amongst other things, on the settings of optimisation level 2 and regvar. While O2 is known for some bugs, I had no problems with regvar so far. Therefore I think that something serious is wrong with these assembler routines - saving registers, calling convention, stack issues, alignment - whatever.
I experienced similar problems years ago, when I thought I could optimise my code using self-made assembler routines... which suddenly stopped working when I changed code at entirely different places in the program. My lesson was - if you do not know exactly what you are doing, don't use assembler.
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 21, 2017, 09:39:10 pm
99% of the time I'd agree with you, as in most cases assembly hand-written by people is unlikely to be better or even close to as good as what FPC produces at high optimization levels.

However this is sort of an edge case where it's specifically known that the compiler isn't currently capable of generating working "vectorized" ASM from Pascal at any setting. They're also extremely simple 4-line methods that are pretty much in line with what GCC would generate at Ofast or MSVC would generate with the vectorcall extension turned on, so there's not a whole lot that can (or should) go wrong.
Title: Re: AVX and SSE support question
Post by: engkin on November 21, 2017, 10:06:56 pm
I guess that a problem is seen sometimes on 32bit code, and not seen on 64bit code.

Akira, are you testing 64bit code?

Nitorami, 32bit?
Title: Re: AVX and SSE support question
Post by: Akira1364 on November 22, 2017, 12:37:37 am
Yeah, 64 bit compiled with trunk FPC. CPU is an i7-4790k.
Nitorami actually already said they were using 32-bit in their last post, by the way.
Title: Re: AVX and SSE support question
Post by: marcov on November 22, 2017, 10:00:00 am
I use FPC sse/avx assembler in (64-bit) production applications. Never found a problem that I couldn't explain by looking at the generated code.

-Sv is different and buggy, but assembler is usually straightforward.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 23, 2017, 12:54:40 am
Hi to all, thanks for the explanations

I've made a little test app with 3 clones of the same record: one for pure Pascal, one for SSE, and the 3rd for AVX.
I've included the basic vector functions (Length, Distance, DotProduct, CrossProduct, Normalize, ...).
I've put in some comments.
I suggest you look especially at the SSE DotProduct function; in comments you'll find the other versions (SSE1, SSE2, SSE3 and SSE4 tests).
This is just a test, so some functions are not optimized yet  ;D
The app compiles without any exceptions or compiler warnings  8-)

In order to make a comparison between our PCs and configurations:

My PC
- CPU                            : AMD A10-7870K Radeon R7, 12 Compute Cores 4C+8G
- Supported Instructions : MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, AMD 64, NX, VMX, AES, AVX, FMA3, FMA4
- OS                              : Windows 10 64-bit
- DEV                            : Lazarus 1.8rc4 / FPC 3.0.2

All suggestions are welcome.

Notice : the unit is for 64-bit and Windows, for others see the comment on the top of the unit

Title: Re: AVX and SSE support question
Post by: dicepd on November 23, 2017, 10:07:49 am
Hi Jerome,

Nice work with the vector lib; that's the cleanest Pascal code I have come across for vector math, and it demonstrates what advanced records can really do.

Tested on Win7 64 on both my AMD and Intel desktops with no problems. Plugging the number ranges I use into some of the vectors, thankfully I see no loss of precision using the rsqrtps in normalize.

Loaded it into a Linux VM and I have made your nice neat code all messy with some Unix defines  (I selected Unix for now as I am about to upgrade my FreeBSD boxen to test there). This is still for 64 bit linux not tested in 32bit as I have no 32 bit OSes any more.

Peter
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 23, 2017, 02:31:51 pm
Hi Jerome,

Nice work with the vector lib; that's the cleanest Pascal code I have come across for vector math, and it demonstrates what advanced records can really do.

Tested on Win7 64 on both my AMD and Intel desktops with no problems. Plugging the number ranges I use into some of the vectors, thankfully I see no loss of precision using the rsqrtps in normalize.

Loaded it into a Linux VM and I have made your nice neat code all messy with some Unix defines  (I selected Unix for now as I am about to upgrade my FreeBSD boxen to test there). This is still for 64 bit linux not tested in 32bit as I have no 32 bit OSes any more.

Peter

Hi Peter,

Thanks. It's cool you've added and tested for Unix  8)
Now I can add more functions like min, max, clamp, refract, reflect... and begin to work with arrays.
Then I can add this to my project and improve GLScene for the next big update  ;)
If someone could test under Linux 32-bit and Mac too, it would be very helpful.

cheers
Title: Re: AVX and SSE support question
Post by: dicepd on November 23, 2017, 07:23:28 pm
Ok, finally got an almost-working system on FreeBSD; just got to get gdb working, but I did manage to test the app. The numbers are fine, but I get

GLZVectorMath.pas(478,38) Warning: Exported/global symbols should be accessed via the GOT

from these lines   
Code: Pascal  [Select][+][-]
  1.  movups xmm0,[RIP+cNullSSEVector4f]
  2.  vmovups xmm0,[RIP+cNullAVXVector4f]
  3.  

whatever that warning means.
Title: Re: AVX and SSE support question
Post by: schuler on November 23, 2017, 07:52:43 pm
 :) Hello Pascal Lovers  :)

Decided to share a piece of FPC source code that I find clever in the hope it's helpful:
Code: Pascal  [Select][+][-]
  1. function CompareByte(Const buf1,buf2;len:SizeInt):SizeInt; assembler; nostackframe;
  2. { win64: rcx buf, rdx buf, r8 len
  3.   linux: rdi buf, rsi buf, rdx len }
  4. asm
  5. {$ifndef win64}
  6.     mov    %rdx, %r8
  7.     mov    %rsi, %rdx
  8.     mov    %rdi, %rcx
  9. {$endif win64}

Above code deals with different calling conventions.

I saw the code about "negate" somewhere; instead of using a 4-element constant array, BROADCAST can be used. This example has been copied from the uvolume.pas unit:
Code: Pascal  [Select][+][-]
  1.   mov rdx, FillOpPtr
  2.   VBROADCASTSS ymm0, [rdx]

In the example above, all 8 elements will be filled with the single value pointed by FillOpPtr.
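A lane-by-lane sketch (Python, for illustration only) of what the broadcast does, and of how a broadcast zero could stand in for a stored null-vector constant in a negate routine:

```python
# Scalar model of VBROADCASTSS ymm0, [mem]: replicate one single-precision
# value into every lane of the destination register.
def vbroadcastss(value, lanes=8):
    return [value] * lanes

# Negation without a stored NullVector constant: broadcast 0.0,
# then subtract the vector from it, lane by lane.
def negate(v):
    zeros = vbroadcastss(0.0, lanes=len(v))
    return [z - x for z, x in zip(zeros, v)]

print(vbroadcastss(1.5))             # [1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
print(negate([5.0, 5.0, 5.0, 0.5]))  # [-5.0, -5.0, -5.0, -0.5]
```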

 :) Wish everyone happy coding :)

Title: Re: AVX and SSE support question
Post by: dicepd on November 23, 2017, 08:09:09 pm
Looking at the number of Self registers and combinations, ifdefs could get quite messy very quickly. I just tested one alternative that works, using macros at the top of the file.

Code: Pascal  [Select][+][-]
  1. {$MACRO ON}
  2. {$ifdef UNIX}
  3.   {$ifdef CPU64}
  4.     {$define ASM_VMOVUPS_SELF:=asm vmovups xmm0,[RDI]}
  5.     {$define ASM_VMOVAPS_SELF:=asm vmovaps xmm0,[RDI]}
  6.     {$define ASM_MOVUPS_SELF:=asm movups xmm0,[RDI]}
  7.     {$define ASM_MOVAPS_SELF:=asm movaps xmm0,[RDI]}
  8.   {$else}
  9.     {$define ASM_VMOVUPS_SELF:=asm vmovups xmm0,[EDI]}
  10.     {$define ASM_VMOVAPS_SELF:=asm vmovaps xmm0,[EDI]}
  11.     {$define ASM_MOVUPS_SELF:=asm movups xmm0,[EDI]}
  12.     {$define ASM_MOVAPS_SELF:=asm movaps xmm0,[EDI]}
  13.   {$endif}
  14. {$else}
  15.   {$ifdef CPU64}
  16.     {$define ASM_VMOVUPS_SELF:=asm vmovups xmm0,[RCX]}
  17.     {$define ASM_VMOVAPS_SELF:=asm vmovaps xmm0,[RCX]}
  18.     {$define ASM_MOVUPS_SELF:=asm movups xmm0,[RCX]}
  19.     {$define ASM_MOVAPS_SELF:=asm movaps xmm0,[RCX]}
  20.   {$else}
  21.     {$define ASM_VMOVUPS_SELF:=asm vmovups xmm0,[ECX]}
  22.     {$define ASM_VMOVAPS_SELF:=asm vmovaps xmm0,[ECX]}
  23.     {$define ASM_MOVUPS_SELF:=asm movups xmm0,[ECX]}
  24.     {$define ASM_MOVAPS_SELF:=asm movaps xmm0,[ECX]}
  25.   {$endif}
  26. {$endif}    
  27.  

Then the routines would look something like this:

Code: Pascal  [Select][+][-]
  1. function TGLZAVXVector4f.DotProduct(constref A: TGLZAVXVector4f):Single;assembler;
  2.   ASM_VMOVUPS_SELF
  3.   vmovups xmm1, [A]
  4.   vdpps xmm0, xmm0, xmm1, 01110001b //or $F1
  5.   movlps [Result], xmm0
  6. end;
  7.  

Advantage ifdefs removed from bulk of code.
DisAdvantage Asm colouring does not work.

Tested as working here, but just as a suggestion. BTW, macros will not otherwise work inside an asm section.

Peter
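As a side note on that vdpps line: the imm8 (01110001b, i.e. $71) does two jobs. Here is a scalar model of the SSE4.1 DPPS semantics (Python, illustration only; element numbering as in the Intel manual):

```python
# Scalar model of DPPS dst, src, imm8 (SSE4.1):
#  - imm8 bits 4..7 select which element products enter the sum
#  - imm8 bits 0..3 select which destination lanes receive the sum (others = 0)
def dpps(a, b, imm8):
    total = sum(x * y for i, (x, y) in enumerate(zip(a, b))
                if imm8 & (1 << (4 + i)))
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]

# $71 = 01110001b: multiply lanes X, Y, Z (W ignored), store the sum in lane 0.
print(dpps([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 0x71))  # [38.0, 0.0, 0.0, 0.0]
```

So the W component is excluded from the dot product, and movlps then only needs lane 0 for the Single result.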
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 23, 2017, 10:02:05 pm
@schuler: thanks, it will be useful for procedures.

@Peter: macros are a good solution I think too, but they don't work for me  >:( and I prefer the "old style", due to needing to add the asm keyword in the macro  :-[

With the problem :
Code: Pascal  [Select][+][-]
  1. movups xmm0,[RIP+cNullSSEVector4f]
  2. vmovups xmm0,[RIP+cNullAVXVector4f

It's what we talked about a few messages above with Akira and Marcov, so try just removing the RIP register,
e.g.:
Code: Pascal  [Select][+][-]
  1. movups xmm0,[cNullSSEVector4f]

 
Title: Re: AVX and SSE support question
Post by: dicepd on November 23, 2017, 11:16:01 pm
It would appear that the only way I can see to get rid of the warning and make the routines safe for BSD and OSX is to use routine-local consts. A bit of reading suggests that global consts and position-independent code (which can be randomised by these OSes) do not sit well together, and could result in more cycles trying to work out where the data segment actually is in memory to retrieve the global.

Peter.

 
Title: Re: AVX and SSE support question
Post by: dicepd on November 23, 2017, 11:32:03 pm

@Peter: macros are a good solution I think too, but they don't work for me  >:(


Try this...
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 25, 2017, 01:18:32 am
Hi, I'm trying to implement some other operators like =, <=, < etc...

To begin with, I tried =

Code: Pascal  [Select][+][-]
  1. class operator TGLZSSEVector4f.= (constref A, B: TGLZSSEVector4f): boolean; assembler;
  2. asm
  3.   movups xmm0,[A]  
  4.   movups xmm1,[B]
  5.   {$IFDEF USE_ASM_SSE_4}
  6.   cmpeqps xmm0,xmm1
  7.   ptest    xmm0, xmm1
  8.   jz @no_differences
  9.   mov [RESULT],FALSE
  10.   jmp @END_SSE
  11.   {$ELSE}
  12.   cmpeqps  xmm0, xmm1    // 0:A and B are ordered and equal.  -1:not ieee_equal.
  13.   //andnps    xmm0, xmm1
  14.   movmskps  eax, xmm0
  15.   test      eax, eax
  16.   //or eax, eax
  17.   jz @no_differences
  18.   mov [RESULT],FALSE
  19.   jmp @END_SSE
  20.   {$ENDIF}
  21.   @no_differences:
  22.   mov [RESULT],TRUE
  23.   @END_SSE:
  24. end;

But this doesn't work (both for SSE and SSE4): it always returns TRUE,
with for example V1 = V1 and V1 = V2 (V1 and V2 are 2 different vectors, of course).

Must I add some (or just one) PUSH/POP? Help is welcome; perhaps I don't understand something  :-[ about MOVMSKPS or PTEST.

Thanks
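One possible reading of the failure, for what it's worth: the comment in the code above has the CMPEQPS semantics inverted. CMPEQPS sets a lane to all-ones when the lanes ARE equal (0 when they differ), so after MOVMSKPS a fully-equal pair gives mask 1111b, not 0. A scalar sketch (Python, illustration only):

```python
# Scalar model of CMPEQPS + MOVMSKPS:
# CMPEQPS sets a lane to all-ones when the lanes are EQUAL (not zero);
# MOVMSKPS then collects one bit per lane (the lane's sign bit).
def cmpeq_mask(a, b):
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            mask |= 1 << i
    return mask

def vec_equal(a, b):
    # All four lanes equal <=> mask == 1111b.  A "jz" taken after
    # TEST EAX,EAX actually means "no lane was equal", so the posted
    # code branches the wrong way round.
    return cmpeq_mask(a, b) == 0b1111

V1 = [5.0, 5.0, 5.0, 0.5]
V2 = [2.5, 2.5, 2.5, 0.5]
print(cmpeq_mask(V1, V1), vec_equal(V1, V1))  # 15 True
print(cmpeq_mask(V1, V2), vec_equal(V1, V2))  # 8 False (only W matches)
```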
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 26, 2017, 04:01:48 pm
Hello, so I found a solution for comparing vectors: use CMPPS with a flag instead of the cmpXXps instructions.

I've also optimized and corrected the SSE/SSE2 functions.
I've added SSE3/SSE4 support for some functions and synchronized with AVX.
I've added a bunch of functions like min, max, clamp, negate, lerp, anglecosine, reflect etc...
I've added some procedures for doing chained computing (not tested yet).

Now I have a very strange bug with the MOVSS instruction.

Take a look:
with SSE I have these 2 functions. They raise warnings (see the comments inside).

Code: Pascal  [Select][+][-]
  1. function TGLZSSEVector4f.Combine2(constref V2: TGLZSSEVector4f; constref F1:Single;Constref F2: Single): TGLZSSEVector4f;assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      movups xmm0,[RDI]
  6.   {$else}
  7.      movups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      movups xmm0,[RCX]
  12.   {$else}
  13.      movups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.   movups xmm1, [V2]
  17.   movss xmm2, [F1]
  18.   movss xmm3, [F2]   //---> WARNING GLZVectorMath.pas(1869,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]"
  19.  
  20.   shufps xmm2, xmm2, $00 // replicate
  21.   shufps xmm3, xmm3, $00 // replicate
  22.  
  23.   mulps xmm0, xmm2  // Self * F1
  24.   mulps xmm1, xmm3  // V2 * F2
  25.  
  26.   addps xmm0, xmm1  // (Self * F1) + (V2 * F2)
  27.  
  28.   andps xmm0, [RIP+cSSE_MASK_NO_W]
  29.   movups [RESULT], xmm0
  30. end;  
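When the asm variants disagree, a scalar reference helps pin down the intended arithmetic. This Python sketch (illustration only; it assumes cSSE_MASK_NO_W simply zeroes the W lane, as the name suggests) models what Combine2 is meant to compute:

```python
# Scalar reference for Combine2:
#   Result = Self * F1 + V2 * F2, with W forced to 0
#   (the final andps with cSSE_MASK_NO_W).
def combine2(self_v, v2, f1, f2):
    r = [a * f1 + b * f2 for a, b in zip(self_v, v2)]
    r[3] = 0.0  # mask out the W lane
    return r

print(combine2([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 2.0, 3.0))
# [17.0, 22.0, 27.0, 0.0]
```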

Code: Pascal  [Select][+][-]
  1. function TGLZSSEVector4f.Combine3(constref V2, V3: TGLZSSEVector4f; constref F1, F2, F3: Single): TGLZSSEVector4f;  assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      movups xmm0,[RDI]
  6.   {$else}
  7.      movups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      movups xmm0,[RCX]
  12.   {$else}
  13.      movups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.  
  17.   movups xmm1, [V2]
  18.   movups xmm4, [V3]
  19.  
  20.   movss xmm2, [F1] //---> WARNING GLZVectorMath.pas(1902,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]"
  21.   movss xmm3, [F2] //---> WARNING GLZVectorMath.pas(1903,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]"
  22.   movss xmm5, [F3] //---> WARNING GLZVectorMath.pas(1904,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]"
  23.  
  24.   shufps xmm2, xmm2, $00 // replicate
  25.   shufps xmm3, xmm3, $00 // replicate
  26.   shufps xmm5, xmm5, $00 // replicate
  27.  
  28.   mulps xmm0, xmm2 // Self * F1
  29.   mulps xmm1, xmm3 // V2 * F2
  30.   mulps xmm4, xmm5 // V3 * F3
  31.  
  32.   addps xmm0, xmm1 // (Self * F1) + (V2 * F2)
  33.   addps xmm0, xmm4 // ((Self * F1) + (V2 * F2)) + (V3 * F3)
  34.  
  35.   andps xmm0, [RIP+cSSE_MASK_NO_W]
  36.   movups [RESULT], xmm0
  37. end;  

And now the AVX; it RAISES AN ERROR (same for Combine3):

Code: Pascal  [Select][+][-]
  1. function TGLZAVXVector4f.Combine2(constref V2: TGLZAVXVector4f; Constref F1, F2: Single): TGLZAVXVector4f;assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      vmovups xmm0,[RDI]
  6.   {$else}
  7.      vmovups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      vmovups xmm0,[RCX]
  12.   {$else}
  13.      vmovups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.   vmovss xmm2, [F1]
  17.   vmovss xmm3, [F2]  //--> ERROR : GLZVectorMath.pas(3465,3) Error: Invalid register used in memory reference expression: "xmm3"
  18.  
  19.   vmovups xmm1, [V2]
  20.  
  21.   vshufps xmm2, xmm2, xmm2, $00 // replicate
  22.   vshufps xmm3, xmm3, xmm3, $00 // replicate
  23.  
  24.   vmulps xmm0, xmm0, xmm2  // Self * F1
  25.   vmulps xmm1, xmm1, xmm3  // V2 * F2
  26.  
  27.   vaddps xmm0, xmm0, xmm1  // (Self * F1) + (V2 * F2)
  28.  
  29.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  30.   vmovups [RESULT], xmm0
  31. end;

And with these two functions in SSE and AVX, NO WARNING / NO ERROR:

Code: Pascal  [Select][+][-]
  1. function TGLZSSEVector4f.Combine(constref V2: TGLZSSEVector4f; constref F1: Single): TGLZSSEVector4f;assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      movups xmm0,[RDI]
  6.   {$else}
  7.      movups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      movups xmm0,[RCX]
  12.   {$else}
  13.      movups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.   movups xmm1, [V2]
  17.   movss xmm2, [F1]
  18.   shufps xmm2, xmm2, $00 // replicate
  19.  
  20.   mulps xmm1, xmm2 //V2*F1
  21.   addps xmm0, xmm1 // Self + (V2*F1)
  22.  
  23.   andps xmm0, [RIP+cSSE_MASK_NO_W]
  24.   movups [RESULT], xmm0
  25. end;  
  26.  
  27. function TGLZAVXVector4f.Combine(constref V2: TGLZAVXVector4f; constref F1: Single): TGLZAVXVector4f;assembler;
  28. asm
  29. {$ifdef UNIX}
  30.   {$ifdef CPU64}
  31.      vmovups xmm0,[RDI]
  32.   {$else}
  33.      vmovups xmm0,[EDI]
  34.   {$endif}
  35. {$else}
  36.   {$ifdef CPU64}
  37.      vmovups xmm0,[RCX]
  38.   {$else}
  39.      vmovups xmm0,[ECX]
  40.   {$endif}
  41. {$endif}
  42.   vmovups xmm1, [V2]
  43.   vmovss xmm2, [F1]
  44.   vshufps xmm2, xmm2, xmm2, $00 // replicate
  45.  
  46.   vmulps xmm1, xmm1, xmm2 //V2*F1
  47.   vaddps xmm0, xmm0, xmm1 // Self + (V2*F1)
  48.  
  49.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  50.   vmovups [RESULT], xmm0
  51. end;
  52.  

So it seems there is a problem with the compiler; it's very strange, because with these 2 other functions there are no problems:

Code: Pascal  [Select][+][-]
  1. function TGLZSSEVector4f.Clamp(constref AMin, AMax: Single): TGLZSSEVector4f; assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      movups xmm0,[RDI]
  6.   {$else}
  7.      movups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      movups xmm0,[RCX]
  12.   {$else}
  13.      movups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.   movss xmm2, [AMin]
  17.   movss xmm3, [AMax]
  18.   shufps xmm2, xmm2, $00 // Replicate AMin
  19.   shufps xmm3, xmm3, $00 // Replicate AMax
  20.   maxps  xmm0, xmm2
  21.   minps  xmm0, xmm3
  22.   movups [Result], xmm0
  23. end;
  24.  
  25. function TGLZAVXVector4f.Clamp(constref AMin, AMax: Single): TGLZAVXVector4f; assembler;
  26. asm
  27. {$ifdef UNIX}
  28.   {$ifdef CPU64}
  29.      vmovups xmm0,[RDI]
  30.   {$else}
  31.      vmovups xmm0,[EDI]
  32.   {$endif}
  33. {$else}
  34.   {$ifdef CPU64}
  35.      vmovups xmm0,[RCX]
  36.   {$else}
  37.      vmovups xmm0,[ECX]
  38.   {$endif}
  39. {$endif}
  40.   vmovss xmm2, [AMin]
  41.   vmovss xmm3, [AMax]
  42.   vshufps xmm2, xmm2, xmm2, $00
  43.   vshufps xmm3, xmm3, xmm3, $00
  44.   vmaxps  xmm0, xmm0, xmm2
  45.   vminps  xmm0, xmm0, xmm3
  46.   vmovups [Result], xmm0
  47. end;

You can try with the updated sample project attached here.

Note I've commented out the functions Combine2 and Combine3, so uncomment them for testing.

By default it uses SSE/SSE2 code. See the directives at the top of the unit to change this.

Thanks in advance for your tests and help.



Title: Re: AVX and SSE support question
Post by: dicepd on November 29, 2017, 02:10:20 am
Hi Jerome,

Just got round to testing this in Linux. I have just moved my main dev box over to Linux for good now, so that took a couple of days. BTW this box with decent drivers solves all the control issues I was having. GLScene now works just fine for me.

Anyway back to the testing, as tested it just kept crashing on
 
Code: Pascal  [Select][+][-]
  1. andps xmm0, [RIP+cSSE_MASK_NO_W]
and similar lines. My first solution was to read the generated native code, which seemed to do a move first, as in

Code: Pascal  [Select][+][-]
  1.  movups xmm3, [RIP+cSSE_MASK_NO_W]
  2.  andps xmm0, xmm3

I tried this and, while it worked, I was not happy having to add another instruction, as this is meant to be an optimised lib.

A bit more googling and reading, and a better solution came to light. It would seem that you are getting your consts aligned correctly, while on Linux they were not, therefore generating an error. So I rolled back the first set of changes and added

Code: Pascal  [Select][+][-]
  1. {$CODEALIGN CONSTMIN=16}  

This worked fine.

Peter
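For context on why {$CODEALIGN CONSTMIN=16} fixes the crash: SSE instructions that take a 128-bit memory operand directly (andps, movaps, ...) fault unless the address is 16-byte aligned, while movups tolerates any address at the cost of an extra instruction when the operand must end up in a register first. A minimal sketch (Python, illustration only) of that alignment rule and of the padding the directive effectively guarantees:

```python
# 16-byte alignment rule for direct SSE memory operands.
def is_aligned(addr, boundary=16):
    # andps/movaps with a memory operand require addr % 16 == 0
    return addr % boundary == 0

def align_up(addr, boundary=16):
    # Padding the compiler inserts when consts are aligned to 16 bytes
    return (addr + boundary - 1) // boundary * boundary

print(is_aligned(0x1004))     # False -> andps [addr] would fault (SIGSEGV)
print(hex(align_up(0x1004)))  # 0x1010, the next safe address
```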
Title: Re: AVX and SSE support question
Post by: dicepd on November 29, 2017, 04:02:34 am
Testing Combine2 and Combine3, I am afraid I can't offer any help there, as they all just work for me with no warnings or errors.

One other thing I looked at again was the "Exported/global symbols should be accessed via the GOT" warning; there are now so many that they are annoying.

One solution is to move the consts to the implementation section, as they are not going to be required by the end user; I see you have added the TGLZVector type definition, and exported consts can be declared using this type.

And the last thing before I go back to my own problems: you are trashing the stack where you want a Single return by using a [v]mov[lua]ps instruction; you should only need a movss to return a Single.
 
Peter
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 29, 2017, 04:28:14 pm
Hi Peter thanks for testing

For Combine2 and Combine3, are the results correct on your side?

It's very strange. I tried changing the order of the args, like:

  function TGLZSSEVector4f.Combine2(constref F1, F2: Single;constref V2: TGLZSSEVector4f): TGLZSSEVector4f;
  function TGLZAVXVector4f.Combine3(constref F1, F2, F3: Single;constref V2, V3: TGLZAVXVector4f ): TGLZAVXVector4f;

Now the compiler only reports this in the Combine3 function:  >:(
GLZVectorMath_NEW.pas(2167,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]" on this line :  movss xmm5, [F3] with sse
GLZVectorMath_NEW.pas(3731,3) Error: Asm: [vmovss xmmreg,mem128] invalid combination of opcode and operands  on this line :  vmovss xmm5, [F3] with AVX

There is no warning anymore in Combine2, but the results are wrong relative to the native function  :'(

I've also changed the order of instructions in the Combine3 AVX version to

Code: Pascal  [Select][+][-]
  1. function TGLZAVXVector4f.Combine3(constref F1, F2, F3: Single;constref V2, V3: TGLZAVXVector4f ): TGLZAVXVector4f;  assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      vmovups xmm0,[RDI]
  6.   {$else}
  7.      vmovups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      vmovups xmm0,[RCX]
  12.   {$else}
  13.      vmovups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.   vmovss xmm2, [F1]
  17.   vshufps xmm2, xmm2, xmm2, $00 // replicate
  18.   vmulps xmm0, xmm0, xmm2 // Self * F1
  19.  
  20.   vmovups xmm1, [V2]
  21.   vmovss xmm3, [F2]
  22.   vshufps xmm3, xmm3, xmm3, $00 // replicate
  23.   vmulps xmm1, xmm1, xmm3 // V2 * F2
  24.  
  25.   vaddps xmm0, xmm0, xmm1 // (Self * F1) + (V2 * F2)
  26.  
  27.   vmovups xmm4, [V3]
  28.   movss xmm5, [F3]
  29.   vshufps xmm5, xmm5, xmm5, $00 // replicate
  30.   vmulps xmm4, xmm4, xmm5 // V3 * F3
  31.  
  32.   vaddps xmm0, xmm0, xmm4 // ((Self * F1) + (V2 * F2)) + (V3 * F3)
  33.  
  34.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  35.   vmovups [RESULT], xmm0
  36. end;
  37.  

and now I get GLZVectorMath_NEW.pas(3741,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]" on this line:  vmovss xmm5,[F3], instead of the error message. If a guru comes by, perhaps they can give an explanation, because here I'm totally lost.  :'(

I've also tried this (no error, no warning), but the result is the same as Combine2 and not correct relative to the native function  >:D It's like the third operation (V3*F3) is not computed, or is set to ZERO  >:(

Code: Pascal  [Select][+][-]
  1. function TGLZAVXVector4f.Combine3(constref F1, F2, F3: TGLZAVXVector4f;constref V2, V3: TGLZAVXVector4f ): TGLZAVXVector4f;  assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      vmovups xmm0,[RDI]
  6.   {$else}
  7.      vmovups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      vmovups xmm0,[RCX]
  12.   {$else}
  13.      vmovups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16.   vmovups xmm2, [F1]
  17.   vmulps xmm0, xmm0, xmm2 // Self * F1
  18.  
  19.   vmovups xmm1, [V2]
  20.   vmovups xmm3, [F2]
  21.   vmulps xmm1, xmm1, xmm3 // V2 * F2
  22.  
  23.   vmovups xmm4, [V3]
  24.   vmovups xmm5, [F3]
  25.   vmulps xmm4, xmm4, xmm5 // V3 * F3
  26.  
  27.   vaddps xmm0, xmm0, xmm1 // (Self * F1) + (V2 * F2)
  28.   vaddps xmm0, xmm0, xmm4 // ((Self * F1) + (V2 * F2)) + (V3 * F3)
  29.  
  30.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  31.   vmovups [RESULT], xmm0
  32. end;  
  33.  
So I think it is a compiler bug under Windows 64-bit (not tested in 32-bit), but why only here?

Peter, for

Code: Pascal  [Select][+][-]
  1. andps xmm0, [RIP+cSSE_MASK_NO_W]

Have you tried without the RIP?

Title: Re: AVX and SSE support question
Post by: dicepd on November 29, 2017, 04:50:58 pm
First off, if I remove RIP the test crashes.

Next, here is a screenshot of my results. I think this may be down to alignment, as with the consts, although my recent reading states that the stack in a 64-bit OS is already 16-byte aligned. As I am getting the correct result, the code as of the last zip is probably correct; it is just not getting the right numbers from the stack.

I will try to play with the code I currently have in Windows and see what happens here.
Title: Re: AVX and SSE support question
Post by: dicepd on November 29, 2017, 07:11:45 pm
Ok, I got a Win7 64-bit VM up and am stepping through Combine3, comparing registers between Linux and Win7.

Vectors load in fine in both OSes.



Load F1 into xmm2
 1.5 in Linux
{v4_float = {1.5, 0, 0, 0}, v2_double = {5.2842668622670356e-315, 0}, v16_int8 = {0, 0, -64, 63, 0 <repeats 12 times>}, v8_int16 = {0, 16320, 0, 0, 0, 0, 0, 0}, v4_int32 = {1069547520, 0, 0, 0}, v2_int64 = {1069547520, 0}, uint128 = 1069547520}
in Windows
{v4_float = {2.40490394e-038, 0, 0, 0}, v2_double = {8.3840924311424506e-317, 0}, v16_int8 = {120, -17, 2, 1, 0 <repeats 12 times>}, v8_int16 = {-4232, 258, 0, 0, 0, 0, 0, 0}, v4_int32 = {16969592, 0, 0, 0}, v2_int64 = {16969592, 0}, uint128 = 16969592}

Load F2 into xmm3
Linux
{v4_float = {5.5, 0, 0, 0}, v2_double = {5.3619766690650802e-315, 0}, v16_int8 = {0, 0, -80, 64, 0 <repeats 12 times>}, v8_int16 = {0, 16560, 0, 0, 0, 0, 0, 0}, v4_int32 = {1085276160, 0, 0, 0}, v2_int64 = {1085276160, 0}, uint128 = 1085276160}
Windows
{v4_float = {2.4049017e-038, 0, 0, 0}, v2_double = {8.3840884786172839e-317, 0}, v16_int8 = {112, -17, 2, 1, 0 <repeats 12 times>}, v8_int16 = {-4240, 258, 0, 0, 0, 0, 0, 0}, v4_int32 = {16969584, 0, 0, 0}, v2_int64 = {16969584, 0}, uint128 = 16969584}

Load F3 into xmm5
Linux
{v4_float = {6.5999999, 0, 0, 0}, v2_double = {5.3733741064073288e-315, 0}, v16_int8 = {51, 51, -45, 64, 0 <repeats 12 times>}, v8_int16 = {13107, 16595, 0, 0, 0, 0, 0, 0}, v4_int32 = {1087583027, 0, 0, 0}, v2_int64 = {1087583027, 0}, uint128 = 1087583027}
Windows
{v4_float = {2.40489946e-038, 0, 0, 0}, v2_double = {8.3840845260921172e-317, 0}, v16_int8 = {104, -17, 2, 1, 0 <repeats 12 times>}, v8_int16 = {-4248, 258, 0, 0, 0, 0, 0, 0}, v4_int32 = {16969576, 0, 0, 0}, v2_int64 = {16969576, 0}, uint128 = 16969576}

So no wonder you are getting wrong answers. Atm I have not got a clue what is going on, but I will think about it and play some more.

Peter


Update: removing constref from the singles and passing them by value on the stack makes it work. That is probably why the compiler gave the 64-bit message as opposed to a 32-bit one. Singles are probably better off passed by value anyway, as 32 bits of data is smaller than a 64-bit pointer.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 29, 2017, 11:05:49 pm
Thanks Peter, you put me on the right track. After re-reading your message about the consts, I found it; I now have correct results:
First I added this:
Code: Pascal  [Select][+][-]
  1. {$CODEALIGN RECORDMIN=16}

EDIT: I didn't notice at first that this directive now breaks many other functions (see the attached screenshots, the first without it and the second with; see Distance, Length, Norm, Normalize).

but I still get this warning in Combine3 with SSE (no longer in Combine2):
GLZVectorMath_NEW.pas(2167,18) Warning: Check size of memory operand "movss: memory-operand-size is 64 bits, but expected [128 bits]" on this line :  movss xmm5, [F3]
so I changed movss to movlps; no more warning, but the result is still incorrect with Combine3.

But now, for the AVX Combine2 (for the second VMOVSS) and in Combine3 (for all 3 VMOVSS), I always get this error:
GLZVectorMath_NEW.pas(3804,3) Error: Asm: [vmovss xmmreg,mem128] invalid combination of opcode and operands. By changing VMOVSS to MOVLPS there is no more error and the results are correct (NOTE: VMOVLPS gives the same error as above >:( ), except that Combine3 remains incorrect, just like the SSE version.
I don't understand; this behaviour is crazy  :'(
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 29, 2017, 11:24:05 pm
And the second screenshot
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 09:57:46 am
Ok I have looked a bit deeper into this and have found the following.

If the compiler puts any of the single refs on the stack then it returns garbage when we access it using movss.

If all the single refs are allocated to registers then it works perfectly.

Now whether this is a bug or not I do not know (I took one look at 8086 assembly in 1985 and said f@*k this, and have tried to stay away from it since; I have done a lot of assembly for other processors with cleaner instruction sets), so I am not that up on exact memory-addressing syntax; plus having to read Intel syntax as written and AT&T-style generated output gets quite confusing.

So if you check the assembler output on Linux, which works:

Code: Pascal  [Select][+][-]
  1. .Lc664:
  2.         leaq    -16(%rsp),%rsp
  3. # Var V2 located in register rsi
  4. # Var F1 located in register rdx
  5. # Var F2 located in register rcx
  6. # Var $self located in register rdi
  7. # Temp -16,16 allocated
  8. # Var $result located at rbp-16, size=OS_128
  9.         # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
  10.  
  11. # [3512] movss xmm3, [F2]  
  12.    movss        (%rcx),%xmm3
  13.  
  14.  

versus windows

Code: Pascal  [Select][+][-]
  1. # Var V2 located in register r8
  2. # Var F1 located in register r9
  3. # Var $self located in register rcx
  4. # Var $result located in register rdx
  5. .seh_endprologue
  6. # Var F2 located at rbp+48, size=OS_64
  7.         # Register rax,rcx,rdx,r8,r9,r10,r11 allocated
  8. .Ll1456:
  9. # [3495] vmovups xmm0,[RCX]
  10.         vmovups (%rcx),%xmm0
  11. .Ll1457:
  12. # [3500] vmovups xmm1, [V2]
  13.         vmovups (%r8),%xmm1
  14. .Ll1458:
  15. # [3502] movss xmm2, [F2]
  16.         movss   48(%rbp),%xmm2
  17. .Ll1459:                                              
  18.  

Maybe someone else has an idea why the latter falls down and puts garbage into the xmm register.

BTW, not using constref was a bad idea, as in certain optimisation modes the compiler uses xmm registers for procedure parameters.

Peter
Title: Re: AVX and SSE support question
Post by: marcov on November 30, 2017, 10:04:09 am
You load the stack slot as if it were the value. You should instead load the pointer that is ON the stack and dereference it.

so

Code: Pascal  [Select][+][-]
  1. mov rax, [rbp+48]  // or whatever free register
  2. movss xmm2, [rax]

Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 10:31:55 am
Hi marcov,

I was kind of coming to that solution myself. So we have to be a little inefficient in the case where the compiler has already put the pointer in a register, by copying it to another one, so that we are sure to have the pointer in a register in the case where the compiler puts it on the stack.

Peter
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 10:53:30 am
@Jerome

Here is TGLZAVXVector4f.Combine2 reworked to be safe against the issues in the previous posts. Test this, and if it works there is quite a bit of code jockeying to do :)

It gave the right answer on my Win7 64-bit box.

Code: Pascal  [Select][+][-]
  1. function TGLZAVXVector4f.Combine2(constref V2: TGLZAVXVector4f; Constref F1: single; constref F2: Single): TGLZAVXVector4f;assembler;
  2. asm
  3. {$ifdef UNIX}
  4.   {$ifdef CPU64}
  5.      vmovups xmm0,[RDI]
  6.   {$else}
  7.      vmovups xmm0,[EDI]
  8.   {$endif}
  9. {$else}
  10.   {$ifdef CPU64}
  11.      vmovups xmm0,[RCX]
  12.   {$else}
  13.      vmovups xmm0,[ECX]
  14.   {$endif}
  15. {$endif}
  16. {$ifdef CPU64}
  17.  
  18.   mov RAX, V2
  19.   vmovups xmm1, [RAX]
  20.  
  21.   mov RAX, F1
  22.   movss xmm2, [RAX]
  23.  
  24.   mov RAX, F2
  25.   movss xmm3, [RAX]
  26.  
  27. {$else}
  28.  
  29.   mov EAX, V2
  30.   vmovups xmm1, [EAX]
  31.  
  32.   mov EAX, F1
  33.   movss xmm2, [EAX]
  34.  
  35.   mov EAX, F2
  36.   movss xmm3, [EAX]
  37. {$endif}
  38.  
  39.  
  40.   vshufps xmm2, xmm2, xmm2, $00 // replicate
  41.   vshufps xmm3, xmm3, xmm3, $00 // replicate
  42.  
  43.   vmulps xmm0, xmm0, xmm2  // Self * F1
  44.   vmulps xmm1, xmm1, xmm3  // V2 * F2
  45.  
  46.   vaddps xmm0, xmm0, xmm1  // (Self * F1) + (V2 * F2)
  47.  
  48.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  49.   vmovups [RESULT], xmm0
  50. end
  51. {$ifdef CPU64}
  52.  ['RAX',                                                        
  53. {$else}
  54.  ['EAX',                                                    
  55. {$endif} 'xmm0', 'xmm1','xmm2','xmm3'];
  56.  


Edit: made this 32/64-bit safe and nicer to the compiler; this is getting very messy very quickly! :(


Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 11:11:31 am
Ok, I did a quick google around, and it would seem there is no guaranteed way to force the compiler to put the parameters into registers.

Could someone confirm this please? It would be really good if I was wrong.

Peter
Title: Re: AVX and SSE support question
Post by: marcov on November 30, 2017, 11:27:44 am
Ok, I did a quick google around, and it would seem there is no guaranteed way to force the compiler to put the parameters into registers.

Could someone confirm this please? It would be really good if I was wrong.

Free Pascal has no "create my own calling convention" options. Newer architectures (x86_64 explicitly included) frown upon this anyway; doing so is something from 16-bit DOS times.

I do notice that you use RAX outside of cpu64 statements. If you have a bunch of routines like this, maybe a few macros like

{$ifdef unix}
{$ifdef CPU64}
  {$define asmfirstparam:=rdi}
{$else}
  {$define asmfirstparam:=edi}
{$endif}
{$else}
{$ifdef CPU64}
  {$define asmfirstparam:=rcx}
{$else}
  {$define asmfirstparam:=ecx}
{$endif}
{$endif}

would reduce the size. It is Delphi-incompatible anyway because of constref.
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 11:36:45 am
Hi marcov,

A few pages back this was suggested, but from my testing macros do not expand inside asm blocks  :(
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 30, 2017, 03:00:12 pm
Thanks guys. So I took a look at the generated .s file. We can see:

For the native Combine2 function:
Quote
.section .text.n_glzvectormath_new$_$tglznativevector4f_$__$$_combine2$crcf2601943,"x"
   .balign 16,0x90
.globl   GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943
GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943:
.Lc126:
# Temps allocated between rbp-16 and rbp+0
.seh_proc GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943
   # Register rbp allocated
.Ll257:
# [960] begin
   pushq   %rbp
.seh_pushreg %rbp
.Lc128:
.Lc129:
   movq   %rsp,%rbp
.Lc130:
   leaq   -48(%rsp),%rsp
.seh_stackalloc 48
# Var V2 located in register r8
# Var F1 located in register r9
# Var F2 located in register rcx
# Var $self located in register rax
# Var $result located in register rdx
# Temp -16,16 allocated
.seh_endprologue
   # Register rcx,rdx,r8,r9,rax allocated

and for the SSE version:

Quote
.section .text.n_glzvectormath_new$_$tglzssevector4f_$__$$_combine2$tglzssevector4f$single$single$$tglzssevector4f,"x"
   .balign 16,0x90
.globl   GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F:
.Lc403:
.seh_proc GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
   # Register rbp allocated
.Ll832:
# [2233] asm
   pushq   %rbp
.seh_pushreg %rbp
.Lc405:
.Lc406:
   movq   %rsp,%rbp
.Lc407:
   leaq   -32(%rsp),%rsp
.seh_stackalloc 32
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64
   # Register rax,rcx,rdx,r8,r9,r10,r11 allocated

and now for the SSE version with the nostackframe and register options:
function TGLZSSEVector4f.Combine2(constref V2: TGLZSSEVector4f;constref F1, F2: Single): TGLZSSEVector4f;assembler;nostackframe;register;
(same result without the register option)
Quote
.section .text.n_glzvectormath_new$_$tglzssevector4f_$__$$_combine2$tglzssevector4f$single$single$$tglzssevector4f,"x"
   .balign 16,0x90
.globl   GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F:
.Lc403:
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
# Var F2 located at rbp+48, size=OS_64
# [2233] asm
   # Register rax,rcx,rdx,r8,r9,r10,r11 allocated

Note that the stack-allocation size is the difference, and Self is in RAX in the native version; we can also see the difference in the allocated registers.

Just adding Begin..End around the Asm..End block solves the problem, and the F2 var is now correctly loaded into an XMM register.

My question: is it a compiler issue? Is it possible to increase the stack-alloc size manually? I tried with $M but without success.

And I confirm what Peter said:
Quote
macros do not expand inside asm blocks
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 03:11:24 pm
Jerome,

Interesting reading here http://wiki.lazarus.freepascal.org/Win64/AMD64_API

It would seem only 4 params are ever put in registers, Self being the first for objects/advanced records/classes.

I am missing something in your stack size statement
Quote
note we'll see the stack size is the problem
I don't quite understand that bit.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 30, 2017, 04:09:19 pm
Quote
note we'll see the stack size is the problem
don't quite understand that bit.

Hi, sorry for my bad English, or my misunderstanding  :-[
In the native func we can see .seh_stackalloc 48, and in the SSE version .seh_stackalloc 32.

So I've made another little test without an advanced record, like this (check the comments within):

Code: Pascal  [Select][+][-]
  1. Type
  2.  
  3.   { Tform1 }
  4.   TGLZVector4fType = packed array[0..3] of Single;
  5.   TGLZVector4f = packed record
  6.       case Byte of
  7.       0: (V: TGLZVector4fType);
  8.       1: (X, Y, Z, W: Single);
  9.       //2: (AsVector3f : TGLZVector3f);
  10.   End;
  11.  
  12.   Tform1 = Class(Tform)
  13.     Label1 : Tlabel;
  14.     Label2 : Tlabel;
  15.     Label3 : Tlabel;
  16.     Procedure Formcreate(Sender : Tobject);
  17.     Procedure Formshow(Sender : Tobject);
  18.   Private
  19.  
  20.   Public
  21.     vt1,vt2 : TGLZVector4f;
  22.     Fs1,Fs2 : Single;
  23.   End;
  24.  
  25.  
  26.  
  27. Var
  28.   Form1 : Tform1;
  29.  
  30. Implementation
  31.  
  32. {$R *.lfm}
  33.  
  34. Const cSSE_MASK_NO_W   : array [0..3] of UInt32 = ($FFFFFFFF, $FFFFFFFF, $FFFFFFFF, $00000000);
  35.  
  36. function CreateVector4f(Const aX,aY,aZ,aW : Single):TGLZVector4f;
  37. begin
  38.    Result.X := AX;
  39.    Result.Y := AY;
  40.    Result.Z := AZ;
  41.    Result.W := AW;
  42. end;
  43.  
  44. function Vector4fToString(aVector:TGLZVector4f) : String;
  45. begin
  46.    Result := '(X: '+FloattoStrF(aVector.X,fffixed,5,5)+
  47.             ' ,Y: '+FloattoStrF(aVector.Y,fffixed,5,5)+
  48.             ' ,Z: '+FloattoStrF(aVector.Z,fffixed,5,5)+
  49.             ' ,W: '+FloattoStrF(aVector.W,fffixed,5,5)+')';
  50. End;
  51.  
  52. function NativeCombine2(Const V1, V2: TGLZVector4f;Const  F1, F2: Single): TGLZVector4f;
  53. begin
  54.    Result.X:=( V1.X*F1) + (V2.X*F2);
  55.    Result.Y:=( V1.Y*F1) + (V2.Y*F2);
  56.    Result.Z:=( V1.Z*F1) + (V2.Z*F2);
  57.    Result.W:=0;
  58. end;
  59.  
  60. function SSECombine2(Const V1, V2: TGLZVector4f; Const F1,F2: Single): TGLZVector4f;assembler;
  61. asm
  62.  
  63.   movups xmm0,[V1]
  64.   movups xmm1, [V2]
  65.   movss xmm2, F1     //--->   unit1.pas(97,15) Warning: Check size of memory operand "movss: memory-operand-size is 32 bits, but expected [128 bits]"
  66.   //movlps xmm2, F1  //--->  unit1.pas(97,3) Error: Asm: [movlps xmmreg,xmmreg] invalid combination of opcode and operands
  67.  
  68.   //movlps xmm3, F2    //--> NO WARNING, NO ERROR , with MOVSS xmm2, F1. But wrong result
  69.   movss xmm3, F2   //--> unit1.pas(99,15) Warning: Check size of memory operand "movss: memory-operand-size is 32 bits, but expected [128 bits]"
  70.  
  71.   shufps xmm2, xmm2, $00 // replicate
  72.   shufps xmm3, xmm3, $00 // replicate
  73.  
  74.   mulps xmm0, xmm2  // Self * F1
  75.   mulps xmm1, xmm3  // V2 * F2
  76.  
  77.   addps xmm0, xmm1  // (Self * F1) + (V2 * F2)
  78.  
  79.   andps xmm0, [RIP+cSSE_MASK_NO_W]
  80.   movups [RESULT], xmm0
  81. end;
  82.  
  83. function AVXCombine2(Const V1, V2: TGLZVector4f;Const  F1, F2: Single): TGLZVector4f;assembler;
  84. asm
  85.   vmovups xmm0,[V1]
  86.   vmovups xmm1, [V2]
  87.   // vmovss xmm2, F1  //--> unit1.pas(118,3) Error: Asm: [vmovss xmmreg,xmmreg] invalid combination of opcode and operands
  88.   //vmovlps xmm2, F1    //--> unit1.pas(119,3) Error: Asm: [vmovlps xmmreg,xmmreg] invalid combination of opcode and operands
  89.   //vmovlps xmm3, F2 // Same error here, also with vmovss
  90.   //vmovups xmm3, F1   //--> unit1.pas(122,17) Warning: Check size of memory operand "vmovups: memory-operand-size is 32 bits, but expected [128 bits]"
  91.   //vmovups xmm3, F2   //--> Idem above
  92.   // ALL ABOVE GIVE WRONG RESULT
  93.  
  94.   movss xmm2, [F1]   //--> Using the SSE instruction gives a good result, but always with a warning
  95.   movss xmm3, [F2]
  96.  
  97.   vshufps xmm2, xmm2, xmm2, $00 // replicate
  98.   vshufps xmm3, xmm3, xmm3, $00 // replicate
  99.  
  100.   vmulps xmm0, xmm0, xmm2  // Self * F1
  101.   vmulps xmm1, xmm1, xmm3  // V2 * F2
  102.  
  103.   vaddps xmm0, xmm0, xmm1  // (Self * F1) + (V2 * F2)
  104.  
  105.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  106.   vmovups [RESULT], xmm0
  107. end;
  108.  
  109. { Tform1 }
  110.  
  111. Procedure Tform1.Formcreate(Sender : Tobject);
  112. Begin
  113.   vt1:= CreateVector4f(5.850,-15.480,8.512,1.5);
  114.   vt2:= CreateVector4f(1.558,6.512,4.525,1.0);
  115.   Fs1 := 1.5;
  116.   Fs2 := 5.5;
  117. End;
  118.  
  119. Procedure Tform1.Formshow(Sender : Tobject);
  120. Begin
  121.   Label1.Caption := Vector4fToString(NativeCombine2(Vt1,Vt2,Fs1, Fs2));
  122.   Label2.Caption := Vector4fToString(SSECombine2(Vt1,Vt2,Fs1, Fs2));
  123.   Label3.Caption := Vector4fToString(AVXCombine2(Vt1,Vt2,Fs1, Fs2));
  124. End;  
  125.  

Now, surrounding the asm..end block with begin..end solves the problem, but there are still warnings. Except for AVX, where VMOVSS always returns an error; the solution there is to use MOVSS.
So now there are only WARNINGS, and the results are correct.

So I think I'll just surround those 2 functions in the "advanced record" with begin..end, or simply make them inline without ASM code and just use operators: Result := (V1*F1) + (V2*F2); In any case those 2 functions are not really important and not used often in GLScene, so I'll see later. And I'll keep in mind that the maximum is 2 args plus "Self".
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 04:18:07 pm
unit1.pas(97,15) Warning: Check size of memory operand "movss: memory-operand-size is 32 bits, but expected [128 bits]

This warning is just wrong: there is nothing wrong with a movss moving only 32 bits. This is a really bad warning from the compiler and should just be {%H-}'d away when it is encountered.
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 06:13:07 pm
Hi Jerome,

I have been playing around with this some more in Linux. I can get the SSECombine2 down to just the following (from your little test code)

Code: Pascal  [Select][+][-]
  1. function SSECombine2(constref V1, V2: TGLZVector4f; Const F1,F2: Single): TGLZVector4f;assembler;
  2. asm
  3.   movups xmm2, [V1]
  4.   movups xmm3, [V2]
  5.  
  6.   shufps xmm0, xmm0, $00 // replicate  F1
  7.   shufps xmm1, xmm1, $00 // replicate  F2
  8.  
  9.   mulps xmm2, xmm0  // Self * F1
  10.   mulps xmm3, xmm1  // V2 * F2
  11.  
  12.   addps xmm2, xmm3  // (Self * F1) + (V2 * F2)
  13.  
  14.   andps xmm2, [RIP+cSSE_MASK_NO_W]
  15.   movups [RESULT], xmm2
  16. end;
  17.  

whereas the optimum for windows would be

Code: Pascal  [Select][+][-]
  1. function SSECombine2(constref V1, V2: TGLZVector4f; Const F1,F2: Single): TGLZVector4f;assembler;
  2. asm
  3.   movups xmm0, [V1]
  4.   movups xmm1, [V2]
  5.   movss xmm2, [F2{%H-}]
  6.  
  7.   shufps xmm3, xmm3, $00 // replicate
  8.   shufps xmm2, xmm2, $00 // replicate
  9.  
  10.   mulps xmm0, xmm3  // Self * F1
  11.   mulps xmm1, xmm2  // V2 * F2
  12.  
  13.   addps xmm0, xmm1  // (Self * F1) + (V2 * F2)
  14.  
  15.   andps xmm0, [RIP+cSSE_MASK_NO_W]
  16.   movups [RESULT], xmm0
  17. end;
  18.  


Might it not be better to have two include files for the implementation, one Linux-specific and one Windows-specific, each optimised for its respective ABI?

Peter
Title: Re: AVX and SSE support question
Post by: marcov on November 30, 2017, 07:43:40 pm
FWIW, while this thread was running I've been playing with SSE too (albeit in Delphi, since it's for work) over the past two weeks, so I thought I'd post some code.

It is more of an integer SSSE3 routine, rotating a block of 8x8 bytes with a loop around it for a bit of loop tiling.  See rot 8x8 here (http://www.stack.nl/~marcov/rot8x8.txt).

The related stackoverflow thread is at why does haswell+ suck? (https://stackoverflow.com/questions/47478010/sse2-8x8-byte-matrix-transpose-code-twice-as-slow-on-haswell-then-on-ivy-bridge)
Title: Re: AVX and SSE support question
Post by: CuriousKit on November 30, 2017, 09:46:48 pm
On the subject, I wrote a load of SSE, AVX and FMA routines primarily for graphics programming, for example taking an array of vectors and transforming them by a 4x4 matrix. Would any of those be useful for your collection or for Lazarus in general? There's still some room for improvement though, since I don't take advantage of memory alignment.

I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on November 30, 2017, 10:50:35 pm
Hi to all,

first
You load the stack slot as if it were the value. You should instead load the pointer that is ON the stack and dereference it.

so

Code: Pascal  [Select][+][-]
  1. mov rax, [rbp+48]  // or whatever free register
  2. movss xmm2, [rax]

I tried
Code: Pascal  [Select][+][-]
  1.  
  2.   mov r12, [rbp+48] //GLZVectorMath_NEW.pas(2264,19) Warning: Use of +offset(%ebp) for parameters invalid here  movss xmm3,r12  
  3.   movss xmm3, r12 // GLZVectorMath_NEW.pas(2265,3) Error: Asm: [movss xmmreg,reg64] invalid combination of opcode and operands
  4.  
It doesn't work (I've also tried with rax).

and

Code: Pascal  [Select][+][-]
  1.  
  2.   movss xmm3,[RBP+48]  //GLZVectorMath_NEW.pas(2265,21) Warning: Use of +offset(%ebp) for parameters invalid here
  3.  
It compiles, but the result is wrong.

Hi Jerome,

I have been playing around with this some more in Linux. I can get the SSECombine2 down to just the following (from your little test code)

whereas the optimum for windows would be

Code: Pascal  [Select][+][-]
  1. function SSECombine2(constref V1, V2: TGLZVector4f; Const F1,F2: Single): TGLZVector4f;assembler;
  2. asm
  3.   movups xmm0, [V1]
  4.   movups xmm1, [V2]
  5.   movss xmm2, [F2{%H-}]
  6.  
  7.   shufps xmm3, xmm3, $00 // replicate
  8.   shufps xmm2, xmm2, $00 // replicate
  9.  
  10.   mulps xmm0, xmm3  // Self * F1
  11.   mulps xmm1, xmm2  // V2 * F2
  12.  
  13.   addps xmm0, xmm1  // (Self * F1) + (V2 * F2)
  14.  
  15.   andps xmm0, [RIP+cSSE_MASK_NO_W]
  16.   movups [RESULT], xmm0
  17. end;
  18.  

It works, but not in the advanced record:

Quote
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64

Actually the only thing that solves the problem is surrounding the Asm..End block with a Begin..End  :'(

Code: Pascal  [Select][+][-]
  1. function TGLZSSEVector4f.Combine2(constref V2: TGLZSSEVector4f;constref F1, F2: Single): TGLZSSEVector4f;
  2. Begin
  3.   asm
  4.   {$ifdef UNIX}
  5.     {$ifdef CPU64}
  6.        movups xmm0,[RDI]
  7.     {$else}
  8.        movups xmm0,[EDI]
  9.     {$endif}
  10.   {$else}
  11.     {$ifdef CPU64}
  12.        movups xmm0,[RCX]
  13.     {$else}
  14.        movups xmm0,[EAX]
  15.     {$endif}
  16.   {$endif}
  17.     movups xmm1, [V2]
  18.  
  19.     movlps xmm2,[F1]
  20.     movlps xmm3,[F2]
  21.  
  22.     shufps xmm2, xmm2, $00 // replicate
  23.     shufps xmm3, xmm3, $00 // replicate
  24.  
  25.     mulps xmm0, xmm2  // Self * F1
  26.     mulps xmm1, xmm3  // V2 * F2
  27.  
  28.     addps xmm0, xmm1  // (Self * F1) + (V2 * F2)
  29.     {$IFDEF CPU64}
  30.       andps xmm0, [RIP+cSSE_MASK_NO_W]
  31.     {$ELSE}
  32.       andps xmm0, [cSSE_MASK_NO_W]
  33.     {$ENDIF}
  34.     movups [RESULT], xmm0 // If I remember my last test, this line is not needed in 32-bit, because the result is stored in xmm0
  35.   end;
  36. End;  
  37.  

Might it not be better to have two inc files for the implementation which are linux and win specific and can be optimized according to their respective abis?

Yes i think too, i'll probably make 2 inc in the final unit

FWIW, while this thread was running, I've been playing with SSE (albeit in Delphi, since for work) too in the past two weeks, so I thought I post some code.

It is more of an integer SSSE3 routine, rotating a block of 8x8 bytes with a loop around it for a bit of loop tiling.  See rot 8x8 here (http://www.stack.nl/~marcov/rot8x8.txt).

The related stackoverflow thread is at why does haswell+ suck? (https://stackoverflow.com/questions/47478010/sse2-8x8-byte-matrix-transpose-code-twice-as-slow-on-haswell-then-on-ivy-bridge')

Very interesting, but I don't understand it all yet  :-[
It would be very interesting to make some tests with bitmaps.

On the subject, I wrote a load of SSE, AVX and FMA routines primarily for graphics programming, namely taking an array of vectors and transforming them by a 4x4 matrix, for example.  Would any of those be useful for your collection or for Lazarus in general?  There's still some room for improvement though, since I don't take advantage of memory alignment.

I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets.

Yes, it's welcome; your code could help a lot (and not only me, I'm sure). Perhaps, if you agree, I'll try to implement your functions in GLScene and in my own project (a new GLScene, with its own fast bitmap management, which will support OpenGL Core and Vulkan  8) )

Cheers

Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 11:04:44 pm
Quote
It work but not in the  Advanced Record :

Quote

    # Var V2 located in register r8
    # Var F1 located in register r9
    # Var $self located in register rcx
    # Var $result located in register rdx
    .seh_endprologue
    # Var F2 located at rbp+48, size=OS_64


Actually the only thing that solves the problem is surrounding the Asm..End block with a Begin..End  :'(

Ok about this and the difference between the two examples.

F2 in that case was declared constref, so there was a 64-bit pointer on the stack that needed pulling into a register for register-indirect memory addressing.

In the code above we declare it as const only; this places the value itself on the stack, which only needs addressing by stack+offset, removing the need to load a pointer into a register.

This is only possible because I was optimising those routines per platform.

Peter

Edit I will see if I can make that one routine work in the large test app!
Title: Re: AVX and SSE support question
Post by: dicepd on November 30, 2017, 11:40:13 pm
And here are the two variants that work for TGLZAVXVector4f.Combine2.

You can paste them into the larger test app, overwriting the existing routine.

You will have to take it from me that the Unix variant works, as I keep this code in one shared dir from which I can compile both Linux and Win64.

Code: Pascal  [Select][+][-]
  1. {$ifdef UNIX}
  2.  
  3. function TGLZAVXVector4f.Combine2(constref V2: TGLZAVXVector4f; Const F1, F2: Single): TGLZAVXVector4f;assembler;
  4. asm
  5.   {$ifdef CPU64}
  6.      vmovups xmm2,[RDI]
  7.   {$else}
  8.      vmovups xmm2,[EDI]
  9.   {$endif}
  10.  
  11.   vmovups xmm3, [V2]
  12.  
  13.   vshufps xmm0, xmm0, xmm0, $00 // replicate
  14.   vshufps xmm1, xmm1, xmm1, $00 // replicate
  15.  
  16.   vmulps xmm2, xmm2, xmm0  // Self * F1
  17.   vmulps xmm3, xmm3, xmm1  // V2 * F2
  18.  
  19.   vaddps xmm2, xmm2, xmm3  // (Self * F1) + (V2 * F2)
  20.  
  21.   vandps xmm2, xmm2, [RIP+cSSE_MASK_NO_W]
  22.   vmovups [RESULT], xmm2
  23. end;
  24.  
  25. {$else}
  26.  
  27. function TGLZAVXVector4f.Combine2(constref V2: TGLZAVXVector4f; Const F1, F2: Single): TGLZAVXVector4f;assembler;
  28. asm
  29.   {$ifdef CPU64}
  30.      vmovups xmm0,[RCX]
  31.   {$else}
  32.      vmovups xmm0,[ECX]
  33.   {$endif}
  34.  
  35.   vmovups xmm1, [V2]
  36.   movss xmm2, [F2]
  37.  
  38.   vshufps xmm2, xmm2, xmm2, $00 // replicate F2
  39.   vshufps xmm3, xmm3, xmm3, $00 // replicate F1 already here
  40.  
  41.   vmulps xmm0, xmm0, xmm3  // Self * F1
  42.   vmulps xmm1, xmm1, xmm2  // V2 * F2
  43.  
  44.   vaddps xmm0, xmm0, xmm1  // (Self * F1) + (V2 * F2)
  45.  
  46.   vandps xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  47.   vmovups [RESULT], xmm0
  48. end;
  49. {$endif}
  50.  
Title: Re: AVX and SSE support question
Post by: dicepd on December 01, 2017, 12:51:45 am
I took the plunge and did the lot, here is the GLVectorMath.pas that works in both windows and linux

Peter
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 01, 2017, 08:34:18 am
...
On the subject, I wrote a load of SSE, AVX and FMA routines primarily for graphics programming, namely taking an array of vectors and transforming them by a 4x4 matrix, for example.  Would any of those be useful for your collection or for Lazarus in general?  There's still some room for improvement though, since I don't take advantage of memory alignment.

I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets.

Yes, your code is welcome and could help a lot (and not only me, I'm sure). Perhaps, if you agree, I can try to implement your functions in GLScene and in my own project (a new GLScene, with its own fast bitmap management, which will support OpenGL Core and Vulkan  8) )

Cheers
I'll see what I can do.

Regarding parameters... from my experience, all record types, including 4-vectors, are passed by reference, so you have to dereference the pointer.  The only time it might possibly be passed by value is with Microsoft's "vectorcall" calling convention, which Free Pascal doesn't support to my knowledge (feature request?).

At the moment, I'm working on an older laptop that doesn't have FMA or AVX2 support, so I can't test routines that use them just yet, although I'll see if I can get Intel's tool that emulates said features.  Looking at the Combine2 code, one could squeeze in an FMA instruction to replace the second multiplication and the final addition, but I don't think it will actually provide any kind of speed gain because of how the uops are distributed: the two multiplications run in parallel and their results are then combined, taking 2 cycles at the absolute minimum (the multiplications might take more than 1 cycle), whereas the FMA has to wait on the first multiplication before it can proceed, also taking 2 cycles minimum.  It might only provide a minor speed boost if the CPU is heavily bottlenecked.
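
For readers without the full source in front of them, the arithmetic Combine2 performs is simply (Self * F1) + (V2 * F2) per component, with W masked off at the end. A scalar C sketch of just that semantics (the W-masking step mirrors the cSSE_MASK_NO_W constant used in the asm above; the function name is made up and the model is illustrative only):

```c
#include <stddef.h>

/* Scalar model of the Combine2 routine discussed in this thread:
   out = self*f1 + v2*f2 per component, with W forced to 0
   (the cSSE_MASK_NO_W step). Names are illustrative. */
static void combine2_model(const float self_v[4], const float v2[4],
                           float f1, float f2, float out[4])
{
    for (size_t i = 0; i < 4; i++)
        out[i] = self_v[i] * f1 + v2[i] * f2; /* an FMA could fuse the 2nd mul and the add */
    out[3] = 0.0f; /* mask off W */
}
```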

Before I submit my code though, I'm going to see if I can improve on the memory management a little bit, since I do unaligned reads everywhere instead of attempting to reconfigure it for the faster aligned reads (although I'm not sure if there is any discernible difference in performance on modern processors).
Title: Re: AVX and SSE support question
Post by: dicepd on December 01, 2017, 10:01:57 am
Ok I have taken the time to setup a Win32 VM and a Linux 32 VM.

Firstly, Self is in EAX for both 32-bit platforms.

Next, it looks like a few more routines will require Unix vs Windows calling-convention rework.  I am looking at AVXLerp atm.  SSE Lerp seems to work ok, but the AVX version suffers the same symptoms as AVXCombine.

Just exploring the issues atm.

Peter

Title: Re: AVX and SSE support question
Post by: CuriousKit on December 01, 2017, 10:50:11 am
On 32-bit, Free Pascal and Delphi use their own calling convention where the first three integer-sized parameters are passed into EAX, EDX and ECX in that order, with Self being a hidden first parameter if it is required.

If the return type is an integer or pointer type, it is returned in EAX, similar to 64-bit. One big difference though is when dealing with a more complex return type like a record.  In 32-bit, it's passed as an extra parameter by reference on the right, whereas on 64-bit, it's passed by reference on the left.  Take for example the following function:

Code: Pascal  [Select][+][-]
  1. function VectorAdd(Vector1, Vector2: TVector4): TVector4;

(Assume that TVector4 is just a packed record with components X, Y, Z and W, all of type Single)

- On 32-bit, @Vector1 is EAX, @Vector2 is EDX and @Result is ECX.
- On Win64, @Result is RCX, @Vector1 is RDX and @Vector2 is R8.

I'm not sure about Linux, and I'm not sure yet what happens if Self is required as well - actually, it looks like that was answered at the top of this page!
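
For comparison, the same shape of function can be written in C, where the compiler hides the same ABI detail: on Win64 the 16-byte result travels through a hidden pointer, while the SysV x86-64 ABI returns an all-float 16-byte struct in the XMM0:XMM1 register pair. A sketch with hypothetical names:

```c
typedef struct { float x, y, z, w; } Vec4;

/* C analogue of "function VectorAdd(Vector1, Vector2: TVector4): TVector4".
   The caller never sees whether the result is written through a hidden
   pointer (Win64) or returned in XMM0:XMM1 (SysV x86-64); that is
   exactly the difference being chased in the asm discussion here. */
static Vec4 vec4_add(Vec4 a, Vec4 b)
{
    Vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```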
Title: Re: AVX and SSE support question
Post by: dicepd on December 01, 2017, 11:37:55 am
What I have found so far is that if the result variable for a size greater than 32 bit is on the stack in 32bit then the stack has a 32 bit pointer to the result variable. This code works in this case

Code: Pascal  [Select][+][-]
  1.   {$ifdef cpu32}
  2.     mov ecx, Result
  3.     vmovups [ecx], xmm0
  4.   {$else}
  5.     vmovups [Result], xmm0
  6.   {$endif}      
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 01, 2017, 11:46:08 am
That adds up.  If there are no free registers (i.e. the 4th parameter and beyond), then they get pushed to the stack.
Title: Re: AVX and SSE support question
Post by: dicepd on December 01, 2017, 02:55:24 pm
Ok, getting near to beer o'clock, so here is the latest; it works in win32, win64, linux32 and linux64.

I have cleaned up the starting defs but not the return defs; 32-bit was not that bad and has not added too much crud, not as much as I thought it might.

The only numbers that are wrong are those that have always been wrong, i.e. Perpendicular; it looks like it needs negating, but I will leave it to you, Jerome, to make a call on that.

Peter

Edit put the right file there :-[
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 01, 2017, 04:41:40 pm
Ok, getting near to beer o'clock, so here is the latest; it works in win32, win64, linux32 and linux64.

I have cleaned up the starting defs but not the return defs; 32-bit was not that bad and has not added too much crud, not as much as I thought it might.

The only numbers that are wrong are those that have always been wrong, i.e. Perpendicular; it looks like it needs negating, but I will leave it to you, Jerome, to make a call on that.

Peter

Edit put the right file there :-[

Hi Peter, wow!! You did a lot of work  8-) I'll take a look tonight and this weekend; now it's time for me to get back to my job  :'( and until the Christmas holidays I have a lot of work. Next week I have more than 350 people at the restaurant in 2 days  :o

Good Beer !  :D
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 02, 2017, 11:37:47 am
Hi to all,

Quote
The only numbers that are wrong are those that have always been wrong i.e. Perpendicular, looks like it needs negating but I will leave that to you Jerome to make a call on that.
Corrected: a simple operand swap in the last operation (subps).

As Peter suggested, I've split the AVX/SSE and native code into 3 include files, and I've made a very simple little benchmark.
The test has several build modes with different compiling options. The results are surprising.
All build modes have: -O3 (level 3 optimizations)

List of the compiler options for each build mode:

The average results for all tests on my PC, in seconds:
       |  RELEASE  |  RELEASE_SSE  |  RELEASE_SSE3  |  RELEASE_SSE4  |  RELEASE_AVX  |  RELEASE_AVX2
NATIVE |  4,769    |  4,500        |  4,575         |  4,563         |  4,793        |  4,759
SSE    |  2,042    |  2,038        |  2,038         |  2,038         |  1,997        |  1,973
SSE 3  |  2,009    |  1,998        |  1,999         |  2,005         |  1,992        |  1,977
SSE 4  |  1,991    |  1,982        |  1,987         |  1,979         |  1,965        |  1,961
AVX    |  2,308    |  2,298        |  2,293         |  2,291         |  2,200        |  2,181

As we can see, the best result is SSE4 built with RELEASE_AVX2. The AVX functions are not as fast as I thought (but with a Matrix4 it will be different).
For SSE, the best compiler option depends on the SSE version used, but the AVX2 compiling options give the best results.
And as we can see, using AVX with vectors is not really beneficial.
But globally, with the asm functions the speed gain is around 50%, which is already very good :D

NB : In the project, take a look at the Length and Distance results with the native functions; it's surprising (in my case it's the best result).

See the attached project for the test. Choose a build mode and set the right DEFINE option at the top of the GLZVectorMath unit.

EDIT : I forgot to add nostackframe; register; after assembler; for AVX, so the results will be better with it (don't do this for Length, Distance, Norm, AngleCosine and DotProduct; it breaks those functions  >:D )  :-[
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 12:32:05 pm
First cross-platform issue.

Quote
vectormath_sse_imp.inc(9,3) Error: This function's result location cannot be encoded directly in a single operand when "nostackframe" is used
vectormath_sse_imp.inc(9,3) Error: Asm: [movups reg64,xmmreg] invalid combination of opcode and operands


Code: Pascal  [Select][+][-]
  1.   MOVUPS XMM0,[A]
  2.   MOVUPS XMM1,[B]
  3.   SUBPS  XMM0,XMM1
  4.   MOVUPS [RESULT], XMM0  <-- It does not like this line  
  5.  

Works fine in win64 will try to compare some asm output and see what it is trying to do.
Title: Re: AVX and SSE support question
Post by: marcov on December 02, 2017, 12:47:00 pm
Code: Pascal  [Select][+][-]
  1.   MOVUPS XMM0,[A]
  2.   MOVUPS XMM1,[B]
  3.   SUBPS  XMM0,XMM1
  4.   MOVUPS [RESULT], XMM0  <-- It does not like this line  
  5.  

Works fine in win64 will try to compare some asm output and see what it is trying to do.

Logical, since if Result is NOT in a register it would amount to

Code: Pascal  [Select][+][-]
  1. movups [[ebp][8]], xmm0

which is not valid addressing; that form only works when the variable is in a register.
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 01:00:13 pm
Yes, it would seem that Result is on the stack; looking at the asm on both Win and Linux, this code would be more efficient than removing nostackframe.

Code: Pascal  [Select][+][-]
  1. MOV [E/R]AX, RESULT
  2. MOVUPS  [[E/R]AX], XMM0

Edit
Nope, the above is totally wrong: it has allocated 4 singles on the stack, not a pointer to the 4 singles.
It seems that Win64 has better handling of nostackframe and register, as the return is a pointer, not a value.

Is there any way to hint to the compiler to use a pointer as the return rather than a stack value?

Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 02, 2017, 01:47:54 pm
Hi, it's totally normal that nostackframe breaks the function: with nostackframe the result is in xmm0, so for example in the Length function just removing the Movss [Result],xmm0 is enough.
I'll post another little update later in the weekend  ::)

Code: Pascal  [Select][+][-]
  1. function TGLZVector4f.Length:Single;assembler; nostackframe; register;
  2. //Result := Sqrt((Self.X * Self.X) +(Self.Y * Self.Y) +(Self.Z * Self.Z));
  3. Asm
  4.  
  5.  
  6.   {$IFDEF USE_ASM_SSE_4}
  7.     {$ifdef CPU64}
  8.       {$ifdef UNIX}
  9.          movups xmm0,[RDI]
  10.       {$else}
  11.         movups xmm0,[RCX]
  12.       {$endif}
  13.     {$else}
  14.          movups xmm0,[EAX]
  15.     {$endif}
  16.     dpps xmm0, xmm0, $FF;
  17.     sqrtss xmm0, xmm0
  18.   {$ELSE}
  19.  
  20.     //we need to remove W component
  21.     //andps xmm0, [RIP+cSSE_MASK_NO_W]
  22.  
  23.     {$IFDEF USE_ASM_SSE_3}
  24.       {$ifdef CPU64}
  25.         {$ifdef UNIX}
  26.            movups xmm0,[RDI]
  27.         {$else}
  28.           movups xmm0,[RCX]
  29.         {$endif}
  30.       {$else}
  31.            movups xmm0,[EAX]
  32.       {$endif}
  33.       mulps   xmm0, xmm0
  34.       haddps xmm0, xmm0
  35.       haddps xmm0, xmm0
  36.       sqrtss xmm0, xmm0
  37.     {$ELSE}
  38.     {$ifdef CPU64}
  39.       {$ifdef UNIX}
  40.          movups xmm1,[RDI]
  41.       {$else}
  42.         movups xmm1,[RCX]
  43.       {$endif}
  44.     {$else}
  45.          movups xmm1,[EAX]
  46.     {$endif}
  47.       mulps   xmm1, xmm1
  48.       movhlps xmm0, xmm1
  49.       addss xmm0, xmm1
  50.       shufps xmm1, xmm1, $55
  51.       addss xmm0, xmm1
  52.       {.$IFDEF USE_ASM_SIMD_HIGHPRECISION}
  53.       // High Precision
  54.       sqrtss xmm0, xmm0
  55.       {.$ELSE
  56.           // Low precision - note : may be very inaccurate
  57.           rsqrtss xmm0, xmm0
  58.           rcpss xmm0, xmm0
  59.       .$ENDIF}
  60.     {$ENDIF}
  61.   {$ENDIF}
  62. end;  

now this function is almost 3 times faster than the native function  8-)

Now i must check all others functions
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 02:11:24 pm
Just tested this with great result.

native pascal
operator TGLZVector4f.+ timing is 514ms

nostackframe commented out and MOVUPS [RESULT], XMM0 in the code
operator TGLZVector4f.+ timing is 200ms

adding nostackframe and removing MOVUPS [RESULT], XMM0
operator TGLZVector4f.+ timing is 65ms

 ;D

Ok, more testing, and this time not good news  :(  It worried me that we might be getting correct results only because registers happened to be in a certain state from the last calc, so I added

Code: Pascal  [Select][+][-]
  1.   StartTimer;
  2.   For cnt:= 1 to 20000000 do begin v3 := v1 + v2; v4 := v1 + v1; end;
  3.   StopTimer;
  4.   With StringGrid1 do
  5.   begin
  6.     Cells[1,1] := v3.ToString + v4.ToString ;
  7.     Cells[2,1] := WriteTimer;
  8.   End;                          

And then garbage came out in v3 and v4,

so removing the result line is not an option.
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 03:08:04 pm
Both 32-bit Linux and 32-bit Windows need the return value set explicitly,

so
Code: Pascal  [Select][+][-]
  1. {$ifdef CPU32}
  2. MOVUPS [RESULT], XMM0
  3. {$endif}
  4.  

is required when nostackframe is used.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 02, 2017, 04:03:14 pm
Quote
Ok, more testing, and this time not good news  :(  It worried me that we might be getting correct results only because registers happened to be in a certain state from the last calc, so I added

Code: Pascal  [Select][+][-]
  1.   StartTimer;
  2.   For cnt:= 1 to 20000000 do begin v3 := v1 + v2; v4 := v1 + v1; end;
  3.   StopTimer;
  4.   With StringGrid1 do
  5.   begin
  6.     Cells[1,1] := v3.ToString + v4.ToString ;
  7.     Cells[2,1] := WriteTimer;
  8.   End;

For me this gives the right results  :o

So if I'm understanding correctly: on Win64 other than Windows 10, and on 32-bit, we need to keep MOVUPS [RESULT], XMM0.

Another thing: adding nostackframe also breaks Combine2 and Combine3, of course.
The solution is adding at the top
Code: Pascal  [Select][+][-]
  1. push rbp
  2. mov rbp, rsp

and at the bottom
Code: Pascal  [Select][+][-]
  1. pop rbp

which makes some sense, given what happens behind the scenes.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 02, 2017, 04:19:57 pm
Just a little question, Peter: have you deleted the Mov [Result], xmm0 from class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; ?

Because what I said about this is only for Distance, Length, Norm, DotProduct and AngleCosine, where we don't need the "Mov [Result], xmm0"; for the others it is needed.

Code: Pascal  [Select][+][-]
  1. class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; assembler; nostackframe; register;
  2. asm
  3.   MOVUPS XMM0,[A]
  4.   MOVUPS XMM1,[B]
  5.   ADDPS  XMM0,XMM1
  6.   MOVUPS [RESULT], XMM0
  7. end;


Quote
# Var A located in register rdx
# Var B located in register r8
# Var $result located in register rcx

and for length for example

Quote
# Var $self located in register rcx
# Var $result located in register xmm0

EDIT : Sorry for this misunderstanding  :-[
REEDIT : with
Code: Pascal  [Select][+][-]
  1. For cnt:= 1 to 20000000 do begin v3 := v1 + v2; v4 := v1 + v1; end;
for me the timing is around 0.0450 sec
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 04:57:31 pm
Jerome, I have not got any further than playing around with operator TGLZVector4f.+ in all combinations.

I really would like to get to the bottom of why we can't use nostackframe on unix64. It seems very strange that linux32, win32 and win64 can all use nostackframe, but not linux64.

I am going to try your push rbp to see if that makes any difference.

As of now:

Win7 64 works fine with nostackframe as long as MOVUPS [RESULT], XMM0 is there, but not without it.

The 32-bit OSes behave the same as Win7 64.

So only Linux64 cannot use nostackframe, because it allocates 128 bits on the stack for the result, and it would seem you are not allowed to use the stack with nostackframe.

So I will try to find a solution to this that will allow linux64 to use nostackframe.





Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 02, 2017, 06:14:04 pm
Add -a<X> compiler options to the project and check the generated .S file it will give you some clue on about register and stack are use with linux64
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 08:02:02 pm
Add -a<X> compiler options to the project and check the generated .S file it will give you some clue on about register and stack are use with linux64

That's what I am doing, and linux64 insists on putting the result on the stack by value (a full copy of an xmm register allocated on the stack).
However, if you use nostackframe you can't access the stack to return the value. It looks more and more to me like a bug / missing feature in the Linux 64 compiler.
Quote
# Temp -16,16 allocated
# Var $result located at rbp-16, size=OS_128
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated

I have tried MOVUPS [RBP-16], XMM0. This at least gets past the compiler errors,
but it seems to upset the stack in some way, as I trash the button click stack and also return the wrong numbers: x and y are correct, z and w are some strange values that relate to no value in any xmm register.

The other three variants all use something like
# Var $result located in register rcx
so you can write the result through register addressing.

Title: Re: AVX and SSE support question
Post by: CuriousKit on December 02, 2017, 09:17:07 pm
I've started to have a look at this unit and the benchmark program myself, since I've worked on similar stuff in the past.

The first thing I'll say is that it's very hard to keep the assembly language generalised (and impossible if you venture away from x86 machine code), since certain optimisations can only be made if you know the calling convention in question.  For example, under Win64 and Linux64, floating-point parameters are passed in XMM registers, hence it's a waste of cycles to simply move them around.

Another trick when you have a Boolean return type... you can guarantee on 32-bit and 64-bit Windows (and I think on Linux too) that the Result is stored in AL.  So for example, if you want to check the signs of an XMM register, you can do
Code: Pascal  [Select][+][-]
  1. MOVMSKPS  EAX,  XMM0
  2. TEST      EAX,  EAX
  3. SETNZ     AL
This cuts down on an expensive conditional jump.

NOTE: Boolean is only 1 byte in Pascal, so AL does not need to be zero-extended or EAX zeroed beforehand - if you need a WordBool or LongBool result, then you have to add MOVZX AX, AL or MOVZX EAX, AL respectively to the end of the code.
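
For readers without an instruction reference handy, here is a portable C model of what MOVMSKPS computes: the four sign bits packed into the low bits of a general-purpose register. The helper name is made up; only the semantics are meant to match:

```c
#include <stdint.h>
#include <string.h>

/* Model of MOVMSKPS reg32, xmm: bit i of the result is the sign bit
   of float i. Note -0.0f counts as negative, just like the instruction,
   because only the IEEE-754 sign bit is inspected. */
static unsigned movmskps_model(const float v[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits); /* raw bit pattern of the float */
        mask |= (bits >> 31) << i;
    }
    return mask;
}
```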

As was done by BeanzMaster, it is a good idea to split the SSE and AVX assembly into different include files so it's easier to debug and extend (e.g. if you want to add support for a completely different CPU family that is incompatible with Intel/AMD).
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 02, 2017, 10:09:19 pm
Regarding the problems with the stack in Linux 64 - I can't test it myself, but I would put a breakpoint on the call into the function in question, and study the disassembly to see how the parameters are being configured in the registers and the stack.  That should offer you some clues.
Title: Re: AVX and SSE support question
Post by: dicepd on December 02, 2017, 11:10:24 pm
At last a breakthrough the following works for linux64 with others working as before

Code: Pascal  [Select][+][-]
  1. class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; register; assembler; nostackframe;
  2. asm
  3.   MOVUPS XMM0,[A]
  4.   MOVUPS XMM1,[B]
  5.   ADDPS  XMM0,XMM1
  6.   {$ifdef UNIX}
  7.     {$ifdef CPU64}
  8.     MOVHLPS XMM1, XMM0
  9.     {$else}
  10.     MOVUPS [RESULT], XMM0
  11.     {$endif}
  12.   {$else}
  13.     MOVUPS [RESULT], XMM0
  14.   {$endif}
  15.  
  16. end;
  17.  

It would seem the return convention is x,y in the low half of xmm0 and z,w in the low half of xmm1. Not that I could find that documented anywhere; it took a "duh, why did I not see that before" moment on the following assembler that worked. I initially thought that lines 2 and 3 were just part of the setup and teardown of the frame.
 
 
Code: Pascal  [Select][+][-]
  1. movups %xmm0,-0x10(%rbp)
  2. movq   -0x10(%rbp),%xmm0
  3. movq   -0x8(%rbp),%xmm1

So the Linux compiler pushes the result to the stack, then writes the low 2 singles back to xmm0 and the high two singles to xmm1; the same thing is done more efficiently with MOVHLPS XMM1, XMM0.

Adding nostackframe and using MOVHLPS XMM1, XMM0,
operator TGLZVector4f.+ timing is now 69ms,
so that's nearly a three-times speedup over the stack-frame version and 7.44 times quicker than native Pascal.
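
For anyone puzzling over the MOVHLPS trick: it copies the high 64 bits (two singles) of the source register into the low 64 bits of the destination, which is exactly the Z/W half that this return convention wants in XMM1. A portable C model of the data movement (names made up):

```c
/* Model of MOVHLPS XMM1, XMM0: the high two singles of src become
   the low two singles of dst; dst's own high half is untouched. */
static void movhlps_model(float dst[4], const float src[4])
{
    dst[0] = src[2]; /* Z moves to the low half */
    dst[1] = src[3]; /* W follows */
    /* dst[2], dst[3] are left as they were */
}
```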
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 03, 2017, 12:08:25 am
That's a little confusing with Linux, because the way it's behaving implies that it's splitting the 128-bit value in two, classing the lower half as SSE and the upper half as SSEUP (see pages 15-17 here: http://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf ), but then converting SSEUP to SSE because it thinks it isn't preceded by an SSE argument (which it is... the lower two floats).  Maybe my interpretation is wrong, but it shouldn't need to split it across 2 registers like that.  Can someone with more experience of the Linux ABI shed some light on that?

Request to BeanzMaster - as well as timing checks, can you also implement some verification in your benchmark program? I have a feeling that some functions return incorrect results. Failing that, I can possibly design something a little more in-depth once I've finished my current task.
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 01:00:45 am
It is possible they do that for safety; on 64-bit machines memory alignment is normally 8 bytes, whereas SSE raises lots of exceptions for bad memory alignment on xmm -> mem transfers.
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 03, 2017, 01:04:28 am
Speaking of that, what happens if you specify {$ALIGN 16} for your vector type?
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 01:19:15 am
I'll have a look, we already had to do that for the consts as we hit bad alignments, it may change the behaviour of the calls, interesting question ;D
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 03, 2017, 01:23:18 am
I'll have a look, we already had to do that for the consts as we hit bad alignments, it may change the behaviour of the calls, interesting question ;D
It's a thought because if such memory alignment is forced (such vectors should be aligned that way anyway, because they're 16 bytes in length overall), you can potentially replace your MOVUPS calls with MOVAPS calls for an extra speed gain.
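
The alignment requirement itself is easy to state outside of asm: MOVAPS faults unless the address of its memory operand is a multiple of 16. A small C sketch of the check, and of requesting such storage (aligned_alloc is C11; the helper names are made up):

```c
#include <stdint.h>
#include <stdlib.h>

/* The property MOVAPS demands of its memory operand. */
static int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Heap storage for a 4-single vector that an aligned load could use.
   C11 aligned_alloc requires the size to be a multiple of the alignment;
   4 * sizeof(float) == 16 satisfies that. */
static float *alloc_vec4_aligned(void)
{
    return aligned_alloc(16, 4 * sizeof(float));
}
```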
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 01:29:46 am
Ok setting
Code: Pascal  [Select][+][-]
  1. {$CODEALIGN RECORDMIN=16}  

certainly breaks some things, so may be promising.

Title: Re: AVX and SSE support question
Post by: CuriousKit on December 03, 2017, 01:41:07 am
Ummm, check that it isn't aligning individual fields (I'm not sure if 'packed' overrides it anyway, in which case it will coincidentally work, since it will align the first field).

Actually, that is exactly what $ALIGN does.  Hmmm... what directive forces memory alignment for a particular type?
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 03, 2017, 02:20:38 am
I might have found the answer.  This topic might be of interest to you - it seems that there's a somewhat undocumented feature for records in Pascal that controls memory alignment: https://forum.lazarus.freepascal.org/index.php/topic,27400.msg169251.html#msg169251
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 12:33:14 pm
I have done some tests, and whatever I do I cannot get xmm transfers to align. Unix 64 will pass Const vectors in two xmm registers and return in xmm0/xmm1.

However in my attempts to get things aligned I found something very interesting. I went back to the small test app to play with just a single function.

One idea I had was the following. The test for whether records were aligned was to change
Code: Pascal  [Select][+][-]
  1.   MOVUPS XMM2, [V1]    // get in one hit V1
  2. // to
  3. MOVAPS XMM2, [V1]    // get in one hit V1
  4.  
If the test seg-faulted, then the record was not aligned; inspecting the rdi register in this case always showed 0x0????[8|4], where we need the last digit to be 0 for an aligned transfer.

So let's try making an aligned array; arrays have been around forever, so they must align.

Align flags used

Code: Pascal  [Select][+][-]
  1. {$CODEALIGN CONSTMIN=16}
  2. {$CODEALIGN VARMIN=16}
  3. {$CODEALIGN RECORDMAX=4}
  4. {$CODEALIGN LOCALMIN=16}
  5.  
  6. {$define USE_ARRAY}
  7. {.$define USE_RECORD_V}
  8.  
  9.   {$ifdef USE_ARRAY}
  10.   TGLZVector4f = packed array[0..3] of Single;
  11.   {$else}
  12.   TGLZVector4fType = packed array[0..3] of Single;
  13.   TGLZVector4f = record
  14.     case Byte of
  15.        0: (V: TGLZVector4fType);
  16.        1: (X, Y, Z, W: Single);
  17.       //2: (AsVector3f : TGLZVector3f);
  18.   End;
  19.   {$endif}                              
  20.  

That really surprised me in that it gave a 4x speedup across the board, native and SSE.  :D
Thinking I was on the right track, I checked the calling regs; hmm, still passing vectors in two regs both ways.

Ok then, back to ConstRef to try register-addressed movups / movaps; nope, nothing aligned. So why the speedup? And would this invalidate using records? Ok, let's test moving data around as TGLZVector4fType, thus keeping the record usage. That worked, with similar speedups to just using an array.

So here is the test harness. This code may not work on other platforms, but you could substitute code that works on your ABI.

I am going to try to find out why I get a four-times improvement; I presume there must be code elsewhere which has changed to give this speedup.
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 03, 2017, 01:29:18 pm
Don't underestimate the power of aligned memory. There's a reason why a lot of code segments are aligned on 16-byte boundaries, and not just the tops of procedures - if you look at the disassembled code of a for-loop, for example, you'll find that the top of the loop is aligned to a 16-byte boundary, with preceding bytes filled with NOP instructions if necessary.

(And it turns out that Free Pascal doesn't support the "align" modifier as specified in the link a few posts back)
Title: Re: AVX and SSE support question
Post by: photor on December 03, 2017, 01:38:38 pm
got it :)
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 02:13:48 pm
Ok, just tested this on Win7 64 and it goes the other way by a very small margin; not enough difference to make a real call on. Time to get trunk and see what happens on Linux 64 there.
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 02:44:03 pm
Ok, looking at the compiler sources, it seems we will never get parameter support for a single move. So getting things aligned and using movaps with a ConstRef will be the quickest for larger structures.

How the compiler classifies s128 floattype arguments
Code: Pascal  [Select][+][-]
  1.  s128real:
  2.   begin
  3.     classes[0].typ:=X86_64_SSE_CLASS;
  4.     classes[0].def:=carraydef.getreusable_no_free(s32floattype,2);
  5.     classes[1].typ:=X86_64_SSEUP_CLASS;
  6.     classes[1].def:=carraydef.getreusable_no_free(s32floattype,2);
  7.     result:=2;
  8.  end;
  9.  

so I do not know how Jerome is getting good returns in Win10?

AVX moves are going to be interesting

Code: Pascal  [Select][+][-]
  1.           else
  2.             { 4 can only happen for _m256 vectors, not yet supported }
  3.             internalerror(2010021501);
  4.         end;
  5.       end;
Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 06:41:57 pm
Right, finally got movaps everywhere in my test app.

Any classes will need a lot of this sort of layout:

Code: Pascal  [Select][+][-]
  1.   Tform1 = Class(Tform)
  2.     Button1: TButton;
  3.     Label1 : Tlabel;
  4.     Label2 : Tlabel;
  5.     Label3 : Tlabel;
  6.     procedure Button1Click(Sender: TObject);
  7.     Procedure Formcreate(Sender : Tobject);
  8.     Procedure Formshow(Sender : Tobject);
  9.   Private
  10.   Public
  11.   {$CODEALIGN RECORDMIN=16}
  12.   vt1,vt2, vt3 : TGLZVector4f;
  13.   {$CODEALIGN RECORDMIN=4}
  14.    Fs1,Fs2 : Single;
  15.   {$CODEALIGN RECORDMIN=1}
  16.   // .... whatever here booleam etc
  17.   End;                                                    
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 03, 2017, 06:54:02 pm
Hi to all


so I do not know how Jerome is getting good returns in Win10?

I can't answer, I'm just coding, so...

Anyway, I've split all the asm code into 6 include files (one for each Linux/Windows 32/64-bit SSE/AVX case), for better reading and easier debugging.
I've also added 2 of my other units, and I've begun some other little tests.
I've corrected some small spelling bugs and added some short comments.

I've tested 32-bit with Lazarus 1.8rc3, but some errors occurred:
1st: the clamp functions work but raise a SIGSEGV just afterwards.
2nd: the functions with a Single result. The result is stored in the ST register; I tried to set it with the FSTP instruction, but without success.

I've also added some conditional defines for alignment and replaced MOVUPS by MOVAPS, and it works. I've also added 2/3 other little functions, and added AngleBetween in asm, but it's not tested yet.
Performance varies depending on the compiler options and on how the record is declared (packed or not).
The best results I've got are with SSE4/SSE3, not with AVX, so I think AVX will do better with matrix manipulation.

Peter, I didn't include your changes for Unix; I can't test them and don't know exactly where they go.

I've also tested your sample; it works in 32-bit with Laz 1.8rc3 but not in Laz 1.8rc4 64-bit. The best result I had was with {$define USE_RECORD_V}.

Now I have a headache! Next I'll begin some tests with arrays, matrices and quaternions.

Request to BeanzMaster - as well as timing checks, can you also implement some verification in your benchmark program? I have a feeling that some functions return incorrect results. Failing that, I can possibly design something a little more in-depth once I've finished my current task.


Yes, later; one of the first checks needed is divide by 0. Otherwise, compared to the native code, the results are good.

Title: Re: AVX and SSE support question
Post by: dicepd on December 03, 2017, 07:05:13 pm

I've tested 32-bit with Lazarus 1.8rc3, but some errors occurred:
1st: the clamp functions work but raise a SIGSEGV just afterwards.
2nd: the functions with a Single result. The result is stored in the ST register; I tried to set it with the FSTP instruction, but without success.


That is usually a sign of stack corruption, such as moving a whole 128-bit xmm reg when there is only space for 32 or 64 bits. Usually I have found that in 32-bit, if the variable is on the stack, the stack contains a pointer and not the variable itself, so you need a

mov eax, stackedvar
mov [eax], xmm reg
 
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 03, 2017, 08:32:09 pm

That is usually a sign of stack corruption, such as moving a whole 128-bit xmm reg when there is only space for 32 or 64 bits. Usually I have found that in 32-bit, if the variable is on the stack, the stack contains a pointer and not the variable itself, so you need a

mov eax, stackedvar
mov [eax], xmm reg

  mov ecx, RESULT
  mov [ecx], xmm0

not working : vectormath_vector_win32_sse_imp.inc(269,5) Error: Asm: [mov mem??,xmmreg] invalid combination of opcode and operands

and this is what i have in the S file :

Quote
.globl   GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE
GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE:
   # Register ebp allocated
# [258] Asm
   pushl   %ebp
   movl   %esp,%ebp
   leal   -4(%esp),%esp
# Var A located in register edx
# Var $self located in register eax
# Temp -4,4 allocated
# Var $result located at ebp-4, size=OS_F32
   # Register eax,ecx,edx allocated

Another example; this does not work either:

Code: Pascal  [Select][+][-]
  1. class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f;assembler; //nostackframe; register;
  2. asm
  3.   movups xmm0,[A]
  4.   movss  xmm1,[B]
  5.   shufps xmm1, xmm1, $00
  6.   addps  xmm0,xmm1
  7.   movaps [RESULT], xmm0
  8. end;

Quote
# Var A located in register eax
# Var B located in register edx
# Var $result located in register ecx
   # Register eax,ecx,edx allocated

Those errors are annoying  >:D So perhaps making an external object library with MASM or NASM/YASM would be better than using the internal asm?
Title: Re: AVX and SSE support question
Post by: dicepd on December 04, 2017, 02:55:07 am
Quote
Those errors are annoying  >:D So perhaps making an external object library with MASM or NASM/YASM would be better than using the internal asm?

You still have to conform to Pascal calling conventions, so there's not much gain in doing so; you'd probably spend more time trying to get your params passed to your lib correctly.

I am writing some test cases; mark what is bad, carry on coding, and I'll try to sort out the 'annoying' errors.
Title: Re: AVX and SSE support question
Post by: dicepd on December 04, 2017, 03:06:29 am
As for this, I have got this working in unix64; it should work for win64 too, I think, from previous testing.

Code: Pascal  [Select][+][-]
  1.   class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm0,[A]
  4.   movss  xmm1,[B]
  5.   shufps xmm1, xmm1, $00
  6.   addps  xmm0,xmm1
  7.   movhlps xmm1,xmm0
  8. end;              
  9.  
Title: Re: AVX and SSE support question
Post by: dicepd on December 04, 2017, 03:30:33 am
Re comparison operators: in the pure Pascal code, as I read it, every element must pass the comparison test; that was not happening in the asm when one element failed. It passed my tests with the following, which also avoids branching. Comments please before I change a lot of code.
Code: Pascal  [Select][+][-]
  1.  
  2.     cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL  
  3.     movmskps eax, xmm0     // copies a 4 bit mask to eax
  4.     xor eax, $f    // only 1111 should be correct for ANDed compares.
  5.     setz al          // true if zero            
  6.  

Edit 1: Negate fails the tests; the mask is doing a multiply by -1, not setting all items negative as the Pascal code does. Though I suspect the Pascal code is wrong: I've never had a use for setting all components negative, whereas *-1 is vector reversal.
Title: Re: AVX and SSE support question
Post by: dicepd on December 04, 2017, 05:25:46 am

  mov ecx, RESULT
  mov [ecx], xmm0

not working : vectormath_vector_win32_sse_imp.inc(269,5) Error: Asm: [mov mem??,xmmreg] invalid combination of opcode and operands

and this is what i have in the S file :

Quote
.globl   GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE
GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE:
   # Register ebp allocated
# [258] Asm
   pushl   %ebp
   movl   %esp,%ebp
   leal   -4(%esp),%esp
# Var A located in register edx
# Var $self located in register eax
# Temp -4,4 allocated
# Var $result located at ebp-4, size=OS_F32
   # Register eax,ecx,edx allocated

Another example; this does not work either:


This one is easy: you should be returning a Single, not a 128-bit record, so use MOVSS, not MOVAPS.

Peter
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 04, 2017, 01:37:23 pm
I'm having some difficulty compiling the latest version of the unit from BeanzMaster - the GLZTypes unit has an awkward dependency on GLZVectorMath and others, since TGLZVector and TGLZVector2i are not defined.  It's easy enough to fix, but it means that GLZTypes is not self-contained.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 04, 2017, 03:57:17 pm
Quote
Those errors are annoying  >:D So perhaps making an external object library with MASM or NASM/YASM would be better than using the internal asm?

You still have to conform to Pascal calling conventions, so there's not much gain in doing so; you'd probably spend more time trying to get your params passed to your lib correctly.

I am writing some test cases; mark what is bad, carry on coding, and I'll try to sort out the 'annoying' errors.

I found the issue: under 32-bit the RESULT is not aligned, so after replacing "movaps [RESULT], xmm0" with "movups [RESULT], xmm0" it works.
Under 64-bit there's no problem, RESULT is aligned. But there is still a problem with the clamp, lerp, combine and combine2/3 functions. All the others are OK in 32-bit.

As for this, I have got this working in unix64; it should work for win64 too, I think, from previous testing.

Code: Pascal  [Select][+][-]
  1.   class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm0,[A]
  4.   movss  xmm1,[B]
  5.   shufps xmm1, xmm1, $00
  6.   addps  xmm0,xmm1
  7.   movhlps xmm1,xmm0
  8. end;              
  9.  

Huh, you get the right Result with this? Because movhlps moves the W,Z values into the X,Y position??? If I understood well, under Linux64 the result is split, and the correct return is: low half in xmm0, high half in xmm1. Am I correct?

Re comparison operators: in the pure Pascal code, as I read it, every element must pass the comparison test; that was not happening in the asm when one element failed. It passed my tests with the following, which also avoids branching. Comments please before I change a lot of code.
Code: Pascal  [Select][+][-]
  1.  
  2.     cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL  
  3.     movmskps eax, xmm0     // copies a 4 bit mask to eax
  4.     xor eax, $f    // only 1111 should be correct for ANDed compares.
  5.     setz al          // true if zero            
  6.  

Edit 1: Negate fails the tests; the mask is doing a multiply by -1, not setting all items negative as the Pascal code does. Though I suspect the Pascal code is wrong: I've never had a use for setting all components negative, whereas *-1 is vector reversal.


I've tested it; it works, but the result is wrong:

 if v1 = v2 then Cells[1,25] := 'TRUE' else Cells[1,25] := 'FALSE';   

The zero flag is not set under 64-bit, so it always returns TRUE; under 32-bit your function is OK and returns the right result.

For Negate you are right: under 64-bit the result is wrong; normally in our sample the sign of the Y value should change. Under 32-bit the function returns the correct result.
X * -1 is equal to 0 - X, so I chose the latter; SUB is normally faster than MUL.

I'm having some difficulty compiling the latest version of the unit from BeanzMaster - the GLZTypes unit has an awkward dependency on GLZVectorMath and others, since TGLZVector and TGLZVector2i are not defined.  It's easy enough to fix, but it means that GLZTypes is not self-contained.

Ouch, sorry, I forgot to delete the TGLZVectorX types in GLZTypes; that unit is only used by GLZMath. This is because I added MinXYZComponent and MaxXYZComponent, and those two use the functions Min3s and Max3s in the GLZMath unit. So you can just copy/paste those 2 functions into the GLZVectorMath unit and delete the dependency on GLZMath. Or simply comment out the MinXYZ/MaxXYZComponent functions :-[. This comes from my own project. Sorry   %)
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 04, 2017, 06:58:33 pm
It happens. In the meantime, I'm writing my own test kit, drawing on my experience in SQA. It might be additional effort since there are two test kits, but it means we get to doubly test your library for correctness and robustness, and I have a framework from which to include and test my vector array functions.

One minor difference though is that in my own library, the CPU capabilities are checked upon program initialisation (via CPUID) and the best procedures selected based on what's available, using function pointers and inline wrappers. It also allows me to test and compare performance in just one cycle of the test kit by manually selecting which version of SSE or AVX to use. Makes things more complicated and greatly increases code size, but ensures it works on all platforms while taking advantage of modern features if they're available.
Title: Re: AVX and SSE support question
Post by: dicepd on December 04, 2017, 08:45:58 pm
Quote
For Negate you are right: under 64-bit the result is wrong; normally in our sample the sign of the Y value should change. Under 32-bit the function returns the correct result.
X * -1 is equal to 0 - X, so I chose the latter; SUB is normally faster than MUL.

So are you saying the Native Pascal code is wrong?

Quote
I've tested it; it works, but the result is wrong:

 if v1 = v2 then Cells[1,25] := 'TRUE' else Cells[1,25] := 'FALSE';   

The zero flag is not set under 64-bit, so it always returns TRUE; under 32-bit your function is OK and returns the right result.

Ok, I am not understanding your reply here. All I am doing is checking that the asm code reflects what the Pascal code does. I am currently only testing Linux 64-bit, and that asm code works in 64-bit for me.

Anyway all of this is getting confusing. So here is the code I am using for testing, you may find it useful.
It is using FPCUnit and has a GUI runner and a command-line runner for the tests. Basically I have recreated the native class by copying the inc file with a quick rename of the class, so I can run Pascal and assembly side by side and do the comparison in the same code base. To test different compiler options, you have to change the options in the test project at the moment, but it should be possible to automate this by building multiple copies from the command line with differing compiler parameters.

Also, you could just copy the lpi and rename it to reflect the build options that lpi uses, for example win32SSE, win32AVX etc., and open that project for testing.

Hacking the inc file is manual atm but as it is just a single search and replace of  TGLZVector4f to TNativeGLZVector4f a quick sed line in an automated test script would be all that would be needed.

It is a bit hard-coded to sitting in a folder alongside the code to be tested, but nothing that can't be overcome. It is just a first attempt, and it does make tracking down issues much easier and more reliable than eyeballing results on a screen.

I have included your code in this, so hopefully it just works out of the box; that code contains the unix64 mods. Only three failures in what's in the test script: Negate, pNegate and Reflect.

Of course I still have to finish off the comparisons, as I am unsure what your answer above means.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 05, 2017, 12:02:21 am
Hi Peter,

Your unit test is magic, I didn't know about this  :D I'm just adding Vector.ToString and FloatToStrF to see the results with my own eyes.

I made some tests and corrected the Win64_SSE code.
All tests now pass at home except 'AngleBetween' and 'Reflect', but I think we can say that 'Reflect' is correct:

Quote
Vector Reflects do not match : Native = (X: 171.54222 ,Y: 677.06671 ,Z: 489.74261 ,W: 107.84930) --> SSE = (X: 171.54224 ,Y: 677.06677 ,Z: 489.74265 ,W: 107.84931)
As you can see, the results are very, very close. For AngleBetween, SSE returns NaN  :'(

Quote
So are you saying the Native Pascal code is wrong?

Yes, in my opinion the native code should behave like Invert or the class operator -.

These 2 pieces of code give me the same result now:
Code: Pascal  [Select][+][-]
  1. procedure TNativeGLZVector4f.pNegate;
  2. begin
  3.   //if Self.X>0 then
  4.   Self.X := -Self.X;
  5.   //if Self.Y>0 then
  6.   Self.Y := -Self.Y;
  7.   //if Self.Z>0 then
  8.   Self.Z := -Self.Z;
  9.   //if Self.W>0 then
  10.   Self.W := -Self.W;
  11. end;
  12.  
  13. procedure TGLZVector4f.pNegate; assembler; nostackframe; register;
  14. asm
  15.   movaps xmm0,[RCX]
  16.   xorps xmm0, [RIP+cSSE_MASK_NEGATE]
  17.   movaps [RCX],xmm0
  18. End;
  19.  

But I'm a little disturbed by this, because in my previous test the result matched the native code, as we can see here http://forum.lazarus.freepascal.org/index.php/topic,32741.msg267332.html#msg267332 (http://forum.lazarus.freepascal.org/index.php/topic,32741.msg267332.html#msg267332)
on the 2nd screenshot (on the 1st screenshot the results are different)  :o So now I can't say exactly what the real correct result is.

I've also synchronized the EQUAL function with your UNIX64_SSE version, and it works (not tested with SSE4, but it should work too):

Code: Pascal  [Select][+][-]
  1. class operator TGLZVector4f.= (constref A, B: TGLZVector4f): boolean; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm1,[A]
  4.   movaps xmm0,[B]
  5.   {$IFDEF USE_ASM_SSE_4}
  6.     cmpps xmm0,xmm1, cSSE_OPERATOR_EQUAL
  7.     ptest    xmm0, xmm1
  8.     jnz @no_differences
  9.     mov [RESULT],FALSE
  10.     jmp @END_SSE
  11.   {$ELSE}
  12.     cmpps  xmm0, xmm1, cSSE_OPERATOR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 0 = Operator Equal
  13.     movmskps  eax, xmm0
  14.     test  eax, eax
  15.     setnz al
  16.   {$ENDIF}
  17. end;

Tomorrow, if I have time, I'll run the test unit for Win32.

Many thanks and great work Peter, as always 8-)
Title: Re: AVX and SSE support question
Post by: dicepd on December 05, 2017, 12:13:46 am
Code: Pascal  [Select][+][-]
  1.     cmpps  xmm0, xmm1, cSSE_OPERATOR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 0 = Operator Equal
  2.     movmskps  eax, xmm0
  3.     test  eax, eax
  4.     setnz al

This code will return equal if even one item is equal, not only when all items are equal.

Quote
Vector Reflects do not match : Native = (X: 171.54222 ,Y: 677.06671 ,Z: 489.74261 ,W: 107.84930) --> SSE = (X: 171.54224 ,Y: 677.06677 ,Z: 489.74265 ,W: 107.84931)

For this you can adjust the epsilon in the test,  as in

Code: Pascal  [Select][+][-]
  1. Compare(nt1,vt1, 1e-5)

See definition of Compare
Code: Pascal  [Select][+][-]
  1.   function Compare(constref A: TNativeGLZVector4f; constref B: TGLZVector4f;Espilon: Single = 1e-10): boolean;
  2.  

So you can override the resolution of the test.
Title: Re: AVX and SSE support question
Post by: dicepd on December 05, 2017, 12:44:34 am
Ok I just tested the code I provided before on win64 and it works for me.

Code: Pascal  [Select][+][-]
  1.     cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 2 = Operator Less or Equal
  2.     movmskps eax, xmm0
  3.     xor eax, $F
  4.     setz al        

What gets returned in EAX is a mask of matched tests. So you could get 1010 in EAX which means x and z are less or equal but y and w are greater.

Though the test runner is so SLOW in windows.

Code: Pascal  [Select][+][-]
  1. 22:58:04 - Running All Tests
  2. 22:58:16 - Number of executed tests: 61  Time elapsed: 00:00:12.436

compared to linux

Code: Pascal  [Select][+][-]
  1. 12:13:25 - Running All Tests
  2. 12:13:26 - Number of executed tests: 61  Time elapsed: 00:00:00.149
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 05, 2017, 03:44:12 pm
Ok I just tested the code I provided before on win64 and it works for me.

Code: Pascal  [Select][+][-]
  1.     cmpps  xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL    //  Yes: $FFFFFFFF, No: $00000000 ; 2 = Operator Less or Equal
  2.     movmskps eax, xmm0
  3.     xor eax, $F
  4.     setz al        

What gets returned in EAX is a mask of matched tests. So you could get 1010 in EAX which means x and z are less or equal but y and w are greater.


Works for me too :) For testing I've used:
Code: Pascal  [Select][+][-]
  1.  
  2.  vt1.Create(2,  7,  -6, 3);
  3.  vt2.Create(1, 12,  -6, 8);

But still 1 failure --> Vector AngleBetweens do not match : 1.932 --> NaN

Though the test runner is so SLOW in windows.

Code: Pascal  [Select][+][-]
  1. 22:58:04 - Running All Tests
  2. 22:58:16 - Number of executed tests: 61  Time elapsed: 00:00:12.436

compared to linux

Code: Pascal  [Select][+][-]
  1. 12:13:25 - Running All Tests
  2. 12:13:26 - Number of executed tests: 61  Time elapsed: 00:00:00.149


Yes, with Windows it is slow:
Code: Pascal  [Select][+][-]
  1. 15:36:46 - Running All Tests
  2. 15:36:47 - Number of executed tests: 61  Time elapsed: 00:00:00.859

I've begun the tests with win32 and I'm on the right track; some errors, but if I have enough time tonight I think I'll be able to correct them all.
Title: Re: AVX and SSE support question
Post by: dicepd on December 05, 2017, 05:51:45 pm
Ok Jerome,

Here is linux 64 with test harness for SSE SSE3 SSE4 and AVX.

Finished off the rest of the tests for comparison operators.

100% pass rate in all tests for linux 64 across all settings. I have placed -dUSE_ASM etc in project files so I do not have to comment/uncomment defines in the code. Just open the project and it all looks good with highlighting also showing that the right settings are there.

Not saying it is the most efficient atm, just that it works! and is a good starting point for fine tuning :D

Next on my list is making timing tests in a similar manner, and a small framework for developing new functions. I have one function I want to get done, as my program spends 30-40% of its time in it, according to callgrind.

Code: Pascal  [Select][+][-]
  1. function TCutPlane.GetNorm(cen, up, left, down, right: PAffineVector
  2.   ): TAffineVector;
  3. var
  4.   s,t,u,v: TAffineVector;
  5. begin
  6.   VectorSubtract(up^,cen^,s{%H-});
  7.   VectorSubtract(left^,cen^,t{%H-});
  8.   VectorSubtract(down^,cen^,u{%H-});
  9.   VectorSubtract(right^,cen^,v{%H-});
  10.  
  11.   Result.X := s.Y*t.Z - s.Z*t.Y + t.Y*u.Z - t.Z*u.Y + u.Y*v.Z - u.Z*v.Y + v.Y*s.Z - v.Z*s.Y;
  12.   Result.Y := s.Z*t.X - s.X*t.Z + t.Z*u.X - t.x*u.Z + u.Z*v.X - u.X*v.Z + v.Z*s.X - v.X*s.Z;
  13.   Result.Z := s.X*t.Y - s.Y*t.X + t.X*u.Y - t.Y*u.X + u.X*v.Y - u.Y*v.X + v.X*s.Y - v.Y*s.X;
  14.   NormalizeVector(Result);
  15. end;

Title: Re: AVX and SSE support question
Post by: dicepd on December 05, 2017, 07:07:21 pm
Quote
But always 1 failure --> Vector AngleBetweens do not match : 1.932 --> Nan

Post your updated win64 code and I will have a look. This one was tricky, as it is not a pure asm function, and the parameter ordering is quite different when not pure asm: most parameters are on the stack and need a mov of the pointer into a register before loading into an xmm register.
Title: Re: AVX and SSE support question
Post by: dicepd on December 05, 2017, 08:23:50 pm
Ok I got AngleBetween working in win64.

here is the code to load the mmx regs correctly.

Code: Pascal  [Select][+][-]
  1.    
  2.     movaps xmm0,[RCX]       //self is still in rcx
  3.     mov rax, [A]            // A is a pointer on the stack
  4.     movups xmm1, [RAX]
  5.     mov rax, [ACenterPoint] // ACenterPoint is a pointer on the stack
  6.     movups xmm2, [RAX]                
  7.  

Peter
Title: Re: AVX and SSE support question
Post by: dicepd on December 05, 2017, 11:04:29 pm
Best results so far for me, now that everything works. For SSE2 I have found the best compiler flags are:

Quote
-CfSSE3
-Sv
-dUSE_ASM

Others seem to make no difference or make things worse (especially COREAVX: avoid it like the plague).

Some initial results, not final report style yet

Code: Pascal  [Select][+][-]
  1. TimeAddNative:  : 0.222999695688486 seconds
  2. TimeAddAsm:     : 0.0509998993948102 seconds
  3.  
  4. TimeSubNative:  : 0.219000270590186 seconds
  5. TimeSubAsm:     : 0.0520000699907541 seconds
  6.  
  7. TimeMulNative:  : 0.220999983139336 seconds
  8. TimeMulAsm:     : 0.0520000699907541 seconds
  9.  

not bad speedups for such simple routines.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 06, 2017, 01:12:42 am
I've finished all tests on Win64 SSE/SSE3/SSE4 and AVX (I've also updated AVX a little, synchronizing Distance and Length with the SSE4 instructions).
I've also finished the tests on win32 SSE. I still need to check SSE3/4, and I'll do the AVX tests tomorrow if I can, then post the updated code. In the meantime:

Testunit result for 64bit
Code: Pascal  [Select][+][-]
  1. 00:56:38 - Running All Tests
  2. 00:56:38 - Number of executed tests: 68  Time elapsed: 00:00:00.124
  3.  

Testunit result for 32bit
Code: Pascal  [Select][+][-]
  1. 01:05:10 - Running All Tests
  2. 01:05:11 - Number of executed tests: 68  Time elapsed: 00:00:00.155
  3.  

much better now  :)

Just one thing I'm not understanding well: your trick with "movhlps xmm1,xmm0". It's an issue with the stack, but something escapes me. Can you re-explain it to me?

for
Code: Pascal  [Select][+][-]
  1. function TCutPlane.GetNorm(cen, up, left, down, right: PAffineVector): TAffineVector;

this is what I'm starting with:

Code: Pascal  [Select][+][-]
  1. function GetNormFromCutPlane(cen, up, left, down, right: TGLZVector4f): TGLZVector4f;
  2. //  s,t,u,v: xmm2,xmm3, xmm4, xmm5
  3. asm
  4.   movaps xmm2, [cen] //s
  5.   movaps xmm3, xmm2   //t
  6.   movaps xmm4, xmm2   //u
  7.   movaps xmm5, xmm2   //v
  8.  
  9.   //VectorSubtract(up^,cen^,s{%H-});
  10.   movaps xmm1, [up]
  11.   subps xmm2, xmm1
  12.   //VectorSubtract(left^,cen^,t{%H-});
  13.   movaps xmm1, [left]
  14.   subps xmm3, xmm1
  15.   //VectorSubtract(down^,cen^,u{%H-});
  16.   movaps xmm1, [down]
  17.   subps xmm4, xmm1
  18.   //VectorSubtract(right^,cen^,v{%H-});
  19.   movaps xmm1, [right]
  20.   subps xmm5, xmm1
  21.  
  22.   andps xmm2, [RIP+cSSE_MASK_NO_W]
  23.   andps xmm3, [RIP+cSSE_MASK_NO_W]
  24.   andps xmm4, [RIP+cSSE_MASK_NO_W]
  25.   andps xmm5, [RIP+cSSE_MASK_NO_W]
  26.  
  27.   //------------------------------------
  28.   // X := s.Y*t.Z,
  29.   // Y := s.Z*t.X,
  30.   // Z := s.X*t.Y
  31.   // S =   w,z,y,x
  32.   // T = * -,x,z,y
  33.   shufps xmm6, xmm3, 11001001b
  34.   mulps xmm6,xmm2
  35.  
  36.   // X := s.Z*t.Y
  37.   // Y := s.X*t.Z
  38.   // Z := s.Y*t.X
  39.   // S =   w,z,y,x
  40.   // t = * -,y,x,z
  41.   shufps xmm7, xmm3, 11010010b
  42.   mulps xmm7,xmm2
  43.  
  44.   //xmm6 = w,x,z,y
  45.   //xmm7 = w,y,x,z
  46.   subps xmm6,xmm7
  47.   movaps xmm0, xmm6
  48.   //-------------------------------------
  49.  
  50.   //  xmm0        =      xmm6       +        xmm7         +         xmm8        +         xmm2
  51.   //Result.X := (s.Y*t.Z - s.Z*t.Y) + (t.Y*u.Z - t.Z*u.Y) + (u.Y*v.Z - u.Z*v.Y) + (v.Y*s.Z - v.Z*s.Y);
  52.   //Result.Y := (s.Z*t.X - s.X*t.Z) + (t.Z*u.X - t.x*u.Z) + (u.Z*v.X - u.X*v.Z) + (v.Z*s.X - v.X*s.Z);
  53.   //Result.Z := (s.X*t.Y - s.Y*t.X) + (t.X*u.Y - t.Y*u.X) + (u.X*v.Y - u.Y*v.X) + (v.X*s.Y - v.Y*s.X);
  54.  
  55.   addps xmm0,xmm7
  56.   addps xmm0,xmm8
  57.   addps xmm0,xmm2
  58.  
  59.   //NormalizeVector(Result);
  60. end;

EDIT: I've also tried to make the compare with the SSE4 PTEST instruction, but I don't see how to do it without a jump. I found an interesting article
https://stackoverflow.com/questions/34951714/simd-instructions-for-floating-point-equality-comparison-with-nan-nan but I don't understand it all very well  :-[


Title: Re: AVX and SSE support question
Post by: CuriousKit on December 06, 2017, 04:51:21 am
Yeah, it's a little complex, but I think what they're trying to get at is that they combine the results of an IEEE equality (i.e. floating-point "is equal to") and an integer equality (what they call bitwise-equal, but is actually just checking two 32-bit integers for identical values, which are the bit representations of the floating-point numbers, including NaNs).

Intuitively, the results would be combined with logical OR (actually bitwise OR because the results are either all 0s or all 1s), but because of the results of CMPNEQPS and PCMPEQD, they spell out the truth table to prove that the combining operation is ANDN instead.

I'm not certain, but there might be a slight performance penalty if you switch between floating-point and integer processing within the same vector processing unit - this is why there are different opcodes for MOVDQA and MOVAPS, for example, even though they both move 128 bits from aligned memory into an XMM register.

Whether you need a jump or not depends on the code.  If you just need to set a result based on the zero flag, then you can use SETZ or SETNZ. There's no straight answer.
Title: Re: AVX and SSE support question
Post by: dicepd on December 06, 2017, 08:10:49 am

Just one thing I'm not understanding well: your trick with "movhlps xmm1,xmm0". It's an issue with the stack, but something escapes me. Can you re-explain it to me?


Ok, this is all to do with return conventions in linux 64 (SysV x86_64 to be exact), just as win64 has its 4 register parameters with the rest on the stack, etc.

The spec was kindly sourced by CuriousKit, as in this post:

That's a little confusing with Linux, because the way it's behaving implies that it's splitting the 128-bit into two, classing the lower half as SSE and the upper half as SSEUP (see pages 15-17 here: http://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf ), but then converting SSEUP to SSE because it thinks it isn't preceded by an SSE argument (which it does... the lower two floats).  Maybe my interpretation is wrong, but it shouldn't need to split it across 2 registers like that.  Can someone with more experience of the Linux ABI shed some light on that?

There are two type identifiers for SSE values as parameters:
X86_64_SSE_CLASS signifies the first 64 bits of a 128-bit SSE value.
X86_64_SSEUP_CLASS signifies the next 64 bits of a 128-bit SSE value.
There can be more than one of these for 256-bit values.

One thing you have to keep in the back of your mind in any unix environment, when writing code at this level, is that you have to take endianness into account and not write code based on one arch. This seems to be the way System V deals with it, and thus gcc does, and therefore everyone else does (you would want your libs to link, wouldn't you?).

Anyway as seen from the fpc compiler code:
Code: Pascal  [Select][+][-]
  1.  s128real:
  2.   begin
  3.     classes[0].typ:=X86_64_SSE_CLASS;
  4.     classes[0].def:=carraydef.getreusable_no_free(s32floattype,2);
  5.     classes[1].typ:=X86_64_SSEUP_CLASS;
  6.     classes[1].def:=carraydef.getreusable_no_free(s32floattype,2);
  7.     result:=2;
  8.  end;
  9.  

This is exactly how the compiler sees a 128-bit real. So, in general terms, if we did not use nostackframe, then at times the result was placed/wanted on the stack by the fpc compiler. Unlike other platforms, the stack did not contain a pointer; 128 bits had been allocated on the stack for the contents of the xmm reg to be copied to.

After the return from assembler, the compiler then did a movq on each of the two qwords and placed the X86_64_SSE_CLASS part in low xmm0 and the X86_64_SSEUP_CLASS part in low xmm1. It does this for routines it generates itself. Here is the postamble of the native Pascal version of operator +, which does not even use xmm regs for its calculations.

Code: Pascal  [Select][+][-]
  1. # Var $result located at rsp+0, size=OS_128
  2. .........
  3. # [158] End;
  4.         movq    (%rsp),%xmm0
  5.         # Register xmm1 allocated
  6.         movq    8(%rsp),%xmm1
  7.         leaq    24(%rsp),%rsp
  8.         # Register rsp released
  9.         ret
  10.         # Register xmm0,xmm1 released
  11. .Lc12:
  12.  

When we use nostackframe, the above postamble does not occur. Therefore we got good values for x and y [low xmm0] but garbage for z and w; the calling convention was taking whatever was in low xmm1.

So using movhlps xmm1,xmm0 as the last instruction, after whatever you would do to leave the result in xmm0, ensures the Unix ABI is conformed to, and we get the right values back ;) 

Phew.. long post, I hope this makes sense to you Jerome.

Peter
Title: Re: AVX and SSE support question
Post by: dicepd on December 06, 2017, 10:54:21 am
Quote
EDIT: I've also tried to make the compare with the SSE4 PTEST instruction, but I don't see how to do it without a jump.

Jerome, I briefly looked at this when coding the AVX unit, ignoring all the finer points of that post,
It would seem you would have to load some sort of mask into a mmx reg do a suitable  binary comp on the result of ptest (similar to the xor eax, $f) and set result based on one of the flags.

Now for code as simple as we have at the moment I decided that as we were 'getting out' of the mmx pipline anyway the copy flags to eax and immediate xor where the mask is carried in the instruction and does not lead to a mem access would be cheaper than a potentially far access to 128 bits somewhere in mem.

On the other hand if we need to be as pedantic as that post, which I doubt as our numbers in the end should represent a point or vector in simple 3D space where NaNs are errors in logic and  0, -0 should never occur as if we were doing 'real' math on point in space we would always use some form of epsilon.

This may be a case for simple and quick routines vs pedantic routines; allow a choice. TBH, personally I would always go for simple and quick, and test for edge cases before the main calcs where it is needed.
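For anyone wanting to sanity-check the movmskps + xor idea outside FPC, here is the same approach sketched with C SSE intrinsics (the function name is mine, purely illustrative, not project code):

```c
#include <xmmintrin.h>  /* SSE */

/* All-lanes equality via cmpeqps + movmskps: each equal lane becomes
   all-ones, movmskps packs the four sign bits into an int, and
   xor with $f is zero only when every lane matched. */
static int vec4_equal(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    int mask = _mm_movemask_ps(_mm_cmpeq_ps(va, vb));
    return (mask ^ 0xF) == 0;   /* the "xor eax, $f" / "setz al" pair */
}
```

The `(mask ^ 0xF) == 0` line carries the mask in the instruction stream, exactly the cheap alternative to a 128-bit mask load discussed above.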
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 06, 2017, 02:48:02 pm
...
Whether you need a jump or not depends on the code.  If you just need to set a result based on the zero flag, then you can use SETZ or SETNZ. There's no straight answer.

...
This may be a case for simple and quick routines vs pedantic routines; allow a choice. TBH, personally I would always go for simple and quick, and test for edge cases before the main calcs where it is needed.

Thanks Curiosity, it's a little bit clearer in my mind now. And I agree with you, Peter.


Ok, this is all to do with return conventions in Linux 64 (SysV x86_64 to be exact), just as Win64 has its 4 registers, rest on stack, etc.
...
When we use nostackframe, the above postamble does not occur. Therefore we got good values for x and y [low xmm0] but garbage for z and w; the calling convention was taking whatever happened to be in low xmm1.

So using a movhlps xmm1,xmm0 as the last instruction, post whatever you would do if you coded to leave result in xmm0 then ensures the unix abi is conformed to. and we get the right values back ;) 

Phew.. long post, I hope this makes sense to you Jerome.


Thanks Peter, I asked you this because under Win32 the same behaviour appeared in the MulAdd, MullDiv and Lerp functions, and also under Win64 with Combine2/3. It seems to depend on how the args are passed to the function and how the stack is managed; it's like the compiler "pushes the result over the stack". Anyway, all are tested and pass the tests. I've only 1h30 free this afternoon; I'll post the code of the unit test tonight.

Thanks
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 06, 2017, 04:03:13 pm
Ok, I'm back and I've finished all the test units with Win32/64 SSE/SSE3/SSE4 and AVX, all with success on my PC. Now we just miss tests for Linux32.

Title: Re: AVX and SSE support question
Post by: dicepd on December 06, 2017, 09:16:41 pm
Given my comment on using epsilon for equality, I thought I would come up with this as a possibility.

Code: Pascal  [Select][+][-]
  1. function TGLZVector4f.IsEqual(constref Other: TGLZVector4f; const Epsilon: single): Boolean; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm0,[RDI]
  4.   movaps xmm1, [Other]
  5.   movss xmm2, [Epsilon]
  6.   shufps xmm2,xmm2, $0
  7.   subps xmm0,xmm1
  8.   andps xmm0, [RIP+cSSE_MASK_ABS]
  9.   cmpps  xmm0, xmm2, cSSE_OPERATOR_LESS_OR_EQUAL
  10.   movmskps eax, xmm0
  11.   xor eax, $f
  12.   setz al
  13. end;      
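As a cross-check of the semantics, here is a plain scalar C reference of what the routine above computes (an illustrative sketch with my own names, not part of the library):

```c
#include <math.h>

/* Scalar meaning of the SSE IsEqual: subtract, take the absolute
   value per component (the cSSE_MASK_ABS andps), and require every
   component to be within Epsilon (the cmpps/movmskps/xor/setz tail). */
static int vec4_is_equal(const float *self, const float *other, float eps)
{
    for (int i = 0; i < 4; i++)
        if (fabsf(self[i] - other[i]) > eps)
            return 0;   /* one lane fails the LE compare -> mask != $f */
    return 1;
}
```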
Title: Re: AVX and SSE support question
Post by: dicepd on December 07, 2017, 12:40:50 am
Jerome
Here is the inc file for unix32 SSE, tested with a 100% pass rate for SSE, SSE3 and SSE4.

One thing I noticed is that I can't use movaps reliably in 32 bit.

I will be out all tomorrow, so I won't get a chance to finish the AVX till Friday or Saturday.

Looks like the win32 AVX code works just fine for unix32 as well. I just copied and renamed the file in preparation, gave it a blast through the tests, and got 100% with no work to do.

Peter
Title: Re: AVX and SSE support question
Post by: SonnyBoyXXl on December 07, 2017, 03:40:34 pm
WOW, I'm impressed. I hadn't looked at this thread because I was away on a business trip, but it looks like I got a ball rolling :)

But I found some time to work on the translation of the DirectX Math lib; I can use much of your input :)
THX.

So after I have finished it, I will put the files online on GitHub, to be available for everyone!

Best regards.
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 07, 2017, 04:35:23 pm
Just one word of warning... OpenGL and DirectX handle their vectors and matrices differently.  Vectors are row-vectors in DirectX and column-vectors in OpenGL, matrices are row-major in DirectX and column-major in OpenGL, and transformations of vector arrays are performed by post-multiplying in DirectX, and pre-multiplying in OpenGL.

Ultimately, one is the complete transpose of the other.  If you're just passing the resultant vector array into a shader, you can get away with just using one set of functions - otherwise you have to be careful with the ordering in question and not blindly use the same set of functions for both APIs.
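That transpose relationship is easy to demonstrate numerically; here is a small scalar C sketch (the helper names are mine, purely illustrative):

```c
/* Row-vector times row-major M (DirectX style) versus
   M-transpose times column-vector (OpenGL style): one layout is the
   complete transpose of the other, so both give the same numbers.
   Matrices are stored as m[row][col]. */
static void mul_row_vec(const float v[4], const float m[4][4], float out[4])
{
    for (int c = 0; c < 4; c++) {
        out[c] = 0.0f;
        for (int r = 0; r < 4; r++)
            out[c] += v[r] * m[r][c];   /* post-multiply: v * M */
    }
}

static void mul_mat_col(const float mt[4][4], const float v[4], float out[4])
{
    for (int r = 0; r < 4; r++) {
        out[r] = 0.0f;
        for (int c = 0; c < 4; c++)
            out[r] += mt[r][c] * v[c];  /* pre-multiply: Mt * v */
    }
}
```

With `mt` built as the transpose of `m`, the two routines produce identical components, which is why one set of SIMD functions can serve both APIs as long as you are consistent.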
Title: Re: AVX and SSE support question
Post by: dicepd on December 08, 2017, 06:42:12 pm
Here is a first shot at a timing test framework, using FPCUnit again.

Not finished all tests yet but have a look and see if there are some other features wanted.

It outputs CSV for the spreadsheet-oriented people, GitHub markdown, and a Lazarus forum table (misnamed as html atm).
You need to drop these files in the test dir, as it uses TNativeGLZVector4f from the unit tests.

Here is example output, filtered down to 4 tests, for the forum table:
Compiler Flags: -CfSSE3, -Sv, -dUSE_ASM, -dCONFIG_1
| Test | Native | Assembler |
| Vector Op Add Vector | 0.239001 | 0.066999 |
| Vector Op Add Single | 0.553000 | 0.070000 |
| Add Vector To Self | 0.105000 | 0.101000 |
| Add Single To Self | 0.101000 | 0.099000 |

Peter


Edit added correct lpr redownload new zip
Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 09:07:35 am
Ok, getting on with the tests; I should have the 'one test to rule them all' finished sometime this weekend.

But as a brain teaser, it would seem the compiler is beating us hands down on certain functions, especially Length.

A few results.

Compiler Flags: -CfSSE3, -Sv, -dUSE_ASM, -dSSE_CONFIG_1
| Test | Native | Assembler |
| Vector Length | 0.086000 | 0.233000 |
Compiler Flags: -CfSSE42, -Sv, -dUSE_ASM_SSE_4, -dSSE4_CONFIG_1
| Test | Native | Assembler |
| Vector Length | 0.086000 | 0.101000 |
Compiler Flags: -CfAVX, -Sv, -dUSE_ASM_AVX, -dAVX_CONFIG_1
| Test | Native | Assembler |
| Vector Length | 0.081000 | 0.095000 |

It would seem it has a trick up its sleeve, where the code 'looks' worse but is more efficient.

Taking the nearest we got which was the AVX code

Ours
Code: Pascal  [Select][+][-]
  1.     vmovaps xmm0,[RDI]
  2.     vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  3.     vmulps  xmm0, xmm0, xmm0
  4.     vhaddps xmm0, xmm0, xmm0
  5.     vhaddps xmm0, xmm0, xmm0
  6.     vsqrtss xmm0, xmm0, xmm0        

The compilers
Code: Pascal  [Select][+][-]
  1.         # Register rsp allocated
  2. # Var $self located in register rax
  3. # Var $result located in register xmm0
  4.         # Register rdi,rax allocated
  5. # [136] begin
  6.         movq    %rdi,%rax
  7.         # Register rdi released
  8.         # Register xmm0 allocated
  9. # [137] Result := Sqrt((Self.X * Self.X) +(Self.Y * Self.Y) +(Self.Z * Self.Z));
  10.         vmovss  (%rax),%xmm0
  11.         vmulss  %xmm0,%xmm0,%xmm1
  12.         vmovss  4(%rax),%xmm0
  13.         vmulss  %xmm0,%xmm0,%xmm0
  14.         vaddss  %xmm1,%xmm0,%xmm1
  15.         vmovss  8(%rax),%xmm0
  16.         vmulss  %xmm0,%xmm0,%xmm0
  17.         vaddss  %xmm1,%xmm0,%xmm0
  18.         vsqrtss %xmm0,%xmm0,%xmm0
  19. # Var $result located in register xmm0
  20.         # Register rsp released
  21. # [141] end;
  22.         ret
  23.         # Register xmm0 released

It would seem that the three [v]movss loads (each of which clears all the other bits in the register) are more efficient than two long fetches from memory.
In this function both native and asm leave the result in xmm0, so I can see no way that the compiler is optimising the loop differently.

Edit:

Just modified the AVX inc to use the compiler's code and now we get
Compiler Flags: -CfAVX, -Sv, -dUSE_ASM_AVX, -dAVX_CONFIG_1
| Test | Native | Assembler |
| Vector Length | 0.083000 | 0.081000 |
which is probably the saving from the removal of 20M executions of movq %rdi,%rax.


Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 01:12:56 pm
So I looked at this a little more with Distance. The asm was just beating the native. So I tried the following, partly as a proof of concept of my method of doing test coding inside the unit testing framework.

Code: Pascal  [Select][+][-]
  1.   {$ifdef TEST}
  2.     vmovq xmm0, [rdi]
  3.     vmovq xmm1, [A]
  4.     vsubps xmm0, xmm0, xmm1
  5.     vmulps xmm0, xmm0, xmm0
  6.     vmovss xmm1, [rdi+8]
  7.     vmovss xmm2, [A+8]
  8.     vsubps xmm1, xmm1, xmm2
  9.     vmulps xmm1, xmm1, xmm1
  10.     vaddps xmm0, xmm0, xmm1
  11.     vhaddps xmm0, xmm0, xmm0
  12.     vsqrtss xmm0, xmm0, xmm0
  13.   {$else}
  14.     vmovaps xmm0,[RDI]
  15.     vmovaps xmm1, [A]
  16.     vsubps  xmm0, xmm0, xmm1
  17.     vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  18.     vmulps  xmm0, xmm0, xmm0
  19.     vhaddps xmm0, xmm0, xmm0
  20.     vhaddps xmm0, xmm0, xmm0
  21.     vsqrtss xmm0, xmm0, xmm0
  22.   {$endif}                      

Code: Pascal  [Select][+][-]
  1. Compiler Flags: -CfAVX, -Sv, -O3 ,-dUSE_ASM_AVX, -dAVX_CONFIG_1
  2. Test,            Native,   Assembler
  3. Vector Distance, 0.104000, 0.096000
  4. Vector Distance, 0.106000, 0.098000
  5. Vector Distance, 0.104000, 0.096000
  6. Vector Distance, 0.103001, 0.102000
  7.  
  8. Compiler Flags: -CfAVX, -Sv, -O3 ,-dUSE_ASM_AVX, -dTEST -dAVX_CONFIG_1_TEST
  9. Vector Distance, 0.099999, 0.088001
  10. Vector Distance, 0.104000, 0.090000
  11. Vector Distance, 0.102000, 0.088000
  12. Vector Distance, 0.104000, 0.088001
  13. Vector Distance, 0.101000, 0.087000
  14. Vector Distance, 0.102000, 0.088001
  15.  

And we see a speedup in code like this too; I did a few runs to verify that the new code was quicker.
Surprising results, TBH.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 09, 2017, 02:38:20 pm
Hi Peter, very cool timing test. I've made some tests, only with SSE at this time. One optimization possible with 64-bit is that our vectors are aligned, so for example
we can write

Code: Pascal  [Select][+][-]
  1. class operator TGLZVector4f.*(constref A, B: TGLZVector4f): TGLZVector4f; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm0,[A]
  4.   //movaps xmm1,[B]
  5.   mulps  xmm0,[B] //xmm1
  6.   movaps [RESULT], xmm0
  7. end;
  8.  
  9. class operator TGLZVector4f.*(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler; nostackframe; register;
  10. asm
  11.   movaps xmm0,[A]
  12.   //movss  xmm1,[B]
  13.   shufps xmm1,[B] , 0 //xmm1, $00
  14.   mulps  xmm0,xmm1
  15.   movaps [RESULT], xmm0
  16. end;
  17.  
  18. function TGLZVector4f.Negate:TGLZVector4f; assembler; nostackframe; register;
  19. asm
  20.   movaps xmm0,[RCX]
  21.   //movaps xmm1,[RIP+cSSE_MASK_NEGATE]
  22.   xorps xmm0,[RIP+cSSE_MASK_NEGATE] //xmm1
  23.   movaps [RESULT],xmm0
  24. End;
  25.  

we can also optimize CrossProduct :

Code: Pascal  [Select][+][-]
  1. function TGLZVector4f.CrossProduct(constref A: TGLZVector4f): TGLZVector4f; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm0,[RCX]
  4.   movaps xmm1, [A]                 // xmm1 = v2 (still needed: shufps takes its low lanes from the destination register)
  5.   movaps xmm2, xmm0                // xmm2 = v1
  6.   movaps xmm3, xmm1                // xmm3 = v2
  7.  
  8.   shufps xmm2, xmm0, $d2
  9.   shufps xmm3, xmm3, $c9
  10.  
  11.   shufps xmm0, xmm0, $c9  
  12.   shufps xmm1, xmm1, $d2  
  13.  
  14.   mulps  xmm0, xmm1
  15.   mulps  xmm2, xmm3
  16.   subps  xmm0, xmm2
  17.   addps xmm0, [rip+cWOnevector4f] // it would be better to change this to a logical operator
  18.   movaps [RESULT], xmm0      // return result
  19. end;  
  20.  
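As a scalar cross-check of what the shuffles compute ($c9 rotates (x,y,z,w) to (y,z,x,w) and $d2 to (z,x,y,w), so the two mulps/subps produce the classic cross product, with W forced to 1 by the cWOnevector4f add), here is a C sketch; `cross4` is my own name, not a library routine:

```c
/* Scalar reference for the shuffled SSE CrossProduct above. */
static void cross4(const float a[4], const float b[4], float out[4])
{
    out[0] = a[1] * b[2] - a[2] * b[1];  /* y1*z2 - z1*y2 */
    out[1] = a[2] * b[0] - a[0] * b[2];  /* z1*x2 - x1*z2 */
    out[2] = a[0] * b[1] - a[1] * b[0];  /* x1*y2 - y1*x2 */
    out[3] = 1.0f;                       /* homogeneous W  */
}
```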

This code works on Win64, but perhaps not on Linux 64-bit.

I've also noticed that with the functions that return a Single, deleting the "movaps" can decrease performance (as with min/max and the compare functions).

For AVX, these are the right functions for Distance and Length; your code is based on SSE3 instructions, so it's slower.

Code: Pascal  [Select][+][-]
  1. function TGLZVector4f.Distance(constref A: TGLZVector4f):Single;assembler; nostackframe; register;
  2. // Result = xmm0
  3. Asm
  4.   vmovaps xmm0,[RCX]
  5.   //vmovaps xmm1, [A]
  6.   //vsubps  xmm0, xmm0, xmm1
  7.   vsubps  xmm0, xmm0, [A]
  8.   vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  9.   vdpps xmm0, xmm0, xmm0, $FF
  10.   vsqrtss xmm0, xmm0 , xmm0
  11.   //  movss [RESULT], {%H-}xmm0
  12. end;
  13.  
  14. function TGLZVector4f.Length:Single;assembler; nostackframe; register;
  15. Asm
  16.   vmovaps xmm0,[RCX]
  17.   vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  18.   vdpps xmm0, xmm0, xmm0, $FF
  19.   vsqrtss xmm0, xmm0, xmm0
  20. //  movss [RESULT], {%H-}xmm0
  21. end;

One thing I've noticed with the "operator to self" procedures and the min/max procedures is that sometimes they are a little bit better and sometimes not (between -2% and +3% gain in speed), except for Dot, Cross, DivideBy2, Normalize...

| Test                                | Native      | Assembler  | Gain in % 
| Vector Op Subtract Vector  | 0.114000  | 0.048000    | 57.895 % 
| Vector Op Add Vector          | 0.118000  | 0.050000    | 57.627 % 
| Vector Op Multiply Vector    | 0.116000  | 0.049000    | 57.758 % 
| Vector Op Divide Vector      | 0.136000  | 0.055000    | 59.559 % 
| Vector Op Add Single          | 0.118000  | 0.050000    | 57.627 % 
| Vector Op Subtract Single    | 0.114000  | 0.051000    | 55.263 % 
| Vector Op Multiply Single      | 0.118000  | 0.051000    | 56.780 % 
| Vector Op Divide Single        | 0.136000  | 0.055000    | 59.559 % 
| Vector Op Negative            | 0.119000  | 0.048000    | 59.664 % 
| Vector Op Equal                  | 0.047000  | 0.042000    | 10.637 % 
| Vector Op GT or Equal          | 0.049000  | 0.050000    | -2.042 % 
| Vector Op LT or Equal          | 0.047000  | 0.043000    | 8.511 % 
| Vector Op Greater              | 0.051000  | 0.050000    | 1.960 % 
| Vector Op Less                  | 0.048000  | 0.042000    | 12.501 % 
| Vector Op Not Equal            | 0.120000  | 0.050000    | 58.334 % 
| Add Vector To Self              | 0.090000  | 0.088000    | 2.222 % 
| Sub Vector from Self          | 0.088000  | 0.088000    | 0.000 % 
| Multiply Vector with Self      | 0.088000  | 0.090000    | -2.273 % 
| Divide Self by Vector          | 0.105000  | 0.107000    | -1.905 % 
| Add Single To Self              | 0.091000  | 0.090000    | 1.098 % 
| Sub Single from Self            | 0.088000  | 0.088000    | 0.000 % 
| Multiply Self with single        | 0.088000  | 0.089000    | -1.137 % 
| Divide Self by single            | 0.105000  | 0.105000    | 0.001 % 
| Invert Self                        | 0.068000  | 0.066999    | 1.472 % 
| Negate Self                        | 0.068000  | 0.066999    | 1.472 % 
| Self Abs                            | 0.067000  | 0.068000    | -1.493 % 
| Self Normalize                    | 0.410000  | 0.339000    | 17.317 % 
| Self Divideby2                    | 0.113000  | 0.093000    | 17.699 % 
| Self CrossProduct Vector      | 0.275000  | 0.188000    | 31.636 % 
| Self Min Vector                  | 0.078000  | 0.068000    | 12.821 % 
| Self Min Single                    | 0.069000  | 0.068000    | 1.450 % 
| Self Max Vector                  | 0.080000  | 0.069000    | 13.749 % 
| Self Max Single                  | 0.067000  | 0.069000    | -2.985 % 

Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 02:47:34 pm
Quote
For AVX, these are the right functions for Distance and Length; your code is based on SSE3 instructions, so it's slower.

Not according to the tests I have done: the distance is ~10% quicker using the code above. Whether that holds for other platforms I will have to wait and see. I suspect the speedup is from the memory access. All memory access in linux64 is already using the aps variant; I removed all the movups variants. It crashes if you try to use non-aligned memory.
Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 03:00:46 pm
Anyway, my priority is to finish the tests. I see you have added one of the features I had planned (gain in %); the other I want to add is reporting the accuracy to however many decimal places. Probably more important with larger routines than the ones we are doing now.
Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 03:15:53 pm
So I cut the AVX Distance down to just these five instructions

Code: Pascal  [Select][+][-]
  1.     vmovaps xmm0,[RDI]
  2.     vsubps  xmm0, xmm0, [A]
  3.     vandps  xmm0, xmm0, [RIP+cSSE_MASK_NO_W]
  4.     vdpps xmm0, xmm0, xmm0, $FF
  5.     vsqrtss xmm0, xmm0, xmm0      

The code passes the functional test, but it is the slowest version so far??????

Vector Distance, 0.101000, 0.115000

Native now beats it.

Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 09, 2017, 03:17:04 pm
Anyway, my priority is to finish the tests. I see you have added one of the features I had planned (gain in %); the other I want to add is reporting the accuracy to however many decimal places. Probably more important with larger routines than the ones we are doing now.

Yes, that's clear. So I hadn't made tests with Distance and Length; you are right.

this

Code: Pascal  [Select][+][-]
  1. function TGLZVector4f.Distance(constref A: TGLZVector4f):Single;assembler; nostackframe; register;
  2. Asm
  3.   vmovaps xmm0,[RCX]
  4.   vsubps xmm0, xmm0, [A]   //xmm1
  5.   vmulps xmm0, xmm0, xmm0
  6.   vmovss xmm1, [RCX+8]
  7.   vmovss xmm2, [A+8]
  8.   vsubps xmm1, xmm1, xmm2
  9.   vmulps xmm1, xmm1, xmm1
  10.   vaddps xmm0, xmm0, xmm1
  11.   vhaddps xmm0, xmm0, xmm0
  12.   vsqrtss xmm0, xmm0, xmm0
  13. end;

Vector Op Distance, 0.201000, 0.073001, 63.681 %
Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 04:38:39 pm
That might not pass the test for Win64. I would try

Code: Pascal  [Select][+][-]
  1.   function TGLZVector4f.Distance(constref A: TGLZVector4f):Single;assembler; nostackframe; register;
  2.   Asm
  3.   {$ifdef TEST}
  4.     vmovq xmm0, [rcx]         // move 64 bits and clear top  x,y,0,0
  5.     vmovq xmm1, [A]           // move 64 bits and clear top  x1,y1,0,0
  6.     vsubps xmm0, xmm0, xmm1   // x-x1,y-y1,0,0
  7.     vmulps xmm0, xmm0, xmm0   // (x-x1)^2,(y-y1)^2,0,0
  8.     vmovss xmm1, [rcx+8]     // z,0,0,0
  9.     vmovss xmm2, [A+8]       // z1,0,0,0
  10.     vsubps xmm1, xmm1, xmm2   //z-z1,0,0,0
  11.     vmulps xmm1, xmm1, xmm1   //(z-z1)^2,0,0,0
  12.     vaddps xmm0, xmm0, xmm1   //(x-x1)^2+(z-z1)^2, (y-y1)^2, 0, 0
  13.     vhaddps xmm0, xmm0, xmm0  //(x-x1)^2+(z-z1)^2 + (y-y1)^2, 0, 0
  14.     vsqrtss xmm0, xmm0, xmm0            
  15.  

vmovq should be quicker, as it only moves 64 bits. It is one trick to avoid having to load a no-W mask, which is what I was trying to do in the first place. It does not matter so much for this routine, but where we need to return a 0 in W it might be a quicker option than loading a mask. It may also have to do with swapping the pipeline between integer and float; I think I read there are some penalties there.
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 09, 2017, 09:33:07 pm
I have a question about the comparison operators. Should they return true only if all of the elements are true? (e.g. Input1 = Input2 only if all the elements match, and Input1 < Input2 only if all the elements in Input1 are smaller than those in Input2.) Or are they designed to return true if at least one of the elements is equal, for example?
Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 10:13:20 pm
I have a question about the comparison operators. Should they return true only if all of the elements are true? (e.g. Input1 = Input2 only if all the elements match, and Input1 < Input2 only if all the elements in Input1 are smaller than Input2) Or are they designed to return true if at least one of the elements are equal, for example.

From what I can see in the Pascal code all must pass the test, so in linux64 and 32 they all do.
Title: Re: AVX and SSE support question
Post by: CuriousKit on December 09, 2017, 10:17:44 pm
I haven't finished my test kit yet and I still need to make the output a bit friendlier, but currently it fails for the = operator when compiled for SSE2 and SSE4.1, hence why I asked (haven't tested AVX yet).  Find attached said test kit.

(I just hope it compiles!)
Title: Re: AVX and SSE support question
Post by: dicepd on December 09, 2017, 10:29:12 pm
Ok here is the latest version. It combines functional and timing tests along with a framework for testing code which you are trying to improve.

Enclose your new test code with

Code: Pascal  [Select][+][-]
  1. {$ifdef TEST}
  2. newcode here
  3. {$else}
  4. leave old code here
  5. {$endif}

There are a set of build modes, each comes with a _TEST variant, which should have the same flags as the buildmode you are testing / developing for, along with a -dTEST flag to trigger the new code.

There are a set of string values in config.inc which will report the flags used, if set properly. It should be fairly self-explanatory if you take a look in config.inc. This may look like a hassle, but remembering where those numbers came from is even worse. You should be able to get reproducible numbers and have some confidence in any improvements. If you are working in just one codebase (linux64) you could leave your test code in until it has been tested/transferred to other platforms.

What is there at the moment is just a base line. Create new build modes as required.

Full functional testing (the first three groups in the test harness) takes only a few tens of milliseconds (~57 ms on my machine). Timing tests obviously take a bit longer, but they are all selectable, so timing the routine you are working on along with all the functional tests will take half a second at most.


Title: Re: AVX and SSE support question
Post by: dicepd on December 10, 2017, 02:37:43 pm
Ok this is now getting to where it should be usable for development.

I really like advanced records; they have allowed me to eat my own dog food and get to an environment where I can scratch my own itches without polluting Beanz' code base, and offer code back that may or may not be used but that I can still use alongside.

Using record helpers has meant I can do all coding for a function in one source file and test that functionality using one test file. This works in the unit test environment. Included in this release is a template.

Intended workflow is :
Copy and rename both the glz_template_code.pas and glz_template_test_cases.pas to whatever floats your boat.
Reflect the filename to the unitname (Ensure same case for us unix folks)
Decide what your function is going to be called and replace all the YOUR_FUNCTION_NAME_HERE placeholders along with the parameters needed.

Write your function in pascal for the TNativeGLZVector4f variant at the bottom of the file.
Write some functional tests if this  is new code and not just some of your old favorite working routine routines.
When happy with the pascal code copy to the TGLZVector4f variant just above.
Write a compare test in the test file. At this point you can run the comparison test using one of the native config build modes.
Write the timing test while in this build mode.
That's it for test coding everything is ready to start work on assembler.
Select another build mode you want to code for, and the relevant function will become un-greyed.
Hit F9 and do a test build in this mode. There should now be a .s file in the output dir where you can work out which registers your parameters are in. Using this small file is much easier than looking through reams of output code from the main code base, and the sources are organised so that the assembler call is right at the beginning of the .s file.

Code and test.

And hey, if no one else is interested in your specialist function, you still get all the advantages as if it were part of the core code base.

As I said, I have dogfooded this approach myself and included in this dist is the code I am working on which shows what a working env looks like. Toss it out when you have had a look.

Other changes I have moved the results files to a results sub dir.

Best to put this in a new clean dir alongside Beanz code.

Oh by the way did I say I like advanced records :D
Peter
 


Title: Re: AVX and SSE support question
Post by: dicepd on December 10, 2017, 07:36:47 pm
And the good news is that when you convert something more complicated, the speedups are dramatic.
Code: Pascal  [Select][+][-]
  1. Compiler Flags: -CfSSE3, -Sv, -O3, -dUSE_ASM, -dSSE_CONFIG_1
  2. Test,         Native,   Assembler,   Speedup
  3. AverageNormal4, 1.554000, 0.076000, 20.447260
  4.  

20x faster  8-)
Title: Re: AVX and SSE support question
Post by: dicepd on December 11, 2017, 12:57:25 pm
I am getting around to looking at pipeline optimisations through reordering of naive working code.

It certainly makes a difference, and in going through this process I wanted a measure of better-or-worse that was a little more certain than a 20M-iteration run time.

I read the Intel doc here https://www.intel.co.uk/content/www/uk/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html, but it is a bit of overkill for comparative timing.

I condensed this down to the following simple code, which is not 'perfect', but for the task of checking whether changes make things better or worse I think it will do.
Code: Pascal  [Select][+][-]
  1. // hacky: we just take the low 32 bits of the CPU counter
  2. // CPUID serialises the pipeline but clobbers (E/R)BX, so preserve it
  3. // we are only interested in the min value really,
  4. // so a loop of 100 will always get us the min
  5. function ASMTick: int64; assembler;
  6. asm
  7. {$ifdef CPU64}
  8.   push rbx
  9.   XOR rax, rax
  10.   CPUID
  11.   RDTSC  //Get the CPU's time stamp counter.
  12.   pop rbx
  13.   mov [Result], RAX
  14. {$else}
  15.   push ebx
  16.   XOR eax, eax
  17.   CPUID
  18.   RDTSC  //Get the CPU's time stamp counter.
  19.   pop ebx
  20.   mov [Result], eax
  21. {$endif}
  22. end;  
  23.  
  19.  

Comments please, before I release something as quick and dirty as this.
It gives results such as

Proc Tick AverageNormal4:, 203, 316, 205.40

Which is Min, Max, Average. As we are only really interested in seeing code changes which affect the Min, the odd bad number from a wraparound is of no concern.
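The same min-of-N strategy can be sketched in C with the compiler's __rdtsc intrinsic (an illustration only, not the FPC routine; `work` and `sink` are dummy names of mine):

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */
#endif

static uint64_t ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return 0;  /* non-x86 fallback, for illustration only */
#endif
}

/* Take the cheapest of N timed runs, so the odd interrupt or counter
   wraparound does not pollute the comparison. */
static uint64_t min_cycles(void (*fn)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = ticks();
        fn();
        uint64_t dt = ticks() - t0;
        if (dt < best)
            best = dt;
    }
    return best;
}

static volatile int sink;
static void work(void) { for (int i = 0; i < 100; i++) sink += i; }
```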


Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 11, 2017, 02:56:51 pm
Hi Peter, I'm working on the unit test and have made some little improvements; I'll post an update later.
By the way, while waiting for profiling/benchmarking you can take a look at the attached zip. It's a part of my project. It needs some little updates, but normally it works well as is.
Title: Re: AVX and SSE support question
Post by: dicepd on December 12, 2017, 09:38:14 am
Jerome,
Now I am well impressed with SSE et al.
While awaiting your changes, I got the routine I was talking about before backported into my test code for real-time engraving. It spends most of its time calculating normals so that the screen representation looks good during this call; calculating the mesh (250,000 vertices) takes a fraction of the time. Pure Pascal does it in ~32 sec; replacing the normal calc with SSE brings it to ~21 sec. For one call, not a bad improvement in speed.

Time to call grind again and see where the next pinch point is, though from memory it's Point In Volume.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 17, 2017, 08:21:59 pm
Hi to all, I was very busy last week, so....
I took some time to code and post the new update of our VectorMath UnitTest.

What's new:
- I reorganized and made some little changes (split into two projects, 32-bit/64-bit), added a BaseTimingTest class, modified Config.inc...
- I added DistanceSquare, LengthSquare and Spacing SSE functions for TGLZVector4f (native only has Spacing)
- I added some {$ifdef TEST} in the SSE functions
- I made some little updates to my Profiler units (enough for now, but not totally finished) and added them for the timing tests
- I introduced Matrix and Quaternion with some SSE functions (sorry, only for Win64 at this time)
- I introduced a Vector helper (including HmgPlane) and a Matrix helper
- I added Quaternion and Matrix test and timing cases
- I added a VectorAndHmgPlaneHelper test case
- I added a clean and full HTML output. Just click on the HTML file to see the result in your browser.

Now we have around 170 tests!

By reading the code you'll discover some web links I found during my research. One of the coolest is
https://gcc.godbolt.org/. It helped me a lot with the Matrix Invert function.

You'll also find some little optimizations of the SSE code (mostly between {$ifdef TEST}), but not everywhere yet.

Bugs:
- Some VectorHelper test cases are wrong for SSE: CreatePlane, NormalizePlane and AverageNormal4 (wrong and not finished; I'm a little tired, I'll restart later)

Note: using the profiler inside a loop is not advisable. Our tests are not complex enough; the call to RDTSC is disruptive and decreases
performance a lot, so the timing results are not really good. So I keep the profiler outside the loops.

One thing I discovered: it's impossible to create more than one helper per record. Only the last one declared is taken into account. sniff :(  :'(

Peter: your ASMTick function does not work on Win10 64-bit. Mine in GLZCpuID (GetClockCycleTickCount) is ok.

Cheers
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 17, 2017, 11:41:30 pm
I'm back; I did some tests and noticed that the sqrtss instruction is very slow here, so it's better to use sqrtps. In the function Quaternion.Magnitude mine was slower than the native, so
replace it with this (it's for SSE3):

Code: Pascal  [Select][+][-]
  1. function TGLZQuaternion.Magnitude : Single; assembler; nostackframe; register;
  2. asm
  3.   movaps xmm0, [RCX]
  4.   mulps  xmm0, xmm0
  5.   movshdup    xmm1, xmm0
  6.   addps       xmm0, xmm1
  7.   movhlps     xmm1, xmm0
  8.   addss       xmm0, xmm1
  9.   sqrtps xmm0, xmm0
  10. end;
 

It's the best optimized code I've found for Length/Magnitude.


Title: Re: AVX and SSE support question
Post by: dicepd on December 18, 2017, 08:15:32 pm
Hi Jerome,

Just downloaded this; I had a busy weekend with a family pre-Christmas get-together, so not much time this weekend.

It works fine on Win64 for SSE (I like the HTML results), but as for getting started on a Linux version, getting one of the Native_CONFIG_X modes working first would have to be a priority, so that I then have something to work against.

It is now getting a bit complicated to use the forum as a source-sharing device; do you have any sort of GitHub or other source server where collaboration would be a little easier?
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 19, 2017, 02:26:33 pm
It is now getting a bit complicated to use the forum as a source sharing device, do you have any sort of github or other source server where collaboration would be a little easier?

Hi Peter, yes it would be better. I have an account on GitHub. After Christmas I'm on holiday; I'll configure the git and send you the URL  :)

Merry Christmas !  O:-)
Title: Re: AVX and SSE support question
Post by: dicepd on December 24, 2017, 09:22:37 am
Jerome,

Here are the helpers for *nix 64- and 32-bit, along with the main sources, which include the three new methods in the base class. The AVX helpers are just stubs at the moment.

AverageNormal4 is now completed; it needs 1e-7 as the epsilon in the test.

Also in a folder is an example of this single function, showing a 'submission' request for a 'new' method using the template.

This was a bit painful, as you put your helpers in the main code file, so I had to sort out quite a bit before I could even compile and start work. Lots of minor problems with case etc., but no point listing them all when I can sort them out later.

Ready for more functions to work on now :)

Merry Christmas  O:-)
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 26, 2017, 08:00:39 pm
Hi Peter

I just created a repository for our 'SIMD Vector Math Unit Tester' on GitHub: https://github.com/jdelauney/SIMD-VectorMath-UnitTest

I also added the type TGLZVector2f, implemented some functions in SSE (Win64), and added tests.

See you soon
Title: Re: AVX and SSE support question
Post by: dicepd on December 26, 2017, 09:31:38 pm
Hi Jerome,

My github handle is the same as here. I will do a pull and get some minor fixes diffed up.

Peter
Title: Re: AVX and SSE support question
Post by: dicepd on December 27, 2017, 02:36:46 am
Ok I have the Unix 64 bit 2d vector working in 7 local commits along with the two native targets. Just awaiting the ability to push now.
Title: Re: AVX and SSE support question
Post by: dicepd on December 28, 2017, 09:34:18 am
My first check-in gets unix working, along with the start of the vector2f and vector4f structural changes. I will continue with all the other configs so they compile, by setting up stubs for the work that needs doing.

Added a .gitignore in the project dir so git status does not fill the screen with crap.

Plane seems to be broken in win64 now; I got normalize working by adding some var initialisation to the overridden setup. (The plane functions never worked in unix.)

Removed lps files.

I have a local mod here, not checked in, to change the utils xml handling from WideString to UTF8 (more Lazarus-friendly in my opinion); I will hold this until you agree. Otherwise I have other work (ifdefs) to make the xml work in unix.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on December 28, 2017, 09:57:45 am
No problem for me changing the xml handling from WideString to UTF8  :D
Title: Re: AVX and SSE support question
Post by: dicepd on December 28, 2017, 12:42:43 pm
32-bit unix added for SSE.

For single results in 32-bit, the ABI wants the result value in st(0).

We already have the result in xmm0, but the XMM registers are separate from the x87 stack (it is the MMX mm registers that alias st(0)-st(7), not the XMM ones), so the value has to be moved across.

So I cannot use nostackframe and have to copy the result to the stack; the compiler then copies this value from the stack back to st(0).

Anyone any ideas on a method to avoid this stack copy and just leave the value in xmm0?
Title: Re: AVX and SSE support question
Post by: dicepd on December 28, 2017, 05:39:14 pm
Jerome,

Everything that is not win64 is created, stubbed and runs. I am not saying it works, just that you can run any test on unix64, unix32 and win32 without the compiler complaining or the runtime generating a seg fault.

I am sure this will not last as you get some more routines started, but I will try to keep it at least in this state, so you can concentrate on just win64 and one codebase.
Title: Re: AVX and SSE support question
Post by: SonnyBoyXXl on January 04, 2018, 10:56:33 am
Hi all,
I've finished the translation of the DirectX Math headers and am now testing the functions. I have a problem with this one:
Code: Pascal  [Select][+][-]
  1. function XMVectorSetBinaryConstant(constref C0: UINT32; constref C1: UINT32; constref C2: UINT32; constref C3: UINT32): TXMVECTOR;{ assembler;}
  2. const
  3.     g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
  4. asm
  5.            // Move the parms to a vector
  6.            // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
  7.            MOVUPS        XMM0,TXMVECTOR([c3])
  8.            MOVUPS        XMM1,TXMVECTOR([c2])
  9.            MOVUPS        XMM2,TXMVECTOR([c1])
  10.            MOVUPS        XMM3,TXMVECTOR([c0])
  11.            PUNPCKLDQ   XMM3,XMM1
  12.            PUNPCKLDQ   XMM2,XMM0
  13.            PUNPCKLDQ   XMM3,XMM2  // XMM3 = vTemp
  14.            // Mask off the low bits
  15.            PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
  16.            // 0xFFFFFFFF on true bits
  17.            PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
  18.            // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
  19.            PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
  20.            MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
  21. end;  

The result in XMM3 is correct, as I can see in the debugger, but the function doesn't return it.
Debugging the value of result shows strange behavior: the result is available at a breakpoint on the MOVUPS line but not at end. See the attached pictures.
What is wrong here? I use
Code: Pascal  [Select][+][-]
  1. {$ASMMODE intel}
  2. {$Z4}
  3. {$CODEALIGN CONSTMIN=16}
  4. {$A4}  
and compiler flag -CfSSE.

The casting of the constants C0, C1, C2, C3 as TXMVECTOR is to avoid a compiler hint that MOVUPS needs an M128 address. If no casting is done, the result is the same.

Title: Re: AVX and SSE support question
Post by: CuriousKit on January 04, 2018, 11:50:25 pm
Try looking at the disassembly of the program to see what it's doing in the function epilogue, and also to see what Result actually represents (likely a pre-reserved block of memory).
Title: Re: AVX and SSE support question
Post by: BeanzMaster on January 05, 2018, 12:33:49 am
Hi
1st: instead of the cast and MOVUPS, use MOVQ.
2nd: for const access on a 64-bit system, use RIP-relative addressing: mov xmm0, [RIP+MyConst].
3rd: don't cast the result; MOV [Result], xmm0 is enough.

And as CuriousKit says, take a look at the .s file (see the compiler's -a option).
Title: Re: AVX and SSE support question
Post by: SonnyBoyXXl on January 05, 2018, 01:25:25 am
I've found some time today to work on that problem.
First I changed the ASM code: I checked how MS VS 2017 handles the _mm_set_epi32 intrinsic. This is the new routine:
Code: Pascal  [Select][+][-]
  1. function XMVectorSetBinaryConstant(constref C0: UINT32; constref C1: UINT32; constref C2: UINT32; constref C3: UINT32): TXMVECTOR;
  2.      assembler;
  3. const
  4.     g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
  5. asm
  6.            // Move the parms to a vector
  7.            // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
  8.            movd        xmm0,dword ptr [C3]
  9.            movd        xmm1,dword ptr[C2]
  10.            movd        xmm2,dword ptr[C1]
  11.            movd        xmm3,dword ptr[C0]
  12.            punpckldq   xmm3,xmm1
  13.            punpckldq   xmm2,xmm0
  14.            punpckldq   xmm3,xmm2 // XMM3 = vTemp
  15.            // Mask off the low bits
  16.            PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
  17.            // 0xFFFFFFFF on true bits
  18.            PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
  19.            // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
  20.            PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
  21.            MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
  22. end;      

When I now set a breakpoint on the "movd        xmm1,dword ptr[C2]" line, I see in the debugger that the value of XMM0 is not what it should be.
Now I looked at the .s file.

Quote
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$TXMVECTOR:
.Lc128:
.Ll314:
# [2903] g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
   pushl   %ebp
.Lc130:
.Lc131:
   movl   %esp,%ebp
.Lc132:
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
.Ll315:
# [2907] movd        xmm0,dword ptr [C3]
   movd   12(%ebp),%xmm0
.Ll316:
# [2908] movd        xmm1,dword ptr[C2]
   movd   (%ecx),%xmm1
.Ll317:
# [2909] movd        xmm2,dword ptr[C1]
   movd   (%edx),%xmm2
.Ll318:
# [2910] movd        xmm3,dword ptr[C0]
   movd   (%eax),%xmm3
.Ll319:
# [2911] punpckldq   xmm3,xmm1
   punpckldq   %xmm1,%xmm3
.Ll320:
# [2912] punpckldq   xmm2,xmm0
   punpckldq   %xmm0,%xmm2
.Ll321:
# [2913] punpckldq   xmm3,xmm2 // XMM3 = vTemp
   punpckldq   %xmm2,%xmm3
.Ll322:
# [2915] PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
   pand   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll323:
# [2917] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
   pcmpeqd   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll324:
# [2919] PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
   pand   TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2920] MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
   movups   %xmm3,8(%ebp)
.Ll326:
# [2921] end;
   leave
   ret   $8
.Lc129:
.Lt14:
.Ll327:

C3 is located on the stack, so I changed the function to
Code: Pascal  [Select][+][-]
  1. function XMVectorSetBinaryConstant(constref C0: UINT32; constref C1: UINT32; constref C2: UINT32; const C3: UINT32): TXMVECTOR;
  2.      assembler;  

The .s output is
Quote
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$TXMVECTOR:
.Lc128:
.Ll314:
# [2903] g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
   pushl   %ebp
.Lc130:
.Lc131:
   movl   %esp,%ebp
.Lc132:
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
.Ll315:
# [2907] movd        xmm0,dword ptr [C3]
   movd   12(%ebp),%xmm0
.Ll316:
# [2908] movd        xmm1,dword ptr[C2]
   movd   (%ecx),%xmm1
.Ll317:
# [2909] movd        xmm2,dword ptr[C1]
   movd   (%edx),%xmm2
.Ll318:
# [2910] movd        xmm3,dword ptr[C0]
   movd   (%eax),%xmm3
.Ll319:
# [2911] punpckldq   xmm3,xmm1
   punpckldq   %xmm1,%xmm3
.Ll320:
# [2912] punpckldq   xmm2,xmm0
   punpckldq   %xmm0,%xmm2
.Ll321:
# [2913] punpckldq   xmm3,xmm2 // XMM3 = vTemp
   punpckldq   %xmm2,%xmm3
.Ll322:
# [2915] PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
   pand   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll323:
# [2917] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
   pcmpeqd   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll324:
# [2919] PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
   pand   TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2920] MOVUPS  TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
   movups   %xmm3,8(%ebp)
.Ll326:
# [2921] end;
   leave
   ret   $8
.Lc129:
.Lt14:
.Ll327:
As you can see, the output is the same. But most of all, the value in XMM0 is now valid.

The only problem remaining is that the result is still not valid.
If I change the routine so that the result is also in a register and not on the stack, everything works perfectly (that is, I pass a TXMVECTOR as input instead of the four UINT32 values, so I have the in-var in a register and also the out-var).
It seems this is a problem when the result lies on the stack?
And I have found this post https://forum.lazarus.freepascal.org/index.php?topic=29097.0 (https://forum.lazarus.freepascal.org/index.php?topic=29097.0)
This is the bug tracker https://bugs.freepascal.org/view.php?id=32710#c104254 (https://bugs.freepascal.org/view.php?id=32710#c104254).

So I think the problem is the same on Windows?


Title: Re: AVX and SSE support question
Post by: SonnyBoyXXl on January 06, 2018, 05:54:01 pm
I got the function running now with these modifications:

Code: Pascal  [Select][+][-]
  1. function XMVectorSetBinaryConstant(const C0: UINT32; const C1: UINT32; const C2: UINT32; const C3: UINT32): PXMVECTOR;
  2. const
  3.     g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
  4. var
  5.     x: TXMVECTOR;
  6. begin
  7.     asm
  8.                // Move the parms to a vector
  9.                // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
  10.                MOVD        XMM0, [C3]
  11.                MOVD        XMM1, [C2]
  12.                MOVD        XMM2, [C1]
  13.                MOVD        XMM3, [C0]
  14.                PUNPCKLDQ   XMM3,XMM1
  15.                PUNPCKLDQ   XMM2,XMM0
  16.                PUNPCKLDQ   XMM3,XMM2 // XMM3 = vTemp
  17.                // Mask off the low bits
  18.                PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
  19.                // 0xFFFFFFFF on true bits
  20.                PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
  21.                // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
  22.                PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
  23.                MOVUPS  [x], XMM3// return _mm_castsi128_ps(vTemp);
  24.     end;
  25.     Result := @x;
  26. end;

This is the .s output:

Quote
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR:
.Lc128:
.Ll314:
# [2925] begin
   pushl   %ebp
.Lc130:
.Lc131:
   movl   %esp,%ebp
.Lc132:
   leal   -80(%esp),%esp
# Var C0 located at ebp-16, size=OS_32
# Var C1 located at ebp-32, size=OS_32
# Var C2 located at ebp-48, size=OS_32
# Var C3 located at ebp+8, size=OS_32
# Var $result located at ebp-64, size=OS_32
# Var x located at ebp-80, size=OS_NO
   movl   %eax,-16(%ebp)
   movl   %edx,-32(%ebp)
   movl   %ecx,-48(%ebp)
#  CPU PENTIUM
.Ll315:
# [2929] movd        xmm0, [C3]
   movd   8(%ebp),%xmm0
.Ll316:
# [2930] movd        xmm1, [C2]
   movd   -48(%ebp),%xmm1
.Ll317:
# [2931] movd        xmm2, [C1]
   movd   -32(%ebp),%xmm2
.Ll318:
# [2932] movd        xmm3, [C0]
   movd   -16(%ebp),%xmm3
.Ll319:
# [2933] punpckldq   xmm3,xmm1
   punpckldq   %xmm1,%xmm3
.Ll320:
# [2934] punpckldq   xmm2,xmm0
   punpckldq   %xmm0,%xmm2
.Ll321:
# [2935] punpckldq   xmm3,xmm2 // XMM3 = vTemp
   punpckldq   %xmm2,%xmm3
.Ll322:
# [2937] PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
   pand   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR_$$_G_VMASK1,%xmm3
.Ll323:
# [2939] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
   pcmpeqd   TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR_$$_G_VMASK1,%xmm3
.Ll324:
# [2941] PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
   pand   TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2942] MOVUPS  [x], XMM3// return _mm_castsi128_ps(vTemp);

   movups   %xmm3,-80(%ebp)
#  CPU PENTIUM
.Ll326:
# [2944] result:=@x;
   leal   -80(%ebp),%eax
   movl   %eax,-64(%ebp)
.Ll327:
# [2945] end;
   movl   %ebp,%esp
   popl   %ebp
   ret   $4
.Lc129:
.Lt14:
.Ll328:

Why is this working?
Title: Re: AVX and SSE support question
Post by: dicepd on January 06, 2018, 09:55:05 pm
When you have parameters or returns on the stack, you have to look at the size to work out whether it is a value or a pointer.

If the return is a pointer, then you can use something like this:
Code: Pascal  [Select][+][-]
  1.   mov    ebx,  [Result]
  2.   vmovups [ebx], xmm0                
  3.  

For parameter pointers which are on the stack, you will require something like this:

Code: Pascal  [Select][+][-]
  1.   mov    ebx,  [right]
  2.   movups xmm5, [ebx]      

32-bit usually puts pointers for most things on the stack.

Looking at your case, you declared a local variable which was allocated space on the stack, which is why you have the following:
Code: Pascal  [Select][+][-]
  1. movups   %xmm3,-80(%ebp)
This is a value on the stack, not a pointer.
Title: Re: AVX and SSE support question
Post by: SonnyBoyXXl on January 07, 2018, 12:36:29 am
Yes, it is really confusing.
 :(

I've now continued testing and have another function:
Code: Pascal  [Select][+][-]
  1. function XMVectorSet(const x, y, z, w: single): TXMVECTOR; assembler;
  2. asm
  3.                MOVD        XMM0, [w]
  4.                MOVD        XMM1, [z]
  5.                MOVD        XMM2, [y]
  6.                MOVD        XMM3, [x]
  7.                PUNPCKLDQ   XMM3,XMM1
  8.                PUNPCKLDQ   XMM2,XMM0
  9.                PUNPCKLDQ   XMM3,XMM2
  10.                MOVUPS  [result], XMM3 // _mm_set_ps( w, z, y, x );
  11. end;  

As you see, this is the same assembler code as the first part of XMVectorSetBinaryConstant. The difference is that the input parameters are of type single.
Therefore the .s output is

Quote
DIRECTX.MATH_$$_XMVECTORSET$SINGLE$SINGLE$SINGLE$SINGLE$$TXMVECTOR:
.Lc261:
.Ll822:
# [5426] asm
   pushl   %ebp
.Lc263:
.Lc264:
   movl   %esp,%ebp
.Lc265:
# Var $result located in register eax
# Var x located at ebp+20, size=OS_F32
# Var y located at ebp+16, size=OS_F32
# Var z located at ebp+12, size=OS_F32
# Var w located at ebp+8, size=OS_F32
.Ll823:
# [5427] MOVD        XMM0, [w]
   movd   8(%ebp),%xmm0
.Ll824:
# [5428] MOVD        XMM1, [z]
   movd   12(%ebp),%xmm1
.Ll825:
# [5429] MOVD        XMM2, [y]
   movd   16(%ebp),%xmm2
.Ll826:
# [5430] MOVD        XMM3, [x]
   movd   20(%ebp),%xmm3
.Ll827:
# [5431] PUNPCKLDQ   XMM3,XMM1
   punpckldq   %xmm1,%xmm3
.Ll828:
# [5432] PUNPCKLDQ   XMM2,XMM0
   punpckldq   %xmm0,%xmm2
.Ll829:
# [5433] PUNPCKLDQ   XMM3,XMM2
   punpckldq   %xmm2,%xmm3
.Ll830:
# [5434] MOVUPS  [result], XMM3 // _mm_set_ps( w, z, y, x );
   movups   %xmm3,(%eax)
.Ll831:
# [5435] end;
   leave
   ret   $16

The difference is that here the result is in a register.

So what comes out is:

Same routine, input params as UINT32:
Quote
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
--> not working directly; the address of result is on the stack and must be loaded into a register


input params as SINGLE:
Quote
# Var $result located in register eax
# Var C0 located at ebp+20, size=OS_F32
# Var C1 located at ebp+16, size=OS_F32
# Var C2 located at ebp+12, size=OS_F32
# Var C3 located at ebp+8, size=OS_F32
--> working, because the address of result is located in a register

I've applied your comment about the stack parameter to the routine, and it is working now.

Code: Pascal  [Select][+][-]
  1. function XMVectorSetBinaryConstant(constref C0, C1, C2: UINT32; const c3: UINT32): TXMVECTOR; assembler;
  2. const
  3.     g_vMask1: TXMVECTOR = (u32: (1, 1, 1, 1));
  4. asm
  5.            // Move the parms to a vector
  6.            // __m128i vTemp = _mm_set_epi32(C3,C2,C1,C0);
  7.            MOVD        XMM0, [C3]
  8.            MOVD        XMM1, [C2]
  9.            MOVD        XMM2, [C1]
  10.            MOVD        XMM3, [C0]
  11.            PUNPCKLDQ   XMM3,XMM1
  12.            PUNPCKLDQ   XMM2,XMM0
  13.            PUNPCKLDQ   XMM3,XMM2 // XMM3 = vTemp
  14.            // Mask off the low bits
  15.            PAND    XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
  16.            // 0xFFFFFFFF on true bits
  17.            PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
  18.            // 0xFFFFFFFF -> 1.0f, 0x00000000 -> 0.0f
  19.            PAND    XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
  20.            PUSH    EBX
  21.            MOV     EBX, [result]
  22.            MOVUPS  [EBX], XMM3 // return _mm_castsi128_ps(vTemp);
  23.            POP     EBX
  24. end;    

Thanks!
Title: Re: AVX and SSE support question
Post by: CuriousKit on February 09, 2018, 06:53:43 pm
I developed a new feature for FPC that might help you with this endeavour. Still undergoing testing though before it makes it into the 3.1.1 build... "vectorcall".

https://bugs.freepascal.org/view.php?id=32781

Note that I also fixed the System V ABI to use the SSE registers properly, so the code that passes the result into the low half of XMM0 and XMM1 will have to be reworked a bit.
Title: Re: AVX and SSE support question
Post by: dicepd on February 14, 2018, 10:30:04 pm
Ok, CuriousKit  I am impressed  :)

Will update my trunk and see what happens if I ifdef a few VectorCalls into the code on Linux64, and also whether doing nothing at all breaks the existing code.

Removing the movhlps xmm0,  xmm1 can only be a good thing if the result is passed back in xmm0
Title: Re: AVX and SSE support question
Post by: dicepd on February 15, 2018, 07:17:25 am
Good news and bad news, I'm afraid, CuriousKit.

Good news is, that our code base still works fine as is with trunk which has your patches applied.

Bad news is that, at least on *nix64, vectorcall causes an internal error on the following bit of test code I was trying when evaluating the patch.

Code: Pascal  [Select][+][-]
  1. class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; {$ifdef USE_VECTORCALL} vectorcall;{$else}register;{$endif} assembler; nostackframe;
  2. asm
  3.   {$ifndef USE_VECTORCALL}
  4.   movaps  xmm0, [A]
  5.   movaps  xmm1, [B]
  6.   {$endif}
  7.   addps   xmm0, xmm1
  8.   {$ifndef USE_VECTORCALL}
  9.   movhlps xmm1, xmm0
  10.   {$endif}
  11. end;        
  12.  

I did notice that none of the tests in your patch exercise methods of the record; they just use the record as a parameter.

Note that as I could not get it to compile, the code above is probably not going to work anyway; I was just trying to find out where the new calling convention places the parameters in the registers.
Title: Re: AVX and SSE support question
Post by: dicepd on February 15, 2018, 09:30:17 am
More issues with this patch.

It seems that it is not passing Self in RDI on *nix64 anymore.

I can no longer see register allocation or parameter allocation in the .s assembler file, which is hampering working out what is going on.

It seems to be a little more broken than first thought :-\

OK, it would seem that the RDI bug, which did not show itself the first time, is down to Self getting out of alignment.

I changed movaps [RDI], xmm0 to movups [RDI], xmm0 and all tests worked again. I will dig some more into why this might be happening, though I'm still hampered by the lack of info in the .s file.

A suggested test for the above

Code: Pascal  [Select][+][-]
  1. MyXMM.Create(V1,V2)
  2. begin
  3.  Self := V1 + V2;
  4. end;
  5.  

Or something along those lines depending how you want to declare things.
Title: Re: AVX and SSE support question
Post by: Thaddy on February 15, 2018, 09:42:32 am
I can reproduce that. It seems you are not conservative enough with registers? But it is a great effort.
Note that as it is now it is also hard to port.
Title: Re: AVX and SSE support question
Post by: dicepd on February 15, 2018, 10:35:32 am
@Thaddy

I presume when you say hard to port you are thinking of ARM and NEON?
If I could get trunk to compile for aarch64 on my raspi3 directly (not cross-platform), I would be looking at NEON versions of what we are doing.
Title: Re: AVX and SSE support question
Post by: dicepd on February 15, 2018, 11:41:30 am
Quote
Note that I also fixed the System V ABI to use the SSE registers properly, so the code that passes the result into the low half of XMM0 and XMM1 will have to be reworked a bit.

OK, I have tried to force this behaviour without using vectorcall, with no luck. It would seem that the original calling convention still applies even though you have made changes to the unix ABI.
Title: Re: AVX and SSE support question
Post by: CuriousKit on February 16, 2018, 11:55:50 am
It will only pass parameters into the full XMM0 register etc if they are aligned to 16-byte boundaries.

What internal error are you getting? I'll see if I can track it down.  "vectorcall" should be ignored on *nix64.
Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 12:42:08 pm
Quote
What internal error are you getting? I'll see if I can track it down.  "vectorcall" should be ignored on *nix64.

Not a lot of help from the message itself, I am afraid. The popup says it is a scanner message, but I do not know how it determined that.

Code: Pascal  [Select][+][-]
  1. vectormath_vector4f_unix64_sse_imp.inc(4,1) Error: Compilation raised exception internally
  2.  

Quote
It will only pass parameters into the full XMM0 register etc if they are aligned to 16-byte boundaries.

Alignment seems to be an issue; see the Self bug above. Do we now have to use align 16 after the record, as in the example?

Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 01:38:47 pm
OK CuriousKit,

I just ran all the unit tests under Linux (even commenting out the $ifdef win64 so the whole suite ran) and everything passes your unit tests.
Title: Re: AVX and SSE support question
Post by: CuriousKit on February 16, 2018, 01:43:45 pm
Interesting.  And yes, the records have to be aligned to 16-byte boundaries, either with "align 16" if that's been implemented, or using {$CODEALIGN RECORDMIN=16}{$PACKRECORDS C} (see the code examples).  This is a design choice that is also used in C++, because MOVAPS etc. is several cycles faster than MOVUPS.

An internal error or exception is automatically a bug, even if you have the most garbled code in the universe.  I don't have a Linux 64 system to test the compiler on, unfortunately, although I wonder what the exception is (I also wonder if there's a way to update the error messages to actually say what the exception is).
Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 02:27:27 pm
All the code was previously aligned, as we use the aps variants throughout. Somehow with this patch the alignment is being ignored on unix.

Also, most of our calling conventions use constref, so there is no copy to the stack; such a copy seems to be happening in the assembler generated for your unit tests.

Code: Pascal  [Select][+][-]
  1. .section .text.n_p$vectorcall_hva_test2_$$_plus$tm128$tm128$$tm128
  2.         .balign 16,0x90
  3. .globl  P$VECTORCALL_HVA_TEST2_$$_plus$TM128$TM128$$TM128
  4.         .type   P$VECTORCALL_HVA_TEST2_$$_plus$TM128$TM128$$TM128,@function
  5. P$VECTORCALL_HVA_TEST2_$$_plus$TM128$TM128$$TM128:
  6. .Lc1:
  7. # Temps allocated between rbp-72 and rbp-52
  8.         # Register rbp allocated
  9. .Ll1:
  10. # [vectorcall_hva_test1.pas]
  11. # [26] begin
  12.         pushq   %rbp
  13. .Lc3:
  14. .Lc4:
  15.         movq    %rsp,%rbp
  16. .Lc5:
  17.         leaq    -80(%rsp),%rsp
  18. # Temp -72,16 allocated
  19. # Temp -16,16 allocated
  20. # Var X located at rbp-16, size=OS_128
  21. # Temp -32,16 allocated
  22. # Var Y located at rbp-32, size=OS_128
  23. # Temp -48,16 allocated
  24. # Var $result located at rbp-48, size=OS_128
  25. # Temp -52,4 allocated
  26. # Var I located at rbp-52, size=OS_S32
  27.         # Register xmm0,xmm1 allocated
  28.         movdqa  %xmm0,-16(%rbp)
  29.         # Register xmm0 released
  30.         movdqa  %xmm1,-32(%rbp)
  31.         # Register xmm1 released
  32. .Ll2:
  33.  
  34.  
Title: Re: AVX and SSE support question
Post by: CuriousKit on February 16, 2018, 03:32:21 pm
Hmmm, looks like I have a way to go before this addition is correct.  I noticed a stack realignment in one of my tests (and made a comment about it), but that happened under Windows, not *nix.  Sorry that it's not quite going to plan.  I'll try to get a computer rigged up with Linux in the future so I can test these issues more thoroughly.

Can you submit a bug report with a reproducible example of incorrect functionality with alignment and the internal exception?
Title: Re: AVX and SSE support question
Post by: CuriousKit on February 16, 2018, 04:22:34 pm
That disassembly you posted... is that on Linux?
Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 04:26:47 pm
Yes, that is on Linux.
Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 06:26:23 pm
Quote
Sorry that it's not quite going to plan.
No problems there; I'm very willing to help any way I can.

Trying to get some tests for you, I started really simple:

Code: Pascal  [Select][+][-]
  1. program vectorcall_pd_test1;
  2.  
  3. {$IFNDEF CPUX86_64}
  4.   {$FATAL This test program can only be compiled on Windows or Linux 64-bit with an Intel processor }
  5. {$ENDIF}
  6. {$MODESWITCH ADVANCEDRECORDS}
  7. {$ASMMODE Intel}
  8. type
  9.   { TM128 }
  10.   {$push}
  11.   {$CODEALIGN RECORDMIN=16}
  12.   {$PACKRECORDS C}
  13.   TM128 = record
  14.     public
  15.     class operator +(A, B: TM128): TM128; vectorcall;
  16.     case Byte of
  17.       0: (M128_F32: array[0..3] of Single);
  18.       1: (M128_F64: array[0..1] of Double);
  19.   end;
  20.   {$pop}
  21.  
  22. { TM128 }
  23.  
  24. class operator TM128.+(A, B: TM128): TM128; vectorcall; assembler; nostackframe;
  25. asm
  26.   addps xmm0, xmm1
  27. end;
  28.  
  29. var
  30.   xm1, xm2, xm3: TM128;
  31.  
  32. begin
  33.   xm3 := xm1 + xm2;
  34.  
  35. end.                              
  36.  

And the assembler produced was as good as it could get, with the exception of movdqa  %xmm0,%xmm0:

Code: Pascal  [Select][+][-]
  1. .section .text.n_p$vectorcall_pd_test1$_$tm128_$__$$_plus$tm128$tm128$$tm128
  2.         .balign 16,0x90
  3. .globl  P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128
  4.         .type   P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128,@function
  5. P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128:
  6. .Lc1:
  7. # [vectorcall_hva_test2.pas]
  8. # [29] asm
  9. #  CPU ATHLON64
  10. .Ll1:
  11. # [30] addps xmm0, xmm1
  12.         addps   %xmm1,%xmm0
  13. #  CPU ATHLON64
  14. .Ll2:
  15. # [31] end;
  16.         ret
  17.         # Register xmm0 released
  18. .Lc2:
  19. .Lt2:
  20. .Le0:
  21.  
  22.  [37] xm3 := xm1 + xm2;
  23.         movdqa  U_$P$VECTORCALL_PD_TEST1_$$_XM2(%rip),%xmm1
  24.         # Register xmm0 allocated
  25.         movdqa  U_$P$VECTORCALL_PD_TEST1_$$_XM1(%rip),%xmm0
  26.         call    P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128@PLT
  27.         movdqa  %xmm0,%xmm0
  28.         movaps  %xmm0,U_$P$VECTORCALL_PD_TEST1_$$_XM3(%rip)
  29.         # Register xmm0 released
  30.  
  31.  

So it looks like I will have my work cut out trying to get a simple test that reproduces the errors. Will let you know when I get something simple enough to submit as a bug.
Note there is no reference to Self here.
Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 07:05:01 pm
Next, a simple test of what happens to Self.
Code: Pascal  [Select][+][-]
  1.     function Add(A: TM128): TM128; vectorcall;
  2.  

Code: Pascal  [Select][+][-]
  1. # [33] xm3 := xm1.Add(xm2);
  2.         movdqa  U_$P$VECTORCALL_PD_TEST1_$$_XM2(%rip),%xmm0
  3.         leaq    U_$P$VECTORCALL_PD_TEST1_$$_XM1(%rip),%rcx
  4.         movaps  %xmm0,%xmm1
  5.         call    P$VECTORCALL_PD_TEST1$_$TM128_$__$$_ADD$TM128$$TM128@PLT
  6.         movdqa  %xmm0,%xmm0
  7.         movaps  %xmm0,U_$P$VECTORCALL_PD_TEST1_$$_XM3(%rip)

So it would appear that Self is passed as a pointer in RCX; however, the unix convention is that Self is passed as a pointer in RDI.
The first parameter is passed in xmm1, although it goes via xmm0 for some reason.
There appears to be a redundant movdqa after the call.
There is still no info in the body of the function to indicate which parameters are passed in which registers; this has been ascertained only by looking at the call site, not from the function definition where we usually find this information.
The return value is good, as there is no split of the 128 bits into two 64s as in FPC 3.0.4.

I hope this helps a bit.
Title: Re: AVX and SSE support question
Post by: dicepd on February 16, 2018, 08:57:01 pm
Small progress on working through possible issues with our code base.

If I declare like this

Code: Pascal  [Select][+][-]
  1.   TVector4fType = packed array[0..3] of Single;
  2.   TM128 = record
  3.     public
  4.     class operator +(A, B: TM128): TM128; vectorcall;
  5.     case Byte of
  6.       0: (M128_F32: TVector4fType);
  7.   end;                                                                                            
  8.  

It is not recognised as a vector, and the parameters are passed as pointers in standard registers. The packed keyword makes no difference whether it is present or not. The generated code then expects 4 singles in xmm0-xmm3, which it uses to populate the result via a series of 4 movss instructions.

This is possibly why it is getting confused by our codebase.
Title: Re: AVX and SSE support question
Post by: CuriousKit on February 17, 2018, 01:40:50 am
I can explain why the last example is getting passed as a pointer... there's nothing to restrict it to a 16-byte boundary, so the compiler has to assume that every variable of that type is unaligned even if it does happen to fall on a 16-byte boundary, hence it's treated as a complex record type.  That's intended behaviour.

I noticed the "movdqa %xmm0,%xmm0" myself during development and wasn't sure what was causing it, but when I submit my next batch of improvements to the peephole optimizer, I'll look out for that one (it already removes references to "mov %eax,%eax", for example).  I'll have to double-check though that the matching ymm or zmm isn't being used, because the VEX-encoded form, "vmovdqa %xmm0,%xmm0", has the effect of zeroing the upper 128 bits of %ymm0 and the upper 384 bits of %zmm0, and hence isn't a null operation (the legacy SSE encoding leaves the upper bits untouched).

Passing Self into RCX when it should be RDI is indeed a bug, and I would recommend posting this as an actual bug report.  I'm not sure if I can do anything about Self always being passed by reference though - I think the compiler treats it like an object - what's the generated assembly for a record containing a single integer field?  Moving the 2nd parameter into XMM0 and then into XMM1 looks like a compiler inefficiency in how it allocates temporary registers (do the debug messages say anything about the registers being allocated and released?).  It can be corrected either in the peephole optimizer (which does similar things already) or with more advanced data flow analysis (something I'm working on, which I named the "Deep Optimizer" before I discovered the official term).  Such a feature will also help to correct the mixing of 'movdqa' and 'movaps', since using the wrong one incurs a performance penalty (you should only use 'movdqa' if you're using the relevant registers for integer operations).

To note what parameters are passed into what registers, you'll have to compile a Pascal function that uses vectorcall with a number of vector-like parameters and see how they interact.  Vectorcall dictates that XMM0 to XMM5 are used for vector/float inputs, HFAs and HVAs, although if there aren't enough free registers to fully contain a homogeneous aggregate (basically, an array of 1 to 4 aligned vectors or floats of the same type), it is wholly passed on the stack, but any vector/float parameters that follow will go into the registers that are left.  Return values are passed through XMM0 to XMM3, although XMM1 to XMM3 are only used if the return type is a homogeneous aggregate.  Integer parameters are passed in the same way as the regular Win64 calling convention dictates (or on Win32, following the rules of 'fastcall').
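As a minimal sketch of those rules (hypothetical declarations; the register assignments in the comments follow Microsoft's published vectorcall convention for Win64 and are not actual compiler output):

```pascal
type
  TVec4f = packed array[0..3] of Single; // one 16-byte vector (an HVA element)

// Hypothetical signatures; the comments show where vectorcall would place things.
function VAdd(A, B, C: TVec4f): TVec4f; vectorcall;
// A -> XMM0, B -> XMM1, C -> XMM2; Result -> XMM0

function Scale(S: Single; V: TVec4f; N: Integer): TVec4f; vectorcall;
// S -> XMM0, V -> XMM1 (float/vector args take XMM0-XMM5 in order);
// N -> R8 (integer args keep their positional Win64 registers:
// RCX, RDX, R8, R9); Result -> XMM0
```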
Title: Re: AVX and SSE support question
Post by: dicepd on February 17, 2018, 06:57:09 am
Calling convention bug reported.
https://bugs.freepascal.org/view.php?id=33184

Quote
I can explain why the last example is getting passed as a pointer...
I only posted simple code; this was surrounded by the usual codealign etc.
I was assuming the bug lay in the possibility that the declared type was not an array of 4, and that the compiler did not look at the underlying type to see whether it was typecast-compatible with an array of 4 singles.

One thing I noted when I rolled what I had learnt from these tests into the main code base was that the use of {$PACKRECORDS C} in your tests breaks the alignment of any consts declared using this type. Removing {$PACKRECORDS C} fixed a whole slew of problems in code such as
Code: Pascal  [Select][+][-]
  1. movaps    xmm1, XMMWORD PTR [RIP+cOneVector4f]
which would segfault with it in (the usual indication that a non-aligned memory access has occurred).
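For reference, a minimal sketch of the kind of alignment directives involved (assuming FPC's {$CODEALIGN} directive; the constant name mirrors the one above):

```pascal
{$CODEALIGN CONSTMIN=16} // typed constants aligned to at least 16 bytes
{$CODEALIGN VARMIN=16}   // global variables likewise

type
  TVector4fType = packed array[0..3] of Single;

const
  cOneVector4f: TVector4fType = (1.0, 1.0, 1.0, 1.0);
  // movaps XMMWORD PTR [RIP+cOneVector4f] requires the 16-byte
  // alignment requested above; without it the access can segfault.
```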

Quote
To note what parameters are passed into what registers, you'll have to compile a Pascal function that uses vectorcall with a number of vector-like parameters and see how they interact.

Hmm... this just makes life more difficult than it was before; doable, but not ideal by any means. As shown by the various questions about calling-parameter usage already in this thread, it is difficult enough working this stuff out without having to trawl through generated code trying to find the usage.

Not such a real problem for us, as we have a well-structured test harness where we can work this out, but for other users it is maybe not such a good plan.

Quote
I'm not sure if I can do anything about Self always being passed by reference though

I doubt you could do anything about that without the FPC devs accepting a pure FPC calling convention for this, which I seriously doubt will ever happen. Too many problems in shared libs etc.

So I will park any further testing until the RCX/RDI issue is resolved, and get back to testing our stuff. But in general this is looking promising.

Title: Re: AVX and SSE support question
Post by: CuriousKit on February 18, 2018, 12:40:23 am
I stand corrected on one thing... the System V ABI does support unaligned vectors, unlike vectorcall.  I'll see if I can correct that and hence fix your library!
Title: Re: AVX and SSE support question
Post by: dicepd on February 18, 2018, 06:32:19 am
Quote
I stand corrected on one thing... the System V ABI does support unaligned vectors, unlike vectorcall.  I'll see if I can correct that and hence fix your library!

??? Nothing in our library uses unaligned vectors. At least in 64-bit it does not; we have been quite strict in making sure that all accesses are aligned, for performance reasons. Only 32-bit uses the unaligned assembler variants, and to be honest 32-bit is not the priority, as it is much slower because of the fewer registers available and the fact that we are forced to use unaligned assembler calls.

32-bit is a limitation of the FPC Pascal calling convention; it should be possible to make vectorcall work for all 32-bit Intel platforms, as the 32-bit calling convention is a Pascal-defined calling convention, if I remember correctly.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on March 31, 2018, 08:20:23 pm
Hi to all, I'm currently making some updates to the vectormath lib (https://github.com/jdelauney/SIMD-VectorMath-UnitTest).

I'm working with Double.

This is a piece of the code:
Code: Pascal  [Select][+][-]
  1. class operator TGLZVector2d.+(constref A, B: TGLZVector2d): TGLZVector2d; assembler; nostackframe; register;
  2. asm
  3.   movapd xmm0, [A]
  4.   movapd xmm1, [B]
  5.   addpd  xmm0, xmm1
  6. end;  

This code works well under Windows but not under Linux.

The strange thing is in the .s file; see:

Quote
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_plus$tglzvector2d$tglzvector2d$$tglzvector2d
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D:
.Lc213:
# Var A located in register rdi
# Var B located in register rsi
# [vectormath_vector2d_unix64_sse_imp.inc]
# [4] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [5] movapd xmm0, [A]
   movapd   (%rdi),%xmm0
# [6] movapd xmm1,
   movapd   (%rsi),%xmm1
# [7] addpd  xmm0, xmm1
   addpd   %xmm1,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [8] end;
   ret
   # Register xmm0,xmm1 released
.Lc214:
.Le94:
   .size   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D, .Le94 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D

The same with the use of the Single type:
Quote
GLZVECTORMATH$_$TGLZVECTOR2F_$__$$_plus$TGLZVECTOR2F$TGLZVECTOR2F$$TGLZVECTOR2F:
.Lc102:
# Var A located in register rdi
# Var B located in register rsi
# Var $result located in register xmm0
# [vectormath_vector2f_unix64_sse_imp.inc]
# [4] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [5] movq  xmm0, [A]
   movq   (%rdi),%xmm0
# [6] movq  xmm1,
   movq   (%rsi),%xmm1
# [7] addps xmm0, xmm1
   addps   %xmm1,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [8] end;
   ret
   # Register xmm0 released

As we can see, no result is allocated for Double. Do you have any idea, or is it a bug in the FPC compiler?

Title: Re: AVX and SSE support question
Post by: CuriousKit on April 01, 2018, 03:13:31 am
Under the System V ABI that 64-bit Linux uses, floating-point results of type Single or Double are passed via XMM0.  I don't see any fault with the code in this instance, or am I missing something?
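A tiny illustration of that rule (a sketch; under the System V AMD64 ABI the generated code keeps everything in XMM0, with no general-purpose register involved):

```pascal
// System V AMD64: X arrives in XMM0, and the Double result
// is also returned in XMM0.
function Half(X: Double): Double;
begin
  Result := X * 0.5;
end;
```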

In other news, I have finally fixed the bug with "vectorcall" where it puts Self into RCX instead of RDI on Linux, instead of silently ignoring the Windows-only calling convention.  Patch is here: https://bugs.freepascal.org/view.php?id=33542 - sorry it took so long, especially for a surprisingly simple fix.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on April 03, 2018, 10:14:04 pm
Hi, CuriousKit
Quote
I don't see any fault with the code in this instance, or am I missing something?
I checked my code; all seems OK.

The strange thing is that with functions like Length, which returns a Double, or Round, which returns a TGLZVector2i, all is OK. But with every function that has a return type of TGLZVector2d, the result is not allocated:

Quote
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_length$$double
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE:
.Lc247:
# Var $self located in register rdi
# Var $result located in register xmm0
# [181] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [182] movapd xmm0, [RDI]
   movapd   (%rdi),%xmm0
# [183] mulpd  xmm0, xmm0
   mulpd   %xmm0,%xmm0
# [184] haddpd xmm0, xmm0
   haddpd   %xmm0,%xmm0
# [187] sqrtsd   xmm0, xmm0
   sqrtsd   %xmm0,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [188] end;
   ret
   # Register xmm0 released
.Lc248:
.Le111:
   .size   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE, .Le111 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE

.section .text.n_glzvectormath$_$tglzvector2d_$__$$_round$$tglzvector2i
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I:
.Lc257:
# Var $self located in register rdi
# Var $result located in register rax
# [234] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [236] movapd   xmm0, [RDI]
   movapd   (%rdi),%xmm0

and for example the normalize function

Quote
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_normalize$$tglzvector2d
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D:
.Lc255:
# Var $self located in register rdi
# [223] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [224] movapd xmm2, [RDI]
   movapd   (%rdi),%xmm2
# [225] movapd xmm0, xmm2
   movapd   %xmm2,%xmm0
# [226] mulpd  xmm2, xmm2
   mulpd   %xmm2,%xmm2
# [227] haddpd xmm2, xmm2
   haddpd   %xmm2,%xmm2
# [228] sqrtpd xmm2, xmm2
   sqrtpd   %xmm2,%xmm2
# [229] divpd  xmm0, xmm2
   divpd   %xmm2,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [230] end;
   ret
   # Register xmm0,xmm1 released
.Lc256:
.Le115:
   .size   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D, .Le115 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D

It's not a problem with alignment; all seems correct. It's odd behaviour. Perhaps I'll have to open a bug report.

If someone can test on a Linux distro other than mine (Manjaro), it would help.

Quote
In other news, I have finally fixed the bug with "vectorcall" where it puts Self into RCX instead of RDI on Linux, instead of silently ignoring the Windows-only calling convention.  Patch is here: https://bugs.freepascal.org/view.php?id=33542 - sorry it took so long, especially for a surprisingly simple fix.

No problem you've already made an awesome job with that ;)
Title: Re: AVX and SSE support question
Post by: CuriousKit on April 06, 2018, 05:16:37 pm
It might just be a missing comment in the .s file, but under Linux 64-bit and vectorcall, a return vector of 2 doubles is wholly contained within XMM0 (it might be split between XMM0 and XMM1 though if it's not aligned, which is technically incorrect for the System V ABI).  What does the disassembly show when you try to call Normalize and assign the result?
Title: Re: AVX and SSE support question
Post by: BeanzMaster on April 06, 2018, 06:38:15 pm
Hi CK, thanks, that's surely the solution. I'll need to add a movhlps xmm1, xmm0; I didn't take care of that. I don't use Linux often. I checked the TGLZVector4f code; it's the same issue. I'm currently not at home; I'll check tonight and tell you 👍
Thanks
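For reference, the fix amounts to an epilogue like this (a sketch of the idea, not the library's actual code, assuming FPC 3.0.4's split XMM0/XMM1 return for a 2-double vector on Linux x86_64):

```pascal
class operator TGLZVector2d.+(constref A, B: TGLZVector2d): TGLZVector2d;
  assembler; nostackframe; register;
asm
  movapd  xmm0, [A]
  addpd   xmm0, [B]
  movhlps xmm1, xmm0  // FPC 3.0.4 Linux: the high double is returned in XMM1
end;
```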
Title: Re: AVX and SSE support question
Post by: BeanzMaster on April 06, 2018, 10:05:57 pm
All tests are green now, thanks again CuriousKit  8-)
Title: Re: AVX and SSE support question
Post by: CuriousKit on April 06, 2018, 10:50:34 pm
No problem at all.

Note: In FPC 3.0.4, it is definitely split between XMM0 and XMM1 for Linux 64-bit.  When FPC 3.1.1 is released, the result for a vector of 2 doubles will likely just be contained within XMM0, and hence your code will require updating.
Title: Re: AVX and SSE support question
Post by: BeanzMaster on April 06, 2018, 11:45:52 pm
Quote
Note: In FPC 3.0.4, it is definitely split between XMM0 and XMM1 for Linux 64-bit.  When FPC 3.1.1 is released, the result for a vector of 2 doubles will likely just be contained within XMM0 and hence your code will require updating.

Yes, I've made some tests from trunk under Windows and it's promising (not with vectorcall yet). But FPC 3.1.1 is clearly better from a performance point of view.