Author Topic: AVX and SSE support question  (Read 89903 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #15 on: November 18, 2017, 08:35:55 pm »
FYI, my earlier non-working attempt:

Code: Pascal
  1.      program DXMathTest;
  2.   {$mode delphi}    
  3.     type
  4.       TXMFloat4 = record
  5.         sX:array[0..3] of single;
  6.         class operator Add(constref A, B: TXMFloat4): TXMFloat4; inline;
  7.         property x : single read sx[0] write sx[0];
  8.         property y : single read sx[1] write sx[1];
  9.         property z : single read sx[2] write sx[2];
  10.         property w : single read sx[3] write sx[3];
  11.       end;
  12.      
  13.       class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
  14.       begin
  15.         result.Sx:=a.sX+b.sX;
  16.       end;
  17.      
  18.      var x,y,z : txmfloat4;
  19.     begin
  20.       x.x:=1; x.y:=2; x.z:=3; x.w:=4;
  21.       y.x:=5; y.y:=6; y.z:=7; y.w:=8;
  22.       z:=x+y;
  23.       writeln(z.x);
  24.    
  25.     end.

and compile with

fpc -CfAVX -CpCOREAVX  -O3 -Sv -OpCOREAVX -OoFASTMATH avstest4 -Si -al

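It doesn't work because no correct code is generated for storing the value in Result. The a+b addition itself goes fine, but assigning the sum to Result (z) goes wrong, both inlined and not. In the listing, the sum in %xmm0 is overwritten by a load from the stack before it is stored to z: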

Code: Pascal
# [22] z:=x+y;
        movdqa  U_$P$DXMATHTEST_$$_X(%rip),%xmm0
        addps   U_$P$DXMATHTEST_$$_Y(%rip),%xmm0
        vmovups 40(%rsp),%xmm0
        vmovups %xmm0,U_$P$DXMATHTEST_$$_Z(%rip)
« Last Edit: November 18, 2017, 08:38:39 pm by marcov »

Akira1364

  • Hero Member
  • *****
  • Posts: 561
Re: AVX and SSE support question
« Reply #16 on: November 18, 2017, 09:53:23 pm »
Interesting! Definitely seems to work fine with free-standing values in the record (not properties), though. I also just tested the following version which makes it a variant record with both free values and a static array:

Code: Pascal
program DXMathTest;

{$mode Delphi}

uses
  SysUtils;

type
  TXMFloat4 = record
    class operator Add(constref A, B: TXMFloat4): TXMFloat4; inline;
    case Byte of
      0: (X, Y, Z, W: Single);
      1: (V4: array[0..3] of Single);
  end;

  class operator TXMFloat4.Add(constref A, B: TXMFloat4): TXMFloat4;
  begin
    with Result do
    begin
      X := A.X + B.X;
      Y := A.Y + B.Y;
      Z := A.Z + B.Z;
      W := A.W + B.W;
    end;
  end;

const
  A: TXMFloat4 = (X: 5.0; Y: 5.0; Z: 5.0; W: 0.5);
  B: TXMFloat4 = (X: 2.5; Y: 2.5; Z: 2.5; W: 0.5);

var
  C: TXMFloat4;

begin
  C := A + B;
  with C do
  begin
    WriteLn(X.ToString + #32 + Y.ToString
            + #32 + Z.ToString + #32 + W.ToString + #13);
    WriteLn(V4[0].ToString + #32 + V4[1].ToString
            + #32 + V4[2].ToString + #32 + V4[3].ToString);
  end;
  ReadLn;
end.


Used the same compiler settings as before and had no issues there, either. Both WriteLns print out ('7.5, 7.5, 7.5, 1') as expected.
« Last Edit: November 18, 2017, 09:56:46 pm by Akira1364 »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #17 on: November 18, 2017, 09:58:19 pm »
You have to add the array form to make -Sv work, so that the add is one instruction for all 4 singles.

Akira1364

  • Hero Member
  • *****
  • Posts: 561
Re: AVX and SSE support question
« Reply #18 on: November 18, 2017, 10:11:31 pm »
Ah, didn't realize you were focusing specifically on the -Sv functionality there. Also didn't know -Sv was limited to arrays... I was under the impression that it was more of a general "hint" for the compiler to attempt to auto-vectorize where possible.

Still though, even without -Sv, for Sonny's purposes I think simply adding constref to his existing method will get him a lot closer performance-wise to where he wants (while still actually working), as my test showed. From 50-something lines of ASM down to 23 isn't bad at all IMO.
« Last Edit: November 19, 2017, 06:16:03 am by Akira1364 »

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #19 on: November 19, 2017, 11:07:25 pm »
Hi, I've played a little bit with SSE and AVX.

This is the code I used:

Code: Pascal
    TBZVector4f = packed record
      case integer of
       0 : (V:Array[0..3] of Single);
       1 : (X, Y, Z, W : single);
    end;
    TBZVector = TBZVector4f;

function nc_VectorAdd(ConstRef AVector, AVector2: TBZVector):TBZVector;
begin
 result.x:=AVector.x + AVector2.x;
 result.y:=AVector.y + AVector2.y;
 result.z:=AVector.z + AVector2.z;
 result.w :=AVector.w+ AVector2.w;
end;

function asm_sse_VectorAdd(ConstRef V1, V2: TBZVector):TBZVector; assembler;nostackframe;register;
asm
  movups xmm0,[V1]
  movups xmm1,[V2]
  addps xmm0,xmm1
  movups [RESULT], XMM0
end;

function asm_avx_VectorAdd(ConstRef V1, V2: TBZVector):TBZVector; assembler;nostackframe;register;
asm
  vmovups xmm0,[V1]
  vmovups xmm1,[V2]
  vaddps xmm0,xmm1, xmm0
  vmovups [RESULT], XMM0
end;

The test code is a simple loop over two vectors that are, of course, initialized beforehand, like this:

Code: Pascal

  v1:=VectorMake(1.198,1.264,1.387);
  v2:=VectorMake(2.542,2.289,2.311);

for i:=0 to 9999999 do
begin
    V:=asm_avx_VectorAdd(V1,V2);
end;

and with these compiler options:

Quote
   
    -al
    -O3
    -Sv
    -OoFASTMATH
    -CfAVX
    -CpCOREAVX
    -OpCOREAVX
    -CPPACKRECORD=8

Finally, the results:

    - AVX    : 14318.3959415555 µs
    - SSE    : 15241.6712796688 µs
    - Native : 23578.505629003 µs

In conclusion, without these options:

Quote

 -CfAVX
 -CpCOREAVX
 -OpCOREAVX

1st - SSE performance drops, and the native code is faster.
2nd - Without "nostackframe" and "register", performance decreases (for both SSE and AVX).
3rd - Using movaps/vmovaps instead of movups/vmovups does not make a big difference.

I haven't checked the generated assembly yet, so that is for next time.
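For anyone who wants to repeat this kind of timing, here is a minimal sketch of a harness in plain Pascal. It is not the code used for the numbers above: `TVec4`, `VecAdd`, `Step` and `Acc` are made-up names, W is assumed to be 0 because the `VectorMake` calls only show three arguments, and `GetTickCount64` only gives millisecond resolution. Each result is fed into the next call, so the optimizer cannot drop the loop at -O3.

```pascal
program AddBench;

{$mode objfpc}

uses
  SysUtils;

type
  { Hypothetical stand-in for TBZVector; not the original declaration. }
  TVec4 = record
    X, Y, Z, W: Single;
  end;

{ Plain Pascal reference implementation to time. }
function VecAdd(constref A, B: TVec4): TVec4;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

const
  Iterations = 10000000;  { same count as the loop "0 to 9999999" }
  Step: TVec4 = (X: 2.542; Y: 2.289; Z: 2.311; W: 0.0);  { W = 0 is an assumption }

var
  Acc: TVec4 = (X: 1.198; Y: 1.264; Z: 1.387; W: 0.0);
  I: Integer;
  T0, T1: QWord;

begin
  T0 := GetTickCount64;
  for I := 1 to Iterations do
    Acc := VecAdd(Acc, Step);  { the result feeds the next call }
  T1 := GetTickCount64;
  WriteLn('Elapsed: ', T1 - T0, ' ms');
  WriteLn('Final X: ', Acc.X:0:3);  { printing the result keeps the loop observable }
end.
```

Swapping `VecAdd` for one of the assembler versions and rebuilding with and without -CfAVX -CpCOREAVX -OpCOREAVX should give a like-for-like comparison.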





Akira1364

  • Hero Member
  • *****
  • Posts: 561
Re: AVX and SSE support question
« Reply #20 on: November 20, 2017, 12:38:18 am »
You could even use the assembler approach within an operator to replace the apparently buggy -Sv functionality, by the way. I tested your asm (with VMOVAPS instead of VMOVUPS as the alignment is already known/correct by default in FPC) with my example from yesterday and the results were still valid/exactly as expected:

Code: Pascal
program DXMathTest;

{$modeswitch AdvancedRecords}

uses
  SysUtils;

type
  TXMFloat4 = record
    class operator +(constref A, B: TXMFloat4): TXMFloat4; assembler;
    case Byte of
      0: (X, Y, Z, W: Single);
      1: (V4: array[0..3] of Single);
  end;

  class operator TXMFloat4.+(constref A, B: TXMFloat4): TXMFloat4; assembler;
  asm
    VMOVAPS XMM0,[A]
    VMOVAPS XMM1,[B]
    VADDPS XMM0,XMM1, XMM0
    VMOVAPS [RESULT], XMM0
  end;

const
  A: TXMFloat4 = (X: 5.0; Y: 5.0; Z: 5.0; W: 0.5);
  B: TXMFloat4 = (X: 2.5; Y: 2.5; Z: 2.5; W: 0.5);

var
  C: TXMFloat4;

begin
  C := A + B;
  with C do
  begin
    WriteLn(X.ToString, #32, Y.ToString,
            #32, Z.ToString, #32, W.ToString);
    WriteLn(V4[0].ToString, #32, V4[1].ToString,
            #32, V4[2].ToString, #32, V4[3].ToString);
  end;
  ReadLn;
end.

Note that I didn't declare the record as packed, as doing so (at least in this case) doesn't actually seem to do anything at all. The assembler output with it packed and unpacked was completely identical.

I used slightly different, simpler compiler options this time, as well:
Code: Pascal
-al -CfAVX2 -CpCOREAVX2 -O4 -OpCOREAVX2

That being said though, it certainly seems like -Sv is very very close to working properly and probably isn't missing a huge amount of code. Anyone know exactly where in the compiler codebase it's implemented?
« Last Edit: November 20, 2017, 04:24:05 pm by Akira1364 »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: AVX and SSE support question
« Reply #21 on: November 20, 2017, 07:04:33 am »
And -Sv can be inlined, so sequences of such operations would be more optimal.

I don't know much about compiler internals, but I usually look up the command-line parsing in options.pas to see what the switch (-Sv) sets, and then grep for that.


Nitorami

  • Sr. Member
  • ****
  • Posts: 481
Re: AVX and SSE support question
« Reply #22 on: November 20, 2017, 10:11:26 am »
As to the -Sv switch, I learned that it was introduced a few years ago, but there is nobody to maintain it and there are no test cases, so it is probably broken. At least this is what Jonas said somewhere in the bug tracker on this topic. All -Sv seems to do is give you a false impression of working: it makes the compiler accept operations on vectors of floats, but the result is garbage. If anyone can produce a working example using the -Sv switch, I would be very interested.

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: AVX and SSE support question
« Reply #23 on: November 20, 2017, 10:26:18 am »
-Sv *does* work, but I have to test whether it works with anything other than a fixed-size array whose length is a power of two.
I don't have any code for that yet, because I assumed... it had to be. I only use it for audio.
If you examine the assembler output (-s), at least on x64 and armhf, you will see it definitely works.
Specialize a type, not a var.

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #24 on: November 20, 2017, 04:42:26 pm »
Hi, I've made another test based on Akira's code:

Code: Pascal

Unit Unit1;

{$mode objfpc}{$H+}
{$DEFINE USE_ASM}
{.$DEFINE USE_SSE_ASM}
{$DEFINE USE_AVX_ASM}
{$MODESWITCH ADVANCEDRECORDS}

Interface

Uses
  Classes, Sysutils, Fileutil, Forms, Controls, Graphics, Dialogs, StdCtrls;



Type
  { Tform1 }
  Tform1 = Class(Tform)
    Button1 : Tbutton;
    Memo1 : Tmemo;
    Procedure Button1click(Sender : Tobject);
  Private

  Public

  End;

type
  TGLZVector3fType = array[0..2] of Single;
  TGLZVector4fType = array[0..3] of Single;

  TGLZVector3f = record
    case Byte of
      0: (X, Y, Z: Single);
      1: (V: TGLZVector3fType);
  End;

  TGLZVector4f = record
    public
      procedure Create(Const aX,aY,aZ,aW : Single);
      function ToString : String;

      class operator +(constref A, B: TGLZVector4f): TGLZVector4f; overload;
      class operator -(constref A, B: TGLZVector4f): TGLZVector4f; overload;
      class operator *(constref A, B: TGLZVector4f): TGLZVector4f; overload;
      class operator /(constref A, B: TGLZVector4f): TGLZVector4f; overload;

      class operator +(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
      class operator -(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
      class operator *(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;
      class operator /(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; overload;

      case Byte of
        0: (X, Y, Z, W: Single);
        1: (V: TGLZVector4fType);
        2: (AsVector3f : TGLZVector3f);
  end;


Var
  Form1 : Tform1;

Implementation

{$R *.lfm}

{ Tform1 }

Procedure Tform1.Button1click(Sender : Tobject);
Var
  V1, V2, V3 : TGLZVector4f;
  Float : Single;
Begin
  Float := 1.5;
  v1.Create(5.0,5.0,5.0,0.5);
  v2.Create(2.5,2.5,2.5,0.5);

  Memo1.Lines.Add('V1 = '+v1.ToString);
  Memo1.Lines.Add('V2 = '+v2.ToString);
  Memo1.Lines.Add('Float = 1.5');
  Memo1.Lines.Add('');
  Memo1.Lines.Add('Operations : ');
  Memo1.Lines.Add('----------------------------------------');
  V3 := V1 + V2;
  Memo1.Lines.Add('V3 = V1 + V2 = '+v3.ToString);
  V3 := V1 - V2;
  Memo1.Lines.Add('V3 = V1 - V2 = '+v3.ToString);
  V3 := V1 * V2;
  Memo1.Lines.Add('V3 = V1 * V2 = '+v3.ToString);
  V3 := V1 / V2;
  Memo1.Lines.Add('V3 = V1 / V2 = '+v3.ToString);
  Memo1.Lines.Add('----------------------------------------');
  V3 := V1 + Float;
  Memo1.Lines.Add('V3 = V1 + Float = '+v3.ToString);
  V3 := V1 - Float;
  Memo1.Lines.Add('V3 = V1 - Float = '+v3.ToString);
  V3 := V1 * Float;
  Memo1.Lines.Add('V3 = V1 * Float = '+v3.ToString);
  V3 := V1 / Float;
  Memo1.Lines.Add('V3 = V1 / Float = '+v3.ToString);
End;


procedure TGLZVector4f.Create(Const aX,aY,aZ,aW : Single);
begin
   Self.X := AX;
   Self.Y := AY;
   Self.Z := AZ;
   Self.W := AW;
end;

function TGLZVector4f.ToString : String;
begin
   Result := '(X: '+FloattoStrF(Self.X,fffixed,5,5)+
            ' ,Y: '+FloattoStrF(Self.Y,fffixed,5,5)+
            ' ,Z: '+FloattoStrF(Self.Z,fffixed,5,5)+
            ' ,W: '+FloattoStrF(Self.W,fffixed,5,5)+')';
End;

{$IFDEF USE_ASM}
{$IFDEF USE_AVX_ASM}
class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVUPS XMM1,[B]
  VADDPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.-(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVUPS XMM1,[B]
  VSUBPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.*(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVUPS XMM1,[B]
  VMULPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f./(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVUPS XMM1,[B]
  VDIVPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVSS  XMM1,[B]
  VSHUFPS XMM1, XMM1, XMM1,0
  VADDPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.-(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVSS  XMM1,[B]
  VSHUFPS XMM1, XMM1, XMM1,0
  VSUBPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.*(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVSS  XMM1,[B]
  VSHUFPS XMM1, XMM1, XMM1,0
  VMULPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f./(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  VMOVUPS XMM0,[A]
  VMOVSS  XMM1,[B]
  VSHUFPS XMM1, XMM1, XMM1,0
  VDIVPS  XMM0,XMM1, XMM0
  VMOVUPS [RESULT], XMM0
end;

{$ENDIF}

{$IFDEF USE_SSE_ASM}
class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVUPS XMM1,[B]
  ADDPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.-(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVUPS XMM1,[B]
  SUBPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.*(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVUPS XMM1,[B]
  MULPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f./(constref A, B: TGLZVector4f): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVUPS XMM1,[B]
  DIVPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVSS  XMM1,[B]
  SHUFPS XMM1, XMM1,0
  ADDPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.-(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVSS  XMM1,[B]
  SHUFPS XMM1, XMM1,0
  SUBPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f.*(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVSS  XMM1,[B]
  SHUFPS XMM1, XMM1,0
  MULPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;

class operator TGLZVector4f./(constref A: TGLZVector4f; constref B:Single): TGLZVector4f; assembler;
asm
  MOVUPS XMM0,[A]
  MOVSS  XMM1,[B]
  SHUFPS XMM1, XMM1,0
  DIVPS  XMM0,XMM1
  MOVUPS [RESULT], XMM0
end;


{$ENDIF}
{$ELSE}
class operator TGLZVector4f.+(constref A, B: TGLZVector4f): TGLZVector4f;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;
{$ENDIF}

Results with:

V1 = (X: 5.00000 ,Y: 5.00000 ,Z: 5.00000 ,W: 0.50000)
V2 = (X: 2.50000 ,Y: 2.50000 ,Z: 2.50000 ,W: 0.50000)
Float = 1.5

SSE Operations :
----------------------------------------
V3 = V1 + V2 = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 1.00000) OK
V3 = V1 - V2 = (X: 2.50000 ,Y: 2.50000 ,Z: 2.50000 ,W: 0.00000) OK
V3 = V1 * V2 = (X: 12.50000 ,Y: 12.50000 ,Z: 12.50000 ,W: 0.25000) OK
V3 = V1 / V2 = (X: 2.00000 ,Y: 2.00000 ,Z: 2.00000 ,W: 1.00000) OK
----------------------------------------
V3 = V1 + Float = (X: 6.50000 ,Y: 6.50000 ,Z: 6.50000 ,W: 2.00000) OK
V3 = V1 - Float = (X: 3.50000 ,Y: 3.50000 ,Z: 3.50000 ,W: -1.00000) OK
V3 = V1 * Float = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 0.75000) OK
V3 = V1 / Float = (X: 3.33333 ,Y: 3.33333 ,Z: 3.33333 ,W: 0.33333) OK

AVX Operations :
----------------------------------------
V3 = V1 + V2 = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 1.00000) OK
V3 = V1 - V2 = (X: -2.50000 ,Y: -2.50000 ,Z: -2.50000 ,W: 0.00000) NOK
V3 = V1 * V2 = (X: 12.50000 ,Y: 12.50000 ,Z: 12.50000 ,W: 0.25000) OK
V3 = V1 / V2 = (X: 0.50000 ,Y: 0.50000 ,Z: 0.50000 ,W: 1.00000) NOK
----------------------------------------
V3 = V1 + Float = (X: 6.50000 ,Y: 6.50000 ,Z: 6.50000 ,W: 2.00000) OK
V3 = V1 - Float = (X: -3.50000 ,Y: -3.50000 ,Z: -3.50000 ,W: 1.00000) NOK
V3 = V1 * Float = (X: 7.50000 ,Y: 7.50000 ,Z: 7.50000 ,W: 0.75000) OK
V3 = V1 / Float = (X: 0.30000 ,Y: 0.30000 ,Z: 0.30000 ,W: 3.00000) NOK

As you can see, subtraction and division give wrong results with AVX: subtraction returns the negated values and division returns the reciprocals.
I have probably forgotten something that I do not understand yet.

If someone can explain where my error is, that would be great.

Thanks

PS: Tested with  : -al -CfAVX2 -CpCOREAVX2 -O4 -OpCOREAVX2
and  -al -CfAVX -CpCOREAVX -O4 -OpCOREAVX

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: AVX and SSE support question
« Reply #25 on: November 20, 2017, 04:53:29 pm »
Well, apart from the inconsequential command line...
Your problem is with V3.
It would be really helpful if you examine the actual assembler output. So: -a option and examine .s

Also note you need to pack both arrays and records...
« Last Edit: November 20, 2017, 05:04:47 pm by Thaddy »

Akira1364

  • Hero Member
  • *****
  • Posts: 561
Re: AVX and SSE support question
« Reply #26 on: November 20, 2017, 05:20:25 pm »
You just need to invert A and B in the assembler for subtraction and division. Basically it should look like this:

Code: Pascal
  class operator TXMFloat4.+(constref A, B: TXMFloat4): TXMFloat4; assembler;
  asm
    VMOVAPS XMM0,[A]
    VMOVAPS XMM1,[B]
    VADDPS XMM0,XMM1, XMM0
    VMOVAPS [RESULT], XMM0
  end;

  class operator TXMFloat4.-(constref A, B: TXMFloat4): TXMFloat4; assembler;
  asm
    VMOVAPS XMM0,[B]
    VMOVAPS XMM1,[A]
    VSUBPS XMM0,XMM1, XMM0
    VMOVAPS [RESULT], XMM0
  end;

  class operator TXMFloat4./(constref A, B: TXMFloat4): TXMFloat4; assembler;
  asm
    VMOVAPS XMM0,[B]
    VMOVAPS XMM1,[A]
    VDIVPS XMM0,XMM1, XMM0
    VMOVAPS [RESULT], XMM0
  end;

Note how B goes first in the second two operators. Tested both of these and again, they work fine. Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.
« Last Edit: November 20, 2017, 05:36:30 pm by Akira1364 »

Thaddy

  • Hero Member
  • *****
  • Posts: 14201
  • Probably until I exterminate Putin.
Re: AVX and SSE support question
« Reply #27 on: November 20, 2017, 05:50:08 pm »
Quote from: Akira1364
Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.

As far as you can tell?... Always use packed when you want to defeat the compiler by using inline assembler.... Otherwise you are in trouble before you know it.
Anyway, I will check pure pascal against its assembler output myself, just curious....

BeanzMaster

  • Sr. Member
  • ****
  • Posts: 268
Re: AVX and SSE support question
« Reply #28 on: November 20, 2017, 10:32:41 pm »
Quote from: Thaddy
Well, apart from the inconsequential command line...
Your problem is with V3.
It would be really helpful if you examine the actual assembler output. So: -a option and examine .s

Also note you need to pack both arrays and records...

Thanks for the advice; I packed the arrays and the record. I also checked the .s file, and the generated code is the same.

Quote from: Akira1364
You just need to invert A and B in the assembler for subtraction and division. Basically it should look like this:

Code: Pascal
  class operator TXMFloat4.-(constref A, B: TXMFloat4): TXMFloat4; assembler;
  asm
    VMOVAPS XMM0,[B]
    VMOVAPS XMM1,[A]
    VSUBPS XMM0,XMM1, XMM0
    VMOVAPS [RESULT], XMM0
  end;

Note how B goes first in the second two operators. Tested both of these and again, they work fine. Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.

Yes, the inversion was the problem, but your trick is not the real solution (it doesn't work with the second set of overloaded operators, (V: TheVector; F: Single)).
After some research on the instructions and some testing, the correct form is:
VSUBPS XMM0, XMM0, XMM1
where the first operand is the destination, not the third as I had thought, so this computes XMM0 := XMM0 - XMM1.
Now the results are OK.

Thanks
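To make the operand order concrete, here is a minimal sketch of subtraction written with the destination first, for both the vector and the scalar overload. It is an illustration under assumptions, not the code from the posts above: it assumes an x86-64 target, a CPU with AVX, and Intel assembler syntax. `TFloat4`, `SubAVX` and `SubScalarAVX` are made-up names, and procedures with an `out` parameter are used instead of operators so that the result is always written through a pointer.

```pascal
program SubOrder;

{$mode objfpc}
{$asmmode intel}

type
  TFloat4 = array[0..3] of Single;

{ R := A - B, element by element. A, B and R are passed by reference,
  so inside the asm block their names stand for pointer registers. }
procedure SubAVX(constref A, B: TFloat4; out R: TFloat4); assembler; nostackframe;
asm
  VMOVUPS XMM0, [A]
  VMOVUPS XMM1, [B]
  VSUBPS  XMM0, XMM0, XMM1      { XMM0 := XMM0 - XMM1, i.e. A - B }
  VMOVUPS [R], XMM0
end;

{ R := A - S, with the scalar S copied into all four lanes first. }
procedure SubScalarAVX(constref A: TFloat4; constref S: Single; out R: TFloat4); assembler; nostackframe;
asm
  VMOVUPS XMM0, [A]
  VMOVSS  XMM1, [S]
  VSHUFPS XMM1, XMM1, XMM1, 0   { copy lane 0 into every lane }
  VSUBPS  XMM0, XMM0, XMM1      { XMM0 := A - broadcast(S) }
  VMOVUPS [R], XMM0
end;

var
  V1: TFloat4 = (5.0, 5.0, 5.0, 0.5);
  V2: TFloat4 = (2.5, 2.5, 2.5, 0.5);
  V3: TFloat4;
  F: Single = 1.5;

begin
  SubAVX(V1, V2, V3);
  WriteLn('V1 - V2    = ', V3[0]:0:5, ' ', V3[1]:0:5, ' ', V3[2]:0:5, ' ', V3[3]:0:5);
  SubScalarAVX(V1, F, V3);
  WriteLn('V1 - Float = ', V3[0]:0:5, ' ', V3[1]:0:5, ' ', V3[2]:0:5, ' ', V3[3]:0:5);
end.
```

The same pattern applies to VDIVPS. For VADDPS and VMULPS the order of the two sources does not matter, which is why only subtraction and division came out wrong in the earlier test.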



Nitorami

  • Sr. Member
  • ****
  • Posts: 481
Re: AVX and SSE support question
« Reply #29 on: November 20, 2017, 11:12:42 pm »
I took this assembler code and tried to make a small benchmark. The results seemed OK at first, although performance was not significantly better than with native FPC code. But when I start modifying code outside the assembler routines, I get access violations which are reproducible but appear random. For example, the code below works if Subtract() is called from within a loop, but otherwise it crashes. It also crashes when printing the results after the subtraction. This is rather dubious, and I guess there may be a bit more to consider when using assembler.


Code: Pascal
{$ASMMODE INTEL}
uses sysutils;
type float4 = packed array [0..3] of single;


function Subtract (constref A, B: float4): float4; assembler; inline;
asm
  VMOVAPS XMM0,[B]
  VMOVAPS XMM1,[A]
  VSUBPS XMM0,XMM1, XMM0
  VMOVAPS [RESULT], XMM0
end;


var c: float4;
    n: integer;
const a : float4 = (1,2,3,4);
const b : float4 = (5,6,7,8);

begin
//  c := subtract (a,b);  //outside the loop -> crash
// for n := 1 to 1 do  c := Subtract (a,b); //in the loop it does not crash
end.
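One possible cause of crashes like these (an assumption, not something confirmed in this thread) is alignment. VMOVAPS requires its memory operand to be 16-byte aligned and faults otherwise, whereas an array of Single only needs 4-byte alignment, so whether a constant, variable or temporary lands on a 16-byte boundary can change when unrelated code is modified. The sketch below is plain Pascal and only reports where the data ended up. `TFloat4` and `IsAligned16` are made-up names.

```pascal
program AlignCheck;

{$mode objfpc}

type
  TFloat4 = packed array[0..3] of Single;

{ Hypothetical helper: True if P is a multiple of 16. }
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (PtrUInt(P) and 15) = 0;
end;

const
  A: TFloat4 = (1, 2, 3, 4);
  B: TFloat4 = (5, 6, 7, 8);

var
  C: TFloat4;

begin
  { The answers depend on compiler version, settings and surrounding code. }
  WriteLn('A is 16-byte aligned: ', IsAligned16(@A));
  WriteLn('B is 16-byte aligned: ', IsAligned16(@B));
  WriteLn('C is 16-byte aligned: ', IsAligned16(@C));
end.
```

If any line reports FALSE, the unaligned forms (VMOVUPS, as used in the earlier posts) are the safe choice for that operand. Note that a function result may also be written to a compiler temporary, which this check cannot see.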

 
