
Author Topic: x64 32bit floating point operations is slower than x86  (Read 1910 times)

kagamma

  • New member
  • Posts: 8
x64 32bit floating point operations is slower than x86
« on: January 19, 2021, 03:56:43 am »
I ran a small benchmark on floating-point operations and noticed that performance is worse when compiling for the x86_64 target.

Specs:
OS: Windows 10
CPU: Intel Core i5-9400F
FPC version: 3.2.0-r45643

command line (x64): fpc -Twin64 -Px86_64 -CfSSE42 -CpCOREI -O4 -Sv test.pas
command line (x86): fpc -CfSSE42 -CpCOREI -O4 -Sv test.pas

Code:
Code: Pascal
    {$mode objfpc}
    {$asmmode intel}
    {$assertions on}

    uses
      SysUtils;

    type
      TVector3 = record
        X, Y, Z: Single;
      end;

    // Plain Pascal version; code generation is left to the compiler.
    function Mult(const V1, V2: TVector3): TVector3;
    begin
      Result.X := V1.X * V2.X;
      Result.Y := V1.Y * V2.Y;
      Result.Z := V1.Z * V2.Z;
    end;

    // Hand-written version. Each 12-byte record is loaded as two overlapping
    // 8-byte halves (bytes 4..11 and bytes 0..7), so nothing past the record is read.
    function Mult_SSE(constref V1, V2: TVector3): TVector3; assembler; nostackframe;
    asm
      movhps xmm0,qword ptr [V1 + 4]
      movlps xmm0,qword ptr [V1]
      movhps xmm1,qword ptr [V2 + 4]
      movlps xmm1,qword ptr [V2]
      mulps  xmm0,xmm1
      movhps qword ptr [Result + 4],xmm0
      movlps qword ptr [Result],xmm0
    end;

    var
      V, V1, V2: TVector3;
      I: Integer;
      Tick: QWord;  // GetTickCount64 returns a QWord

    begin
      V1.X := 1; V1.Y := 2; V1.Z := 3;
      V2.X := 4; V2.Y := 5; V2.Z := 6;
      Write('fpc: ');
      Tick := GetTickCount64;
      for I := 0 to 99999999 do
        V := Mult(V1, V2);
      assert(V.X = 1*4);
      assert(V.Y = 2*5);
      assert(V.Z = 3*6);
      Writeln(GetTickCount64 - Tick, 'ms');
      Write('Hand code: ');
      Tick := GetTickCount64;
      for I := 0 to 99999999 do
        V := Mult_SSE(V1, V2);
      Writeln(GetTickCount64 - Tick, 'ms');
      assert(V.X = 1*4);
      assert(V.Y = 2*5);
      assert(V.Z = 3*6);
    end.

Result:
- fpc (x86): ~203ms
- fpc (x64): ~500ms
- Hand code (x64): ~156ms

Notice that the x86 fpc result is quite close to the hand-coded SSE version of the Mult function, whereas the x64 fpc result is, well  :-[

Edit: After looking at the asm output, x86 and x64 generate the same assembly code. Furthermore, if I convert TVector3 to TVector4, then x86 and x64 run at the same speed. So I guess it's more about memory access performance than the code itself.
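
For reference, the TVector4 variant mentioned above could look roughly like this. It is only a sketch: the field name W and its zero initialisation are assumptions, and it is not the exact code that was timed. The point is just that the record grows from 12 to 16 bytes.

```pascal
{$mode objfpc}

type
  // Padded variant of TVector3; the fourth field name (W) is an assumption.
  TVector4 = record
    X, Y, Z, W: Single;
  end;

// Same element-wise multiply as Mult, extended to the fourth field.
function Mult(const V1, V2: TVector4): TVector4;
begin
  Result.X := V1.X * V2.X;
  Result.Y := V1.Y * V2.Y;
  Result.Z := V1.Z * V2.Z;
  Result.W := V1.W * V2.W;
end;

var
  V, V1, V2: TVector4;

begin
  V1.X := 1; V1.Y := 2; V1.Z := 3; V1.W := 0;
  V2.X := 4; V2.Y := 5; V2.Z := 6; V2.W := 0;
  V := Mult(V1, V2);
  // Four Singles with no padding, so the record is 16 bytes.
  Writeln('SizeOf(TVector4) = ', SizeOf(TVector4));
  Writeln(V.X:0:1, ' ', V.Y:0:1, ' ', V.Z:0:1, ' ', V.W:0:1);
end.
```

To time it, the same loop as in the benchmark above can be wrapped around the Mult call.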
« Last Edit: January 19, 2021, 05:26:05 am by kagamma »

mika

  • New Member
  • Posts: 33
Re: x64 32bit floating point operations is slower than x86
« Reply #1 on: March 09, 2021, 02:47:43 pm »
1. For the fpc version you use
Code: Pascal
    function Mult(const V1, V2: TVector3): TVector3;
This will give better results:
Code: Pascal
    function Mult(constref V1, V2: TVector3): TVector3;

2. The reason for the bad results of the 64-bit Pascal implementation is the way the parameters are passed and the result is returned.

3. I did some testing myself.
OS: Linux 64-bit

Code: Pascal
    {$mode objfpc}
    {$asmmode intel}
    {$assertions on}

    uses
      SysUtils;

    type
      TVector3 = record
        X, Y, Z: Single;
      end;

    function Mult(constref V1, V2: TVector3): TVector3;
    begin
      Result.X := V1.X * V2.X;
      Result.Y := V1.Y * V2.Y;
      Result.Z := V1.Z * V2.Z;
    end;

    // Same multiply, but the result is written through a var parameter
    // instead of being returned as a function result.
    procedure MultPr(constref V1, V2: TVector3; var Result: TVector3);
    begin
      Result.X := V1.X * V2.X;
      Result.Y := V1.Y * V2.Y;
      Result.Z := V1.Z * V2.Z;
    end;

    // Relies on the Linux x86-64 (System V) convention of returning the record
    // in xmm0 (first 8 bytes) and xmm1 (remaining bytes).
    // Note: movdqu loads 16 bytes, so with the 12-byte record it reads
    // 4 bytes past the end of each operand.
    function Mult_SSE(constref V1, V2: TVector3): TVector3; assembler; nostackframe;
    asm
      movdqu xmm0, [v1]
      movdqu xmm1, [v2]
      mulps  xmm0, xmm1
      movdqu xmm1, xmm0
      psrldq xmm1, 8
    end;

    procedure Mult_SSEPr(constref V1, V2: TVector3; var Result: TVector3); assembler; nostackframe;
    asm
      {$if sizeof(TVector3)=12}
      // 12-byte record: multiply X and Y as a pair, then Z separately.
      movq   xmm0, [v1]
      movq   xmm1, [v2]
      mulps  xmm0, xmm1
      movq   [Result], xmm0

      movss  xmm0, [v1+8]
      movss  xmm1, [v2+8]
      mulss  xmm0, xmm1
      movss  [Result+8], xmm0
      {$else}
      {$if sizeof(TVector3)=16}
      // 16-byte record: one full-width load, multiply and store.
      movdqu xmm0, [v1]
      movdqu xmm1, [v2]
      mulps  xmm0, xmm1
      movdqu [Result], xmm0
      {$else}
      {$fatal sizeof(TVector3) has to be 12 or 16 bytes }
      {$endif}
      {$endif}
    end;

    var
      V, V1, V2: TVector3;
      I: Integer;
      Tick: QWord;  // GetTickCount64 returns a QWord

    begin
      Writeln('Size of TVector3 ', sizeof(TVector3), ' bytes');
      V1.X := 1; V1.Y := 2; V1.Z := 3;
      V2.X := 4; V2.Y := 5; V2.Z := 6;

      Write('fpc function: ');
      Tick := GetTickCount64;
      for I := 0 to 99999999 do
        V := Mult(V1, V2);
      Writeln(GetTickCount64 - Tick, 'ms');

      assert(V.X = 1*4);
      assert(V.Y = 2*5);
      assert(V.Z = 3*6);

      Write('fpc procedure: ');
      Tick := GetTickCount64;
      for I := 0 to 99999999 do
        MultPr(V1, V2, V);
      Writeln(GetTickCount64 - Tick, 'ms');

      assert(V.X = 1*4);
      assert(V.Y = 2*5);
      assert(V.Z = 3*6);

      Write('Hand code function: ');
      Tick := GetTickCount64;
      for I := 0 to 99999999 do
        V := Mult_SSE(V1, V2);
      Writeln(GetTickCount64 - Tick, 'ms');

      assert(V.X = 1*4);
      assert(V.Y = 2*5);
      assert(V.Z = 3*6);

      Write('Hand code procedure: ');
      Tick := GetTickCount64;
      for I := 0 to 99999999 do
        Mult_SSEPr(V1, V2, V);
      Writeln(GetTickCount64 - Tick, 'ms');

      assert(V.X = 1*4);
      assert(V.Y = 2*5);
      assert(V.Z = 3*6);
    end.

output:
Size of TVector3 12 bytes
fpc function: 817ms
fpc procedure: 128ms
Hand code function: 185ms
Hand code procedure: 139ms

And for a TVector3 of 4 Singles:

Code: Pascal
    type
      TVector3 = record
        X, Y, Z, A: Single;
      end;

output:
Size of TVector3 16 bytes
fpc function: 1164ms
fpc procedure: 161ms
Hand code function: 945ms
Hand code procedure: 116ms

In my tests, the function versions gave much worse results when TVector3 was 4 Singles instead of 3.
« Last Edit: March 09, 2021, 02:54:23 pm by mika »

 
