Author Topic: x64 32bit floating point operations is slower than x86  (Read 336 times)

kagamma
« on: January 19, 2021, 03:56:43 am »
I ran a small benchmark on single-precision floating-point operations and noticed that performance is worse when compiling for the x86_64 target.

Specs:
OS: Windows 10
CPU: Intel Core i5-9400F
FPC version: 3.2.0-r45643

command line (x64): fpc -Twin64 -Px86_64 -CfSSE42 -CpCOREI -O4 -Sv test.pas
command line (x86): fpc -CfSSE42 -CpCOREI -O4 -Sv test.pas

Code (Pascal):
{$mode objfpc}
{$asmmode intel}
{$assertions on}

uses
  SysUtils;

type
  TVector3 = record
    X, Y, Z: Single;
  end;

// Plain Pascal component-wise multiply; the compiler picks the instructions.
function Mult(const V1, V2: TVector3): TVector3;
begin
  Result.X := V1.X * V2.X;
  Result.Y := V1.Y * V2.Y;
  Result.Z := V1.Z * V2.Z;
end;

// Hand-written SSE version: two overlapping 8-byte loads cover the 12-byte record.
function Mult_SSE(constref V1, V2: TVector3): TVector3; assembler; nostackframe;
asm
  movhps xmm0,qword ptr [V1 + 4]
  movlps xmm0,qword ptr [V1]
  movhps xmm1,qword ptr [V2 + 4]
  movlps xmm1,qword ptr [V2]
  mulps  xmm0,xmm1
  movhps qword ptr [Result + 4],xmm0
  movlps qword ptr [Result],xmm0
end;

var
  V, V1, V2: TVector3;
  I: Integer;
  Tick: QWord; // GetTickCount64 returns QWord, so Integer would be the wrong type

begin
  V1.X := 1; V1.Y := 2; V1.Z := 3;
  V2.X := 4; V2.Y := 5; V2.Z := 6;
  Write('fpc: ');
  Tick := GetTickCount64;
  for I := 0 to 99999999 do
    V := Mult(V1, V2);
  Writeln(GetTickCount64 - Tick, 'ms');
  // Check the result outside the timed region
  assert(V.X = 1*4);
  assert(V.Y = 2*5);
  assert(V.Z = 3*6);
  Write('Hand code: ');
  Tick := GetTickCount64;
  for I := 0 to 99999999 do
    V := Mult_SSE(V1, V2);
  Writeln(GetTickCount64 - Tick, 'ms');
  assert(V.X = 1*4);
  assert(V.Y = 2*5);
  assert(V.Z = 3*6);
end.

Result:
- fpc (x86): ~203ms
- fpc (x64): ~500ms
- Hand code (x64): ~156ms

Notice that the x86 fpc result is quite close to the hand-coded SSE version of the Mult function, whereas the x64 fpc result is, well  :-[

Edit: After looking at the asm output, the compiler generates the same assembly code for x86 and x64. Furthermore, if I change TVector3 to TVector4, x86 and x64 run at the same speed. So I guess it's more about memory access performance than the code itself.
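
For anyone who wants to repeat the TVector4 comparison, here is a minimal sketch of that variant. The names TVector4, Mult4 and the W field are illustrative and not taken from the test program above; W is assumed to act only as padding that brings the record from 12 to 16 bytes. This only reproduces the experiment and does not by itself show why the 12-byte record is slower on x64.

```pascal
program Vec4Bench;

{$mode objfpc}

uses
  SysUtils;

type
  // Hypothetical 16-byte variant of TVector3; W is padding only.
  TVector4 = record
    X, Y, Z, W: Single;
  end;

// Same component-wise multiply as Mult, extended to four fields.
function Mult4(const V1, V2: TVector4): TVector4;
begin
  Result.X := V1.X * V2.X;
  Result.Y := V1.Y * V2.Y;
  Result.Z := V1.Z * V2.Z;
  Result.W := V1.W * V2.W;
end;

var
  V, V1, V2: TVector4;
  I: Integer;
  Tick: QWord;

begin
  V1.X := 1; V1.Y := 2; V1.Z := 3; V1.W := 0;
  V2.X := 4; V2.Y := 5; V2.Z := 6; V2.W := 0;

  // Record sizes are the only intended difference from the original test.
  Writeln('SizeOf(TVector4) = ', SizeOf(TVector4));

  Tick := GetTickCount64;
  for I := 0 to 99999999 do
    V := Mult4(V1, V2);
  Writeln('fpc (TVector4): ', GetTickCount64 - Tick, 'ms');

  // Sanity check outside the timed region.
  Assert(V.X = 1*4);
  Assert(V.Y = 2*5);
  Assert(V.Z = 3*6);
end.
```

Build it with the same two command lines as above (substituting the file name) and compare the x86 and x64 timings against the TVector3 numbers.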
« Last Edit: January 19, 2021, 05:26:05 am by kagamma »