
Author Topic: Benchmark regular vs SIMD vs constref  (Read 413 times)

LemonParty

  • Hero Member
  • *****
  • Posts: 530
Benchmark regular vs SIMD vs constref
« on: May 21, 2026, 02:42:35 pm »
I decided to write a small benchmark to check which is faster:
1. Regular min function;
2. Inline min function;
3. SIMD min function;
4. Constref inline min function;
5. Constref SIMD min function.

The benchmark:
Code: Pascal
{$mode objfpc}{$H+}
{$inline on}
{$If Defined(CPU386) OR Defined(CPUX64)}
  {$ASMMODE intel}
{$EndIf}
{$SMARTLINK ON}
{$Calling Register}

uses stopwatch;

function MinRegular(A, B, C: Integer): Integer;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

function MinInline(A, B, C: Integer): Integer;inline;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

function MinSIMD(A, B, C: Integer): Integer;assembler;nostackframe;
asm
  movD xmm0,A
  movD xmm1,B
  movD xmm2,C
  pminSD xmm0,xmm1
  pminSD xmm0,xmm2
  movD eax,xmm0
end;

function crMinInline(constref A, B, C: Integer): Integer;inline;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

function crMinSIMD(constref A, B, C: Integer): Integer;assembler;nostackframe;
asm
  movD xmm0,dword ptr[A]
  movD xmm1,dword ptr[B]
  movD xmm2,dword ptr[C]
  pminSD xmm0,xmm1
  pminSD xmm0,xmm2
  movD eax,xmm0
end;

const
  CSize = 1024 * 1024;
var
  i: SizeInt;
  sw: TStopWatch;
  pL: PLongInt;
  L: LongInt;
begin
  sw:= TStopWatch.Create;

  pL:= GetMem(CSize * SizeOf(LongInt));

  for i:= 0 to CSize - 1 do
    pL[i]:= Random(10) - 5;

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinRegular(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('Regular    : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinInline(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('Inline     : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinSIMD(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('SIMD       : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= crMinInline(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('crInline   : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= crMinSIMD(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('crSIMD     : ', sw.ElapsedTicks);
end.
The results surprised me.
Here are the results of running the benchmark:
Quote
Regular    : 9904
Inline     : 4978
SIMD       : 14428
crInline   : 64643
crSIMD     : 13974

Regular    : 9471
Inline     : 7020
SIMD       : 14831
crInline   : 64921
crSIMD     : 12251

Regular    : 11907
Inline     : 7517
SIMD       : 14944
crInline   : 63790
crSIMD     : 14353

Regular    : 14673
Inline     : 4474
SIMD       : 14399
crInline   : 68077
crSIMD     : 15249

Regular    : 9180
Inline     : 4913
SIMD       : 14527
crInline   : 62653
crSIMD     : 17705
Conclusions:
1. The simplest inline version is about 2 times faster than anything else.
2. The second fastest is the regular min function, which is about 2 times slower than inline.
3. The SIMD version is 2.5-3 times slower than inline.
4. The constref SIMD function is faster than the regular SIMD function. That is strange, because code that reads from memory should in principle be slower.
5. Constref inline: this is a disaster. It is about 12 (!) times slower than the simplest inline. Don't use constref with inline.
Lazarus v. 4.99. FPC v. 3.3.1. Windows 11

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12409
  • Debugger - SynEdit - and more
    • wiki
Re: Benchmark regular vs SIMD vs constref
« Reply #1 on: May 21, 2026, 03:01:07 pm »
Just to mention: the code for the loop itself (without payload) may run at very different speeds, depending on its alignment.

Use $CODEALIGN to align loops at 32 bytes.

Or pack each loop into a proc of its own, and align all procs at 32 bytes.
(Otherwise FPC's register allocator may treat them differently, as it may carry over register usage from the loops before.)

You could also artificially increase the code size of the loop by putting some extra stuff into it. Currently your loops are "benchmark loops": tiny loops that likely won't occur like this in real life, but that are sometimes highly optimized inside the CPU.

Also align all procs at 32 bytes anyway, in case that affects the speed.
Also compare the results if you re-order the tests.

I don't know what you expect from "constref": hiding an integer behind a pointer is an obvious slow-down (it prevents register usage and forces values into memory).

And sum up all the returned "L" values, and use the result.

I don't know for sure, but if you compile with enough optimization (DFA) then maybe FPC realizes the value is not used. If that happens, at the very least the 2nd "if" in the inlined regular function becomes dead code that might also be removed.

But then again, I don't know whether FPC will or will not do that.
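
A minimal sketch of the "sum up the returned values and use the result" advice, assuming a plain Int64 accumulator. The names `Min3` and `Sum` are illustrative and not taken from the original post. Because every returned value feeds a checksum that is printed, the calls cannot be discarded as unused.

```pascal
program ConsumeResults;
{$mode objfpc}{$H+}

// Illustrative helper (not from the original post): minimum of three values.
function Min3(A, B, C: Integer): Integer; inline;
begin
  Result := A;
  if B < Result then Result := B;
  if C < Result then Result := C;
end;

const
  CSize = 1024;
var
  Data: array[0..CSize - 1] of LongInt;
  i: SizeInt;
  Sum: Int64;
begin
  RandSeed := 42;                       // fixed seed: same data on every run
  for i := 0 to CSize - 1 do
    Data[i] := Random(10) - 5;

  Sum := 0;
  for i := 0 to CSize - 3 do
    Sum := Sum + Min3(Data[i], Data[i + 1], Data[i + 2]);

  // Printing the checksum makes every call observable.
  Writeln('Checksum: ', Sum);
end.
```

The same checksum can also be compared across the variants under test, which doubles as a cheap correctness check.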

LemonParty

Re: Benchmark regular vs SIMD vs constref
« Reply #2 on: May 21, 2026, 03:56:42 pm »
Retested with
Code: Pascal
{$CODEALIGN PROC=32}
{$CODEALIGN LOOP=32}
Quote
Regular    : 15163
Inline     : 5811
SIMD       : 14406
crInline   : 59455
crSIMD     : 10864

Regular    : 9274
Inline     : 5023
SIMD       : 14737
crInline   : 59480
crSIMD     : 14515

Regular    : 9668
Inline     : 6494
SIMD       : 15171
crInline   : 59813
crSIMD     : 11799

Regular    : 11279
Inline     : 5148
SIMD       : 14397
crInline   : 61215
crSIMD     : 14339

Regular    : 15199
Inline     : 8379
SIMD       : 16938
crInline   : 62300
crSIMD     : 11946
As you can see, the results do not differ much. I tested $CODEALIGN with a different project and didn't find any difference in performance. My theory is that these two directives matter on older CPUs or for code that doesn't fit into the cache.

I can add payload to the loops, and in theory this can force the compiler to use the stack for variables. But I consider such tests unfair, because besides the "clean" results of the test we would also be testing the compiler's ability to juggle registers, and these are two different things.

By the way, I forgot to mention that the benchmark was compiled with -O4.

Thaddy

  • Hero Member
  • *****
  • Posts: 19268
  • Glad to be alive.
Re: Benchmark regular vs SIMD vs constref
« Reply #3 on: May 21, 2026, 04:14:01 pm »
Using asm blocks never speeds things up based on alignment alone.
You should look at the generated code for pure Pascal only (-al).

The basm trap...

First think about why you want to use assembler, then check whether you need it, and remember that you will lose all platform compatibility. It takes only a little brainpower.
There are specialized cases where it still makes sense, but these are few.
« Last Edit: May 21, 2026, 05:06:17 pm by Thaddy »
objects are fine constructs. You can even initialize them with constructors.

Martin_fr

Re: Benchmark regular vs SIMD vs constref
« Reply #4 on: May 21, 2026, 04:35:54 pm »
I did a few runs of the code from your initial post: i7-8700K

Using fpc 3.2.3 and 3.3.1 (both about a month old)
** EDIT: O4 / but forgot to disable overflow/range checks / anyway: same for all

I wrapped all tests into an outer loop to get several runs of each loop.

Code: Text
3.2.3

Regular: 20678  | Inline: 11730  | SIMD: 23480  | crInline: 51540  | crSIMD: 19014
Regular: 21578  | Inline: 9815  | SIMD: 23556  | crInline: 51380  | crSIMD: 19104
Regular: 21515  | Inline: 9928  | SIMD: 23392  | crInline: 51444  | crSIMD: 16531
Regular: 21098  | Inline: 9396  | SIMD: 23638  | crInline: 51366  | crSIMD: 17960
Regular: 21058  | Inline: 9434  | SIMD: 23470  | crInline: 51410  | crSIMD: 18472
Regular: 21288  | Inline: 12284  | SIMD: 23344  | crInline: 51237  | crSIMD: 18888
Regular: 21554  | Inline: 9920  | SIMD: 23432  | crInline: 51162  | crSIMD: 17573
Regular: 21096  | Inline: 11271  | SIMD: 23736  | crInline: 51286  | crSIMD: 18010
Regular: 21041  | Inline: 9534  | SIMD: 23538  | crInline: 51621  | crSIMD: 18974
Regular: 21229  | Inline: 9977  | SIMD: 23337  | crInline: 51453  | crSIMD: 16693


{$CodeAlign loop=32}
{$CodeAlign proc=32}
Regular: 14834  | Inline: 9750  | SIMD: 14994  | crInline: 43819  | crSIMD: 19443
Regular: 14213  | Inline: 8868  | SIMD: 14129  | crInline: 43254  | crSIMD: 19754
Regular: 14570  | Inline: 8977  | SIMD: 14145  | crInline: 43204  | crSIMD: 18707
Regular: 14410  | Inline: 9171  | SIMD: 14289  | crInline: 43293  | crSIMD: 19551
Regular: 14495  | Inline: 8874  | SIMD: 14130  | crInline: 43328  | crSIMD: 22109
Regular: 14982  | Inline: 9687  | SIMD: 14466  | crInline: 43358  | crSIMD: 18933
Regular: 14688  | Inline: 8940  | SIMD: 14450  | crInline: 43543  | crSIMD: 19914
Regular: 14776  | Inline: 8987  | SIMD: 14109  | crInline: 43387  | crSIMD: 19755
Regular: 14714  | Inline: 9491  | SIMD: 14528  | crInline: 43304  | crSIMD: 18753
Regular: 14123  | Inline: 8859  | SIMD: 14113  | crInline: 45810  | crSIMD: 19768

---------------------------------------------------------------------

3.3.1
Regular: 19890  | Inline: 13821  | SIMD: 29349  | crInline: 40881  | crSIMD: 21268
Regular: 28364  | Inline: 10512  | SIMD: 21550  | crInline: 40747  | crSIMD: 21118
Regular: 19392  | Inline: 11322  | SIMD: 21870  | crInline: 41066  | crSIMD: 21568
Regular: 19098  | Inline: 11134  | SIMD: 21011  | crInline: 41023  | crSIMD: 21683
Regular: 18342  | Inline: 11229  | SIMD: 21008  | crInline: 40791  | crSIMD: 21455
Regular: 19971  | Inline: 11056  | SIMD: 21790  | crInline: 41027  | crSIMD: 20746
Regular: 18721  | Inline: 10267  | SIMD: 21055  | crInline: 41057  | crSIMD: 21794
Regular: 19703  | Inline: 10043  | SIMD: 21133  | crInline: 40714  | crSIMD: 21219
Regular: 19908  | Inline: 10389  | SIMD: 21654  | crInline: 40233  | crSIMD: 19953
Regular: 18733  | Inline: 9373  | SIMD: 21842  | crInline: 40763  | crSIMD: 21774

{$CodeAlign loop=32}
{$CodeAlign proc=32}
Regular: 17047  | Inline: 14625  | SIMD: 20181  | crInline: 34750  | crSIMD: 20127
Regular: 16712  | Inline: 14176  | SIMD: 18893  | crInline: 34183  | crSIMD: 21528
Regular: 17190  | Inline: 14536  | SIMD: 18819  | crInline: 34436  | crSIMD: 19536
Regular: 16476  | Inline: 24434  | SIMD: 19949  | crInline: 35120  | crSIMD: 20650
Regular: 16669  | Inline: 14193  | SIMD: 19134  | crInline: 40828  | crSIMD: 21218
Regular: 17122  | Inline: 14266  | SIMD: 18745  | crInline: 33824  | crSIMD: 21272
Regular: 17641  | Inline: 14815  | SIMD: 19955  | crInline: 34882  | crSIMD: 19266
Regular: 16461  | Inline: 14117  | SIMD: 18729  | crInline: 40766  | crSIMD: 21625
Regular: 17087  | Inline: 14423  | SIMD: 18850  | crInline: 33709  | crSIMD: 19727
Regular: 20145  | Inline: 14599  | SIMD: 19955  | crInline: 34380  | crSIMD: 19944



3.3.1  (swapped 1 and 2) // NO codealign // compare 2 above

| Inline: 14573 | Regular: 19866  | SIMD: 22257  | crInline: 40996  | crSIMD: 21712
| Inline: 14503 | Regular: 18826  | SIMD: 21027  | crInline: 40886  | crSIMD: 21792
| Inline: 14642 | Regular: 19279  | SIMD: 21038  | crInline: 40592  | crSIMD: 20421
| Inline: 14727 | Regular: 19968  | SIMD: 21780  | crInline: 41072  | crSIMD: 21125
| Inline: 14077 | Regular: 18729  | SIMD: 21504  | crInline: 40891  | crSIMD: 20453
| Inline: 14620 | Regular: 19438  | SIMD: 21153  | crInline: 40511  | crSIMD: 22407
| Inline: 14746 | Regular: 19910  | SIMD: 21115  | crInline: 40468  | crSIMD: 20482
| Inline: 14649 | Regular: 20250  | SIMD: 21806  | crInline: 41191  | crSIMD: 21343
| Inline: 14053 | Regular: 18698  | SIMD: 21122  | crInline: 41151  | crSIMD: 21239
| Inline: 14463 | Regular: 18809  | SIMD: 21015  | crInline: 40666  | crSIMD: 21940

There are differences with code align. Not in all columns: some may not be affected, or maybe they were aligned well before, or ... I also did not check which of the 2 aligns had how much impact.

There are also differences if I just change the order of the methods. (I only swapped inline and regular, but the "inline" timings changed, similar to the align changes.)
While that probably is just the alignment change, I didn't check what causes the difference when I changed the order. That is, I didn't check the generated asm: is it the same asm, or did FPC change the asm because other code was in front? I don't know.

----

And yes, each CPU version may react differently to code alignment...


Btw the 32-byte alignment isn't about the regular cache. It's for some internal cache for translated micro-ops... (or whatever they are named)

« Last Edit: May 21, 2026, 04:46:37 pm by Martin_fr »
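
A rough sketch of the "outer loop, several runs" setup described above, under stated assumptions: the measured loop sits in its own procedure (`RunInlineMin`, a hypothetical name), timing uses `GetTickCount64` from SysUtils instead of the thread's stopwatch unit, and `Passes` is an arbitrary repeat count chosen only so that one run is long enough for a millisecond timer.

```pascal
program RepeatRuns;
{$mode objfpc}{$H+}

uses
  SysUtils;  // GetTickCount64

const
  CSize  = 1024 * 1024;
  Passes = 200;   // assumption: enough passes for a millisecond-resolution timer
  Runs   = 5;

var
  Data: array of LongInt;
  Sum: Int64 = 0;

// The measured loop lives in its own procedure.
procedure RunInlineMin;
var
  p, i: SizeInt;
  m: LongInt;
begin
  for p := 1 to Passes do
    for i := 0 to CSize - 3 do
    begin
      m := Data[i];
      if Data[i + 1] < m then m := Data[i + 1];
      if Data[i + 2] < m then m := Data[i + 2];
      Sum := Sum + m;   // consume the result
    end;
end;

var
  r: Integer;
  i: SizeInt;
  t0, t, Best: QWord;
begin
  RandSeed := 42;
  SetLength(Data, CSize);
  for i := 0 to CSize - 1 do
    Data[i] := Random(10) - 5;

  Best := High(QWord);
  for r := 1 to Runs do
  begin
    t0 := GetTickCount64;
    RunInlineMin;
    t := GetTickCount64 - t0;
    if t < Best then Best := t;
    Writeln('Run ', r, ': ', t, ' ms');
  end;
  Writeln('Best of ', Runs, ': ', Best, ' ms, checksum ', Sum);
end.
```

Reporting the best of several runs is one common way to reduce noise from the scheduler and from frequency scaling. It does not remove alignment effects, which still have to be compared across builds.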

Martin_fr

Re: Benchmark regular vs SIMD vs constref
« Reply #5 on: May 21, 2026, 04:39:43 pm »
I just noticed... (but I will not run them again): It should run with a fixed random seed. Otherwise the timing may vary because of that too.
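
A small sketch of what a fixed seed buys, assuming the RTL generator restarts its sequence whenever `RandSeed` is assigned (the behaviour relied on here; `FillSeeded` and `TSample` are illustrative names). Filling two buffers after seeding with the same value should yield identical data.

```pascal
program FixedSeed;
{$mode objfpc}{$H+}

const
  N = 16;
type
  TSample = array[0..N - 1] of LongInt;

// Illustrative helper: seed the RTL generator, then fill Buf.
procedure FillSeeded(out Buf: TSample; Seed: Cardinal);
var
  i: Integer;
begin
  RandSeed := Seed;
  for i := 0 to N - 1 do
    Buf[i] := Random(10) - 5;
end;

var
  A, B: TSample;
  i: Integer;
  Same: Boolean;
begin
  FillSeeded(A, 42);
  FillSeeded(B, 42);
  Same := True;
  for i := 0 to N - 1 do
    if A[i] <> B[i] then
      Same := False;
  Writeln('Identical data for identical seed: ', Same);
end.
```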

ALLIGATOR

  • Sr. Member
  • ****
  • Posts: 437
  • I use FPC [main] 💪🐯💪
Re: Benchmark regular vs SIMD vs constref
« Reply #6 on: May 21, 2026, 04:50:13 pm »
Your comparison is flawed. What if B is the minimum value?

I also think your expectations are incorrect.

SIMD involves a single instruction processing multiple data points—meaning you process several values at once in a single call, but in roughly the same amount of time (roughly speaking).

In your case, both SIMD and regular comparison operations are executed the same number of times. That is, to see a performance boost, you need to load 4+4 values at a time and compare 4 at once.

Why are regular instructions faster? Most likely because, at the architectural level, the processor has more comparison units for regular registers than for vector registers.
I may seem rude - please don't take it personally

Martin_fr

Re: Benchmark regular vs SIMD vs constref
« Reply #7 on: May 21, 2026, 04:50:30 pm »
Code: Pascal
function MinRegular(A, B, C: Integer): Integer;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

You never do
Code: Pascal
Result := B;

ALLIGATOR

Re: Benchmark regular vs SIMD vs constref
« Reply #8 on: May 21, 2026, 05:01:24 pm »
I also believe that if you manually unroll the loop (by a factor of 2, 3, or 4) for standard calculations, you can achieve a further increase in speed
« Last Edit: May 21, 2026, 05:03:23 pm by ALLIGATOR »
I may seem rude - please don't take it personally
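
A hedged sketch of manual unrolling by a factor of 2 for the scalar version. The helper `Min3` and the checksum variables are illustrative. The unrolled loop handles two window positions per iteration, a remainder loop handles what is left, and both sums are printed so they can be compared.

```pascal
program UnrollBy2;
{$mode objfpc}{$H+}

// Illustrative helper: minimum of three values.
function Min3(A, B, C: Integer): Integer; inline;
begin
  Result := A;
  if B < Result then Result := B;
  if C < Result then Result := C;
end;

const
  CSize = 1024;
var
  Data: array[0..CSize - 1] of LongInt;
  i, Last: SizeInt;
  SumPlain, SumUnrolled: Int64;
begin
  RandSeed := 42;
  for i := 0 to CSize - 1 do
    Data[i] := Random(10) - 5;

  Last := CSize - 3;              // last valid window start

  // Plain loop: one window per iteration.
  SumPlain := 0;
  for i := 0 to Last do
    SumPlain := SumPlain + Min3(Data[i], Data[i + 1], Data[i + 2]);

  // Unrolled loop: two windows per iteration.
  SumUnrolled := 0;
  i := 0;
  while i + 1 <= Last do
  begin
    SumUnrolled := SumUnrolled + Min3(Data[i],     Data[i + 1], Data[i + 2]);
    SumUnrolled := SumUnrolled + Min3(Data[i + 1], Data[i + 2], Data[i + 3]);
    Inc(i, 2);
  end;
  // Remainder: at most one window is left.
  while i <= Last do
  begin
    SumUnrolled := SumUnrolled + Min3(Data[i], Data[i + 1], Data[i + 2]);
    Inc(i);
  end;

  Writeln('Plain    : ', SumPlain);
  Writeln('Unrolled : ', SumUnrolled);
end.
```

Whether this is actually faster depends on the compiler version and CPU, so it would need to be measured like the other variants.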

ASerge

  • Hero Member
  • *****
  • Posts: 2497
Re: Benchmark regular vs SIMD vs constref
« Reply #9 on: May 21, 2026, 06:06:14 pm »
The correct code is
Code: Pascal
function MinRegular(A, B, C: Integer): Integer;
begin
  Result := A;
  if B < Result then
    Result := B;
  if C < Result then
    Result := C;
end;

But MinInline is still faster because it never jumps:
Code: ASM
# [17] begin
# Var A located in register ecx
# Var B located in register edx
# Var C located in register r8d
# Var $result located in register eax
        movl    %ecx,%eax
        cmpl    %edx,%eax
        cmovgl  %edx,%eax
        cmpl    %r8d,%eax
        cmovgl  %r8d,%eax
        ret
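
As a quick sanity check of the corrected function, the following sketch compares it against an independently written reference over every triple in the benchmark's value range (-5..4). `RefMin3` and the loop variables are illustrative and not part of the thread.

```pascal
program CheckMin3;
{$mode objfpc}{$H+}

// Corrected three-way minimum, as posted above.
function MinRegular(A, B, C: Integer): Integer;
begin
  Result := A;
  if B < Result then
    Result := B;
  if C < Result then
    Result := C;
end;

// Independent reference (illustrative): explicit comparison chain.
function RefMin3(A, B, C: Integer): Integer;
begin
  if (A <= B) and (A <= C) then
    Result := A
  else if B <= C then
    Result := B
  else
    Result := C;
end;

var
  x, y, z, Failures: Integer;
begin
  Failures := 0;
  for x := -5 to 4 do
    for y := -5 to 4 do
      for z := -5 to 4 do
        if MinRegular(x, y, z) <> RefMin3(x, y, z) then
          Inc(Failures);
  Writeln('Mismatches: ', Failures);
end.
```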

LemonParty

Re: Benchmark regular vs SIMD vs constref
« Reply #10 on: May 21, 2026, 08:11:24 pm »
Sorry. I corrected the benchmark and added a fixed random seed:
Code: Pascal
{$mode objfpc}{$H+}
{$inline on}
{$If Defined(CPU386) OR Defined(CPUX64)}
  {$ASMMODE intel}
{$EndIf}
{$SMARTLINK ON}
{$Calling Register}
{$CODEALIGN PROC=32}
{$CODEALIGN LOOP=32}


uses stopwatch;

function MinRegular(A, B, C: Integer): Integer;
begin
  Result := A;
  if B < Result then
    Result := B;
  if C < Result then
    Result := C;
end;

function MinInline(A, B, C: Integer): Integer;inline;
begin
  Result := A;
  if B < Result then
    Result := B;
  if C < Result then
    Result := C;
end;

function MinSIMD(A, B, C: Integer): Integer;assembler;nostackframe;
asm
  movD xmm0,A
  movD xmm1,B
  movD xmm2,C
  pminSD xmm0,xmm1
  pminSD xmm0,xmm2
  movD eax,xmm0
end;

function crMinInline(constref A, B, C: Integer): Integer;inline;
begin
  Result := A;
  if B < Result then
    Result := B;
  if C < Result then
    Result := C;
end;

function crMinSIMD(constref A, B, C: Integer): Integer;assembler;nostackframe;
asm
  movD xmm0,dword ptr[A]
  movD xmm1,dword ptr[B]
  movD xmm2,dword ptr[C]
  pminSD xmm0,xmm1
  pminSD xmm0,xmm2
  movD eax,xmm0
end;

const
  CSize = 1024 * 1024;
var
  i: SizeInt;
  sw: TStopWatch;
  pL: PLongInt;
  L: LongInt;
begin
  sw:= TStopWatch.Create;

  RandSeed:= 42;

  pL:= GetMem(CSize * SizeOf(LongInt));

  for i:= 0 to CSize - 1 do
    pL[i]:= Random(10) - 5;

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinRegular(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('Regular    : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinInline(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('Inline     : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinSIMD(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('SIMD       : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= crMinInline(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('crInline   : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= crMinSIMD(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('crSIMD     : ', sw.ElapsedTicks);
end.

Results:
Code: Text
Regular    : 11133
Inline     : 10457
SIMD       : 15068
crInline   : 9728
crSIMD     : 14552

Regular    : 11679
Inline     : 7302
SIMD       : 14397
crInline   : 9635
crSIMD     : 10974

Regular    : 11320
Inline     : 7192
SIMD       : 15843
crInline   : 9821
crSIMD     : 14348

Regular    : 10247
Inline     : 6817
SIMD       : 15141
crInline   : 9279
crSIMD     : 11684

Regular    : 10251
Inline     : 10816
SIMD       : 15409
crInline   : 9544
crSIMD     : 14560
The difference between the simple inline version and the others is not big now. And the constref inline version is, for some reason, no longer drastically slower.

Quote
I also believe that if you manually unroll the loop (by a factor of 2, 3, or 4) for standard calculations, you can achieve a further increase in speed
Yes, unrolling loops will definitely bring more performance. But I started the benchmark with a case where unrolling is problematic. I will probably write a benchmark that operates on packed data. Then we will see.
« Last Edit: May 21, 2026, 08:14:00 pm by LemonParty »

LemonParty

Re: Benchmark regular vs SIMD vs constref
« Reply #11 on: May 21, 2026, 08:18:51 pm »
Quote
But MinInline is still faster because it never jumps:
Code: ASM
# [17] begin
# Var A located in register ecx
# Var B located in register edx
# Var C located in register r8d
# Var $result located in register eax
        movl    %ecx,%eax
        cmpl    %edx,%eax
        cmovgl  %edx,%eax
        cmpl    %r8d,%eax
        cmovgl  %r8d,%eax
        ret

Ha, I supposed the compiler would generate code with jumps. But the compiler turned out to be smart.

creaothceann

  • Sr. Member
  • ****
  • Posts: 375
Re: Benchmark regular vs SIMD vs constref
« Reply #12 on: May 21, 2026, 09:58:05 pm »
The data (1 Mi elements of 4 bytes each, i.e. 4 MiB) is larger than most CPUs' L1 cache (32 or 64 KiB), so you're partially testing the access speed of the outer cache levels. If you want to test the raw speed of the instructions, restrict the data to 32 KiB.

Then, go through the array multiple times instead of just once.

Code: Pascal
program Test_Min3;


{$mode ObjFPC}
{$H+}
{$inline on}

{$if defined(CPU386) or defined(CPUx64)}
        {$AsmMode Intel}
{$endif}

{$SmartLink on}
{$Calling Register}
{$CodeAlign PROC=32}
{$CodeAlign LOOP=32}

uses
        U_HRT;


function MinRegular(A, B, C : integer) : integer;
begin
        ;                    Result := A;
        if (B < Result) then Result := B;
        if (C < Result) then Result := C;
end;


function MinInline(A, B, C : integer) : integer;  Inline;
begin
        ;                    Result := A;
        if (B < Result) then Result := B;
        if (C < Result) then Result := C;
end;


function MinSIMD(A, B, C : integer) : integer;  Assembler;  NoStackFrame;
asm
        movD    xmm0, A
        movD    xmm1, B
        movD    xmm2, C
        pminSD  xmm0, xmm1
        pminSD  xmm0, xmm2
        movD    eax,  xmm0
end;


function crMinInline(constref A, B, C : integer) : integer;  Inline;
begin
        ;                    Result := A;
        if (B < Result) then Result := B;
        if (C < Result) then Result := C;
end;


function crMinSIMD(constref A, B, C : integer) : integer;  Assembler;  NoStackFrame;
asm
        movD    xmm0, dword ptr[A]
        movD    xmm1, dword ptr[B]
        movD    xmm2, dword ptr[C]
        pminSD  xmm0, xmm1
        pminSD  xmm0, xmm2
        movD    eax,  xmm0
end;


//------------------------------------------------------------------------------


const
        KiB = 1024;
//      MiB = 1024 * KiB;
//      GiB = 1024 * MiB;
//      TiB = 1024 * GiB;

        CSize      = 32 * KiB;
        Iterations = 100 * 1000;

var
        i, j :  SizeInt;
        pL   : PLongInt;
        L    :  LongInt;

begin
        RandSeed := 42;

        // CSize is used as an element count by the loops below, so allocate CSize LongInts
        // (128 KiB); GetMem(CSize) alone would be overrun by the fill loop.
        // For a true 32 KiB working set, use (32 * KiB) div SizeOf(LongInt) elements instead.
        pL := GetMem(CSize * SizeOf(LongInt));
        for i := 0 to CSize - 1 do  pL[i] := Random(10) - 5;

        Clock.Start;  for j := 1 to Iterations do  for i := 0 to (CSize - 3) do  L := MinRegular (pL[i], pL[i + 1], pL[i + 2]);  Clock.Stop;  Writeln('Regular  = ', Clock.Delta:1:3, ' seconds');
        Clock.Start;  for j := 1 to Iterations do  for i := 0 to (CSize - 3) do  L := MinInline  (pL[i], pL[i + 1], pL[i + 2]);  Clock.Stop;  Writeln('Inline   = ', Clock.Delta:1:3, ' seconds');
        Clock.Start;  for j := 1 to Iterations do  for i := 0 to (CSize - 3) do  L := MinSIMD    (pL[i], pL[i + 1], pL[i + 2]);  Clock.Stop;  Writeln('SIMD     = ', Clock.Delta:1:3, ' seconds');
        Clock.Start;  for j := 1 to Iterations do  for i := 0 to (CSize - 3) do  L := crMinInline(pL[i], pL[i + 1], pL[i + 2]);  Clock.Stop;  Writeln('crInline = ', Clock.Delta:1:3, ' seconds');
        Clock.Start;  for j := 1 to Iterations do  for i := 0 to (CSize - 3) do  L := crMinSIMD  (pL[i], pL[i + 1], pL[i + 2]);  Clock.Stop;  Writeln('crSIMD   = ', Clock.Delta:1:3, ' seconds');
        WriteLn;
        WriteLn('done');
        ReadLn;
        if (L = 0) then ;  // prevent compiler warning
end.

Code: Pascal
unit U_HRT;


// Clock based on the system-wide high-resolution timer.


{$mode ObjFPC}  // needed for Result, out parameters and exceptions when no -M switch is passed
{$ModeSwitch AdvancedRecords}

interface  /////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
uses
        {$ifdef Unix} Linux, UnixType, {$endif}
        SysUtils;

type
        Clock = record
                type
                        Seconds = Double;
                        Time    = Int64;

                class function GetTime         : Time;             static;
                class function Convert(const t : Time) : Seconds;  static;  inline;

                class procedure Start;            static;  inline;
                class procedure Stop;             static;  inline;
                class function  Delta : Seconds;  static;  inline;

                private

                class var
                        _InternalCounter : Time;
                        _TicksPerSecond  : Int64;

                class procedure _Init;  inline;  static;

                public

                class property Resolution : Int64 read _TicksPerSecond;
                end;


implementation  ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////


{$ifdef Windows}
function QueryPerformanceCounter  (out i : Clock.Time) : LongBool;  external 'kernel32' name 'QueryPerformanceCounter';
function QueryPerformanceFrequency(out i : Clock.Time) : LongBool;  external 'kernel32' name 'QueryPerformanceFrequency';
{$endif}


{$ifdef Unix}
// The counter combines seconds and nanoseconds of the monotonic clock, so it ticks 1e9 times per second.
// (Using TV_nsec alone would wrap around every second.)
function QueryPerformanceCounter  (out i : Clock.Time) : LongBool;  inline;  var t : TimeSpec;  begin  Result := (Clock_GetTime(Clock_Monotonic, @t) >= 0);  if Result then i := Int64(t.TV_sec) * 1000000000 + t.TV_nsec;  end;
function QueryPerformanceFrequency(out i : Clock.Time) : LongBool;  inline;                     begin  i := 1000000000;  Result := True;  end;
{$endif}


class procedure Clock._Init;           inline;  begin  if not QueryPerformanceFrequency(_TicksPerSecond) then raise Exception.Create('could not get clock resolution');  end;
class function  Clock.GetTime : Time;           begin  if not QueryPerformanceCounter  (Result         ) then raise Exception.Create('could not get clock tick'      );  end;


class procedure Clock.Start;                              inline;  begin  _InternalCounter := GetTime;                     end;
class procedure Clock.Stop;                               inline;  begin  _InternalCounter := GetTime - _InternalCounter;  end;
class function  Clock.Convert(const t : Time) : Seconds;  inline;  begin  Result           := t       / _TicksPerSecond;   end;
class function  Clock.Delta                   : Seconds;  inline;  begin  Result           := Convert(_InternalCounter);   end;


////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////


initialization
        Clock._Init;


end.

Code: [Select]
Regular  = 3.442 seconds
Inline   = 1.428 seconds
SIMD     = 4.771 seconds
crInline = 2.514 seconds
crSIMD   = 4.100 seconds

done

(Ryzen 7 7800X3D, AVX2, 32 KiB L1 cache + 1 MiB L2 cache per core)
« Last Edit: May 21, 2026, 10:12:12 pm by creaothceann »
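
A short sketch of sizing the buffer from a byte budget, since `GetMem` takes bytes while the loops index elements. `L1Budget` and `ElemCount` are illustrative names, and 32 KiB is just the L1 size assumed in the post above.

```pascal
program SizeBuffer;
{$mode objfpc}{$H+}

const
  KiB       = 1024;
  L1Budget  = 32 * KiB;                          // assumed L1 data cache size in bytes
  ElemCount = L1Budget div SizeOf(LongInt);      // number of LongInt elements that fit

var
  pL: PLongInt;
  i: SizeInt;
begin
  // Allocate in bytes, index in elements.
  pL := GetMem(ElemCount * SizeOf(LongInt));

  for i := 0 to ElemCount - 1 do
    pL[i] := Random(10) - 5;

  Writeln('Elements: ', ElemCount, ', bytes: ', ElemCount * SizeOf(LongInt));

  FreeMem(pL);
end.
```

With 4-byte elements this gives 8192 elements in 32768 bytes. By the same arithmetic, the original 1024 * 1024 elements occupy 4 MiB.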

MathMan

  • Hero Member
  • *****
  • Posts: 516
Re: Benchmark regular vs SIMD vs constref
« Reply #13 on: May 21, 2026, 10:27:48 pm »
SIMD is supposed to handle multiple data under the same instruction (as explained before) - so in a sense one could say 'you are holding it wrong'  ;) Take a look

Code: Pascal
{$mode objfpc}{$H+}
{$inline on}
{$If Defined(CPU386) OR Defined(CPUX64)}
  {$ASMMODE intel}
{$EndIf}
{$SMARTLINK ON}
{$Calling Register}

uses stopwatch;

function MinRegular(A, B, C: Integer): Integer;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

function MinInline(A, B, C: Integer): Integer;inline;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

function MinSIMD(A, B, C: Integer): Integer;assembler;nostackframe;
asm
  movD xmm0,A
  movD xmm1,B
  movD xmm2,C
  pminSD xmm0,xmm1
  pminSD xmm0,xmm2
  movD eax,xmm0
end;

function crMinInline(constref A, B, C: Integer): Integer;inline;
begin
  if A < B then
    Result:= A;
  if C < Result then
    Result:= C;
end;

function crMinSIMD(constref A, B, C: Integer): Integer;assembler;nostackframe;
asm
  movD xmm0,dword ptr[A]
  movD xmm1,dword ptr[B]
  movD xmm2,dword ptr[C]
  pminSD xmm0,xmm1
  pminSD xmm0,xmm2
  movD eax,xmm0
end;

function RealSIMD(A, B, C: pointer; Size: Integer): Integer;assembler;nostackframe;
asm
  movdqu xmm0, [A]
  movdqu xmm1, [B]
  movdqu xmm2, [C]
  jmp    @Check

  align  16
@Loop:

  pminsd xmm1, xmm0
  movdqu xmm0, [A+16]
  pminsd xmm2, xmm1
  movdqu xmm1, [B+16]
  movdqu [A], xmm2
  movdqu xmm2, [C+16]

  pminsd xmm1, xmm0
  movdqu xmm0, [A+32]
  pminsd xmm2, xmm1
  movdqu xmm1, [B+32]
  movdqu [A+16], xmm2
  movdqu xmm2, [C+32]

  lea    A, [A+32]
  lea    B, [B+32]
  lea    C, [C+32]

@Check:

  sub    Size, 8
  jnc    @Loop

  pminsd xmm1, xmm0
  pminsd xmm2, xmm1
  movdqu [A], xmm2
end;

const
  CSize = 1024 * 1024;
var
  i: SizeInt;
  sw: TStopWatch;
  pL: PLongInt;
  L: LongInt;
begin
  sw:= TStopWatch.Create;

  pL:= GetMem(CSize * SizeOf(LongInt));

  for i:= 0 to CSize - 1 do
    pL[i]:= Random(10) - 5;

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinRegular(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('Regular    : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinInline(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('Inline     : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= MinSIMD(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('SIMD       : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= crMinInline(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('crInline   : ', sw.ElapsedTicks);

  sw.Reset; sw.Start;
  for i:= 0 to CSize - 3 do
    L:= crMinSIMD(pL[i], pL[i+1], pL[i+2]);
  sw.Stop;
  Writeln('crSIMD     : ', sw.ElapsedTicks);

  sw.Reset();
  sw.Start();
  RealSIMD( @pL[ 0 ], @pL[ 1 ], @pL[ 2 ], 1048573 );
  sw.Stop();
  WriteLn( LineEnding, 'RealSIMD   : ', sw.ElapsedTicks);
end.

Still not optimal, as the data is mostly misaligned and the core loop is a bit hacked together. In theory it should achieve 4 times the speed of MinInline, and on my machine it gets quite close to this. If you used AVX2 or AVX-512 you could speed it up by another factor of 2-4.
« Last Edit: May 21, 2026, 10:30:25 pm by MathMan »
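
When experimenting with a vectorized routine like the one above, a pure-Pascal reference is handy for checking its output. The sketch below is an assumption-laden illustration, not part of the post: `SlidingMin3` is a hypothetical name, and unlike `RealSIMD` it writes to a separate destination buffer instead of overwriting the source.

```pascal
program SlidingMinRef;
{$mode objfpc}{$H+}

// Illustrative reference: Dst[i] := min(Src[i], Src[i+1], Src[i+2])
// for every i in 0 .. Count-3. Dst needs room for Count-2 elements.
procedure SlidingMin3(Src, Dst: PLongInt; Count: SizeInt);
var
  i: SizeInt;
  m: LongInt;
begin
  for i := 0 to Count - 3 do
  begin
    m := Src[i];
    if Src[i + 1] < m then m := Src[i + 1];
    if Src[i + 2] < m then m := Src[i + 2];
    Dst[i] := m;
  end;
end;

const
  Sample: array[0..5] of LongInt = (3, -1, 4, 1, -5, 2);
var
  Mins: array[0..3] of LongInt;
  i: Integer;
begin
  SlidingMin3(@Sample[0], @Mins[0], Length(Sample));
  for i := 0 to High(Mins) do
    Write(Mins[i], ' ');
  Writeln;   // prints: -1 -1 -5 -5
end.
```

Running the vectorized code on a copy of the input and comparing element by element against this reference would catch both lane mix-ups and problems at the tail of the array.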

 
