Recent

Author Topic: Bounties to speed up FPC  (Read 6635 times)

Okoba

  • Hero Member
  • *****
  • Posts: 617
Bounties to speed up FPC
« on: February 01, 2025, 12:25:34 pm »
Hello,

There are a couple of issues that are slowing down the result code of FPC that I and certainly others would like them to be solved.
To help this process of optimization, I like to offer bounties for several tasks.

- The mentioned issue needs to be solved, accepted by the FPC team, and merged into the main branch.
- The solution needs to work on supported targets, especially x86 and ARM, 64-bit.
- No personal information is needed, and the payment will be done via USDT instantly after the code gets merged.
- I will be happy to test any suggested code. Please use the issues to communicate on the issues and this topic or private message about bounties.
- Anyone is welcome to participate in solving these or contribute financially.


Optimizing private values ($200): https://gitlab.com/freepascal.org/fpc/source/-/issues/39888
Optimizing out and var parameters ($100): https://gitlab.com/freepascal.org/fpc/source/-/issues/41128


ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #1 on: February 01, 2025, 05:02:16 pm »
The second task, with three tests on my machine gives three identical results:
Code: Pascal  [Select][+][-]
  1. 515
  2. 515
  3. 515
i7-8750H

Also, on modern Windows, GetTickCount[64] resolution = 16ms, unlike the Now function, it has a higher Windows timer resolution, I don't remember what function it calls internally, but it will probably have a resolution of 1ms
So I would redo the tests like this (although it won't affect anything significantly):

Code: Pascal  [Select][+][-]
  1. uses
  2.   SysUtils;
  3. var
  4.   t: TDateTime;
  5.  
  6. begin
  7.   t:=Now;
  8.   ...
  9.   WriteLn((Now-t)*MSecsPerDay:0:0);

Okoba

  • Hero Member
  • *****
  • Posts: 617
Re: Bounties to speed up FPC
« Reply #2 on: February 01, 2025, 05:03:23 pm »
Are you using release mode and -O3?

ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #3 on: February 01, 2025, 05:46:26 pm »
Are you using release mode and -O3?
Yes, FPC[main] + -O2/O3/O4 - same result


Another thing about optimizing the work with records - it seems that now any records with the number of elements (of simple type, at least) >1 are not optimized to the register

Here's an example:

Code: Pascal  [Select][+][-]
  1. program test;
  2. {$mode objfpc}
  3. {$optimization on}
  4.  
  5. uses
  6.   SysUtils;
  7.  
  8. const size = 128 * 1024 * 1024;
  9.  
  10. type
  11.   TRecord1 = packed record
  12.     I: Int16;
  13.   end;
  14.   TRecord2 = packed record
  15.     I, II: Int16;
  16.   end;
  17.  
  18. generic procedure bench<T>;
  19. var
  20.   R: T;
  21.   cnt: SizeInt;
  22.   timer: TDateTime;
  23. begin
  24.   timer := Now;
  25.   for cnt:=size downto 0 do
  26.     inc(R.I);
  27.   WriteLn((Now - timer)*MSecsPerDay:0:0, ' ms.');
  28. end;
  29.  
  30. begin
  31.   specialize bench<TRecord1>; // 35 ms.
  32.   specialize bench<TRecord2>; // 185 ms.
  33.   ReadLn;
  34. end.
  35.  

Okoba

  • Hero Member
  • *****
  • Posts: 617
Re: Bounties to speed up FPC
« Reply #4 on: February 01, 2025, 06:25:00 pm »
#39888: For records with one field, it was solved by @CuriousKit at https://gitlab.com/freepascal.org/fpc/source/-/issues/37343
But for a real world work, multiple fields are a must.

#41128: Testing on multiple machines is still different, and TestResult is almost twice faster. Your times are similar to running on default with no optimization. I am using Windows 64bit.

ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #5 on: February 01, 2025, 06:36:01 pm »
Your times are similar to running on default with no optimization. I am using Windows 64bit.

Code: Pascal  [Select][+][-]
  1. -O0, -O1 (same):
  2. 1719
  3. 1719
  4. 1718
  5.  
  6. -O2,-O3,-O4 (same):
  7. 515
  8. 516
  9. 515

FPC[main] x64, Win11
« Last Edit: February 01, 2025, 06:38:48 pm by ALLIGATOR »

Okoba

  • Hero Member
  • *****
  • Posts: 617
Re: Bounties to speed up FPC
« Reply #6 on: February 01, 2025, 07:06:11 pm »
Here is what I get from -al on a FPC from a minute ago. Clear difference in the generated assembly and still TestResult is almost twice faster. Can you check if it is the same for you?

Code: Pascal  [Select][+][-]
  1. # Var A located at rsp+0, size=OS_S32
  2. # Var B located at rsp+4, size=OS_S32
  3. # Var I located in register eax
  4. # [18] for I := 1 to 1000 * 1000 * 1000 do
  5.         movl    $1,%eax
  6.         .p2align 4,,10
  7.         .p2align 3
  8. .Lj7:
  9. # [19] NextOut(I, A, B);
  10.         leal    1(%eax),%edx
  11.         movl    %edx,(%rsp)
  12.         leal    2(%eax),%edx
  13.         movl    %edx,4(%rsp)
  14.         addl    $1,%eax
  15.         cmpl    $1000000000,%eax
  16.         jng     .Lj7
  17. # [20] end;

Code: Pascal  [Select][+][-]
  1. # Var A located at rsp+0, size=OS_S32
  2. # Var B located at rsp+4, size=OS_S32
  3. # Var I located in register eax
  4. # [32] for I := 1 to 1000 * 1000 * 1000 do
  5.         movl    $1,%eax
  6.         .p2align 4,,10
  7.         .p2align 3
  8. .Lj15:
  9. # [33] NextVar(I, A, B);
  10.         leal    1(%eax),%edx
  11.         movl    %edx,(%rsp)
  12.         leal    2(%eax),%edx
  13.         movl    %edx,4(%rsp)
  14.         addl    $1,%eax
  15.         cmpl    $1000000000,%eax
  16.         jng     .Lj15
  17. # [34] end;

Code: Pascal  [Select][+][-]
  1. # Var A located in register edx
  2. # Var B located in register ecx
  3. # [50] begin
  4. # Var I located in register r8d
  5. # [51] for I := 1 to 1000 * 1000 * 1000 do
  6.         movl    $1,%r8d
  7.         .p2align 4,,10
  8.         .p2align 3
  9. .Lj25:
  10. # [53] A := NextA(I);
  11.         leal    1(%r8d),%edx
  12. # [54] B := NextB(I);
  13.         leal    2(%r8d),%ecx
  14.         addl    $1,%r8d
  15.         cmpl    $1000000000,%r8d
  16.         jng     .Lj25
  17. .Lc23:
  18. # [56] end;

ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #7 on: February 01, 2025, 07:11:46 pm »
Another interesting case where the optimization of one simple field breaks down:

Code: Pascal  [Select][+][-]
  1. program test;
  2. {$mode objfpc}
  3. {$optimization on}
  4.  
  5. uses
  6.   SysUtils;
  7.  
  8. const size = 128 * 1024 * 1024;
  9.  
  10. type
  11.   TRecord1 = packed record
  12.     I: Int16;
  13.   end;
  14.  
  15. procedure bench1;
  16. var
  17.   R: TRecord1;
  18.   cnt: SizeInt;
  19.   timer: TDateTime;
  20. begin
  21.   timer := Now;
  22.   for cnt:=size downto 0 do
  23.     inc(R.I);
  24.   WriteLn((Now - timer)*MSecsPerDay:0:0, ' ms.');
  25.  
  26.   timer := Now;
  27.   for cnt:=size downto 0 do
  28.     inc(R.I);
  29.   WriteLn((Now - timer)*MSecsPerDay:0:0, ' ms.');
  30. end;
  31.  
  32. procedure bench2;
  33. var
  34.   R: TRecord1;
  35.   cnt: SizeInt;
  36.   timer: TDateTime;
  37. begin
  38.   timer := Now;
  39.   for cnt:=size downto 0 do
  40.     inc(R.I);
  41.   WriteLn((Now - timer)*MSecsPerDay:0:0, ' ms.');
  42.  
  43.   TRecord1((@R)^).I:=0;
  44. end;
  45.  
  46. begin
  47.   bench1;
  48.   bench2;
  49.   //36
  50.   //36
  51.   //186
  52.   ReadLn;
  53. end.
  54.  
  55.  

ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #8 on: February 01, 2025, 07:22:40 pm »
Can you check if it is the same for you?

TestOut:
Code: ASM  [Select][+][-]
  1. lea edx,[eax+$01]
  2. mov [rsp],edx
  3. lea edx,[eax+$02]
  4. mov [rsp+$04],edx
  5.  
  6. add eax,$01
  7. cmp eax,$3B9ACA00
  8. jle -$19

TestVar:
Code: ASM  [Select][+][-]
  1. lea edx,[eax+$01]
  2. mov [rsp],edx
  3. lea edx,[eax+$02]
  4. mov [rsp+$04],edx
  5.  
  6. add eax,$01
  7. cmp eax,$3B9ACA00
  8. jle -$19
  9.  

TestResult:
Code: ASM  [Select][+][-]
  1. lea edx,[r8d+$01]
  2. lea ecx,[r8d+$02]
  3.  
  4. add r8d,$01
  5. cmp r8d,$3B9ACA00
  6. jle -$17
  7.  

Okoba

  • Hero Member
  • *****
  • Posts: 617
Re: Bounties to speed up FPC
« Reply #9 on: February 01, 2025, 07:30:14 pm »
Thank you. So they are different and not using registers as I think they should. I do not know why the times are the same for you.

ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #10 on: February 01, 2025, 07:34:14 pm »
So they are different and not using registers as I think they should.
I don't get you...

I do not know why the times are the same for you.
This is most likely a feature of the processor architecture, you have a more modern processor, judging by the timing

LV

  • Sr. Member
  • ****
  • Posts: 283
Re: Bounties to speed up FPC
« Reply #11 on: February 01, 2025, 08:33:49 pm »
From https://gitlab.com/freepascal.org/fpc/source/-/issues/39888 and add Test3

Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$MODE Delphi}
  4.  
  5. uses
  6.   SysUtils;
  7.  
  8. type
  9.  
  10.   TTest = record
  11.     V: integer;
  12.     V2: integer;
  13.   end;
  14.  
  15.   procedure Test1;
  16.   var
  17.     I: integer;
  18.     A: TTest;
  19.   begin
  20.     A := Default(TTest);
  21.     for I := 1 to 1000 * 1000 * 1000 do
  22.       A.V += 1;
  23.   end;
  24.  
  25.   procedure Test2;
  26.   var
  27.     I, Temp: integer;
  28.     A: TTest;
  29.   begin
  30.     A := Default(TTest);
  31.     Temp := A.V;
  32.     for I := 1 to 1000 * 1000 * 1000 do
  33.       Temp += 1;
  34.     A.V := Temp;
  35.   end;
  36.  
  37.   procedure Test3;
  38.   var
  39.     I: integer;
  40.     A: ^TTest;
  41.     Temp: int64;
  42.   begin
  43.     New(A);
  44.     A^ := Default(TTest);
  45.  
  46.     Temp := A^.V;
  47.     for I := 1 to 1000 * 1000 * 1000 do
  48.       Temp += 1;
  49.  
  50.     A^.V := Temp;
  51.     Dispose(A);
  52.   end;
  53.  
  54. var
  55.   T: QWord;
  56. begin
  57.   T := GetTickCount64;
  58.   Test1;
  59.   WriteLn(GetTickCount64 - T); // 500
  60.  
  61.   T := GetTickCount64;
  62.   Test2;
  63.   WriteLn(GetTickCount64 - T); // 484
  64.  
  65.   T := GetTickCount64;
  66.   Test3;
  67.   WriteLn(GetTickCount64 - T); // 235
  68.  
  69.   ReadLn;
  70. end.
  71.  

Notebook AMD Ryzen 7 4700U
Windows 11
Lazarus 3.4 (rev lazarus_3_4) FPC 3.2.2 x86_64-win64-win32/win64
O2 and release-mode

test1 500; test2 484; test3 235 ms.

Using a pointer significantly improves performance.

PS. Not for profit, but for enjoyment. ;)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11303
  • Debugger - SynEdit - and more
    • wiki
Re: Bounties to speed up FPC
« Reply #12 on: February 01, 2025, 08:39:36 pm »
Just a few words on testing/benchmarking.

Comparing performance can (as in "may possible") be influenced by unexpected side effects.

I have once been personally hit by that (some test I did on utf8 processing / don't have it any more):
- So I had 3 or 4 implementations => One of them was like 30 to 40% faster than the others => great.
- Then I changed something, but something that did NOT change the generated asm for any of the methods, nor for the benchmark.
  I think I may have changed the order in which the functions were listed in the source, or I removed one of the slow ones?
- The super fast one was suddenly slower (it was still the fastest, but by maybe only 5 to 10%)

After some research I found, that the change I made, had changed alignments. That is the exact same asm sequence simple had moved to start at a different address in memory => and that alone changed speed.

The exact effect may depend a lot on the exact CPU used... But for many Intel CPU loops (especially small loops) can be optimized, by some cache for the internal translation inside the CPU (each asm command runs some microcode in the CPU). And that is cached, and depends on alignment.

There may be other such "CPU internals" that can affect a benchmark.

But that means, that any benchmark can be off by a mid 2 digit percentage, by a change that isn't the actually tested difference.




To counter this it may be needed to have variations of the benchmark. Change compiler set alignments, change ordering of various elements / code blocks) => changes that should not make a difference.

Also benchmarks can have artificial data, with memory layout that does not reflect real scenarios. This affects caching, and can boost any one of the tested scenarios in a way they would not experience in real life.

Next, I don't know, but I would guess being worth to check, is branch prediction.
- Afaik it may use some cache of previous executions? Then running a tight loop would help the branch predictor => meaning: Newly introduced extra conditional branches would be measured to take less time, than they may in reality
- Benchmarks may call code in ways that prefers specific branch decisions much more, than real live execution (or bundle them / make them predictable)

...


« Last Edit: February 01, 2025, 08:42:54 pm by Martin_fr »

Okoba

  • Hero Member
  • *****
  • Posts: 617
Re: Bounties to speed up FPC
« Reply #13 on: February 02, 2025, 08:00:04 am »
@ALLIGATOR the result assembly is not as optimized as it could be as it does not use the registers.
@LV It is surprising that why Test2 is slower for you and why Test3 is better. I suggest adding this to the issue.
@Martin_fr You are right, but the real world case was this slow and producing not optimal assembly. I made these tests to showcase it.

Honestly, these can be done more easlity by a C compiler, but I do not like the syntax and this Pascal is much more lovely to me. If you have suggestions on the bounties, please share.

ALLIGATOR

  • Full Member
  • ***
  • Posts: 173
Re: Bounties to speed up FPC
« Reply #14 on: February 02, 2025, 08:13:17 am »
Perhaps if you use the LLVM backend, the code will look more optimal, but... Using LLVM doesn't seem to be something very easy (

It would be very convenient if you could freely switch between the LLVM backend and the regular native backend, and maybe even somehow adjust the way the unit is compiled by the unit

And right now the LLVM backend is only supported on Linux

 

TinyPortal © 2005-2018