Recent

Author Topic: measuring fpc codegen speed  (Read 1071 times)

fcu

  • Jr. Member
  • **
  • Posts: 90
measuring fpc codegen speed
« on: March 03, 2023, 08:42:52 pm »
hi
i wanted to see the difference between fpc releases in term of godegen speed , so i tried measuring png image loading time ( decoding  time ) using fcl-image  , the png is ~5mb , the thing is there is no significant difference , only tiny milliseconds ( fpc versions was 2.6 till 3.3.1 )  , so i am wondering if fpc codegen optimizer has been improved over the releases !

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9869
  • Debugger - SynEdit - and more
    • wiki
Re: measuring fpc codegen speed
« Reply #1 on: March 03, 2023, 09:30:34 pm »
Well... There is work on the optimizer in 3.3.1 (i.e. current git main branch). Afaik a lot on generating better assembler code. Not sure what/if changes on the high level.
And well, you may or may not benefic from it.

There is a lot more (a hell lot more) to the speed of your app, than the speed at which the code itself can be processed.

To start with, modern cpu do quite some part of the optimization themself. That is, even if you have unoptimized code your cpu may run it as fast as optimized code in some (many?) cases.... Though of course that depends on the optimizations that are performed. Better assembler code makes often just marginal differences.

What can differ (and I don't know where FPC stands on it) is optimizing the logic/program flow. Such as moving code that is inside a loop, to outside the loop. In other words rewrite your code for you.

But more - and I guess your png example may partly fall into this - code is not the only factor. Data needs to be optimized too. If there is huge amount of data to be processed then it needs to be loaded (from mem to cache). And if the code run faster (or could run faster) than the time needed to load the data into the cache (and if potentially you have lots of cache misses) then speeding up the code does nothing. The data must be stored in a way that it makes maximum use of each cache line.

And lots of other stuff.
A fun bit of background https://www.youtube.com/watch?v=r-TLSBdHe1A


Leledumbo

  • Hero Member
  • *****
  • Posts: 8757
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: measuring fpc codegen speed
« Reply #2 on: March 03, 2023, 10:07:37 pm »
so i tried measuring png image loading time ( decoding  time ) using fcl-image
Kinda bad selection, the package's slow implementation will bottleneck it. No codegen improvement can save an intentionally chosen to be slow (for portability reason) implementation. If you want to see the actual improvement, use CPU bound codes. After all, codegen is about what the CPU will execute when the code is run. You can pick the benchmark game codes instead. You also don't mention any use of optimization switches, the default is -O0 that doesn't do any optimization, almost like a 1:1 mapping, will not even optimize x := x + 1 statement:
Code: [Select]
# default -O0
  24   │ # [4] x := x + 1;
  25   │     movl    U_$P$PROGRAM_$$_X,%eax
  26   │     leal    1(%eax),%eax
  27   │     movl    %eax,U_$P$PROGRAM_$$_X
 # -O1
  24   │ # [4] x := x + 1;
  25   │     addl    $1,U_$P$PROGRAM_$$_X
Try at least -O2 (-O3 will be better, but -O4 may cause unwanted side effects), 2.6 and 3.2 will show quite some improvements in certain types of codes.

fcu

  • Jr. Member
  • **
  • Posts: 90
Re: measuring fpc codegen speed
« Reply #3 on: March 04, 2023, 10:38:51 am »
thanks
i choose fcl-image because its pure pascal and png decoder has alot of calculations, and the implementation still the same , so if the compiler is improved sure it will do better work and the speed would be tangible.

but iam not sure if fcl-image package was compiled with optimization enabled or not !

@Leledumbo the test was done with -O3 switch   

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9869
  • Debugger - SynEdit - and more
    • wiki
Re: measuring fpc codegen speed
« Reply #4 on: March 04, 2023, 12:05:51 pm »
Most of the optimizations I have seen being added in 3.3.1 are for better assembler code.

E.g
- replacing "conditional jump" with "conditional set value" (which means the cpu wont have to predict the jump)
- changing the order of 2 statements, to allow the CPU to compute ahead more statements  (register-rename, pipeline-stalls, ...)

All those will gain time, if the CPU did not find ways to optimize the lesser code on it's own.

And also the time gain is not that big. Well, if you put a statement that benefits, into a loop (1 to 1 million), and nothing else is in that loop, then it's noticeable. In real life, the code that benefits makes a few percent of your app's code, and if a few percent are speed up just a little, the overall app wont have much of a measurable gain.


The other form of *code* optimization is changes to the actual code before (during) compiling it.

Code: Pascal  [Select][+][-]
  1. for a := 1 to 100000 do begin b := a * x; writeln(b); end;
could become
Code: Pascal  [Select][+][-]
  1. b := x; for a := 1 to 100000 do begin writeln(b); b := b + 1; end;
The addition computes faster than the multiplication.
(In the example the slow "writeln" will eat most of the time / In real life this may gain some speed, but depends... If the cpu was able to do other work of the loop, while the multiplication was done, then the benefit may be small(er))

Such code exists for example when accessing array elements.

I don't know what FPC does on this sort of code.... It is possible that there is still some potential for fpc improvements.

Long ago, I did this by hand (among other stuff). https://gitlab.com/freepascal.org/fpc/source/-/issues/10275
There are rewritten code examples, that manually do such implementations. Of course read-ability of the code suffers a lot.
But that particular code gained IIRC approx 40% speed.
(Though that was a very old fpc version, and I tricked fpc into doing some register optimization, that it may nowadays do without tricks, yet some of those code changes may still make a difference on similar code when using a current fpc)


Then there is stuff the compiler generally wont do for you. The choice of algorithm. Using sorted data and do binary search, or even doing hash lookups => that can speed up an app by several orders of magnitude.



And yet then again, as I said: data in memory needs to be optimized too. Or the order in which it is accessed. And I don't know if any compiler will do much of that for you.

As an example google "optimizing matrix multiplication".
If you just do nested loops over the data, you get a lot of access to memory addresses far away from each other. That means you completely loose the benefit from holding data in the cpu cache. And any memory operations that can not be cached, is slow. For large data that can be truly significant.



Having said all that, there is room in fpc for more optimizations.
And maybe, or maybe not, there are a few things left that may gain you more than 2 or 3 percent (on very specific code).

But my experience is that today's fpc allows you to write very well performing code already.



« Last Edit: March 04, 2023, 12:09:26 pm by Martin_fr »

Awkward

  • Full Member
  • ***
  • Posts: 135
Re: measuring fpc codegen speed
« Reply #5 on: March 04, 2023, 12:23:30 pm »

Code: Pascal  [Select][+][-]
  1. for a := 1 to 100000 do begin b := a * x; writeln(b); end;
could become
Code: Pascal  [Select][+][-]
  1. b := x; for a := 1 to 100000 do begin writeln(b); b := b + 1; end;

sorry, did you noticed what your 2nd code line is not equivalent of 1st code line?

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9869
  • Debugger - SynEdit - and more
    • wiki
Re: measuring fpc codegen speed
« Reply #6 on: March 04, 2023, 12:34:11 pm »

Code: Pascal  [Select][+][-]
  1. for a := 1 to 100000 do begin b := a * x; writeln(b); end;
could become
Code: Pascal  [Select][+][-]
  1. b := x; for a := 1 to 100000 do begin writeln(b); b := b + 1; end;

sorry, did you noticed what your 2nd code line is not equivalent of 1st code line?

Ok, typo, should be "+x" instead of "+1"
But the idea is still the same.

Paolo

  • Hero Member
  • *****
  • Posts: 510
Re: measuring fpc codegen speed
« Reply #7 on: March 04, 2023, 12:36:38 pm »
maybe the second one should be...

Code: Pascal  [Select][+][-]
  1. b := x;
  2. for a := 1 to 100000 do begin
  3.   writeln(b);
  4.   b := b + X; //<---- add X here
  5. end;
  6.  

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11452
  • FPC developer.
Re: measuring fpc codegen speed
« Reply #8 on: March 04, 2023, 12:46:34 pm »
Also, the fcl-image loaders are very generic, and therefore won't benefit (much) from optimization

dseligo

  • Hero Member
  • *****
  • Posts: 1221
Re: measuring fpc codegen speed
« Reply #9 on: March 04, 2023, 01:17:48 pm »
maybe the second one should be...

Code: Pascal  [Select][+][-]
  1. b := x;
  2. for a := 1 to 100000 do begin
  3.   writeln(b);
  4.   b := b + X; //<---- add X here
  5. end;
  6.  

It should be like this to be equivalent :)

Code: Pascal  [Select][+][-]
  1. b := 0;
  2. for a := 1 to 100000 do begin
  3.   b := b + x;
  4.   writeln(b);
  5. end;

Paolo

  • Hero Member
  • *****
  • Posts: 510
Re: measuring fpc codegen speed
« Reply #10 on: March 04, 2023, 06:07:01 pm »
yes.
the code was equivalent for the output purpose, not for the B content  :D

korba812

  • Sr. Member
  • ****
  • Posts: 394
Re: measuring fpc codegen speed
« Reply #11 on: March 04, 2023, 06:08:09 pm »
hi
i wanted to see the difference between fpc releases in term of godegen speed , so i tried measuring png image loading time ( decoding  time ) using fcl-image  , the png is ~5mb , the thing is there is no significant difference , only tiny milliseconds ( fpc versions was 2.6 till 3.3.1 )  , so i am wondering if fpc codegen optimizer has been improved over the releases !
Maybe the deflate implementation is not efficient and any optimization on the compiler side won't help anything?

lagprogramming

  • Sr. Member
  • ****
  • Posts: 406
Re: measuring fpc codegen speed
« Reply #12 on: March 05, 2023, 11:29:49 am »
1/4 packages/fcl-image/src/fpreadpng.pp contains the following procedure:
Code: Pascal  [Select][+][-]
  1. procedure TFPReaderPNG.InternalRead (Str:TStream; Img:TFPCustomImage);
  2. begin
  3.   {$ifdef FPC_Debug_Image}
  4.   if Str<>TheStream then
  5.     writeln('WARNING: TFPReaderPNG.InternalRead Str<>TheStream');  
  6.   {$endif}
  7.   with Header do
  8.     Img.SetSize (Width, Height);
  9.   ZData := TMemoryStream.Create;
  10.   try
  11.     EndOfFile := false;
  12.     while not EndOfFile do
  13.       begin
  14.       ReadChunk;
  15.       HandleChunk;
  16.       end;
  17.     ZData.position:=0;
  18.     Decompress := TDecompressionStream.Create (ZData);
  19.     try
  20.       DoDecompress;
  21.     finally
  22.       Decompress.Free;
  23.     end;
  24.   finally
  25.     ZData.Free;
  26.     if not img.UsePalette and assigned(FPalette) then
  27.       begin
  28.       FreeAndNil(FPalette);
  29.       end;
  30.   end;
  31. end;

I think the source code can be improved by replacing the "while" loop with a "repeat" loop like:
Code: Pascal  [Select][+][-]
  1. repeat
  2. ReadChunk;
  3. HandleChunk;
  4. until EndOfFile;

2/4 There are a couple of reasons why fcl-image is slow, few of them having to do with compiler optimizations. A significant improvement would be to get rid of those "swap" functions in fpimgcmn.pp and "swap32" in qoicomn.pas by replacing their usage with fpc's optimized versions named "SwapEndian". The code would be cleaner, faster and easier to maintain. It shouldn't be a big deal to make those changes. Most likely, a bigger problem is to test those changes in order to make sure everything is working properly as new bugs can be easily added.
As a side note, the implementation part of packages/fcl-image/src/fpreadxwd.pas starts with:
Code: Pascal  [Select][+][-]
  1. //==============================================================================
  2. // Endian utils
  3. //
  4. // Copied from LCLProc unit
  5. //==============================================================================
  6. {$push}{$R-}
  7. function BEtoN(const AValue: DWord): DWord;
  8. begin
  9.   {$IFDEF ENDIAN_BIG}
  10.     Result := AValue;
  11.   {$ELSE}
  12.     Result := (AValue shl 24)
  13.            or ((AValue and $0000FF00) shl 8)
  14.            or ((AValue and $00FF0000) shr 8)
  15.            or (AValue shr 24);
  16.   {$ENDIF}
  17. end;
  18. {$pop}  
Obviously, same as with the "swap" functions, an improvement would be to get rid of this one, too.

3/4 packages/fcl-image/src/fpreadpng.pp contains class function TFPReaderPNG.InternalSize(Str: TStream): TPoint; and function TFPReaderPNG.InternalCheck (Str:TStream) : boolean;. These two functions compare 8 byte blocks byte by byte using inefficient "for" loops.

4/4 packages/fcl-image/src/fpreadpng.pp contains procedure TFPReaderPNG.HandleAlpha;, which calls the "move" procedure to copy word sized values. The preparation to call the "move" procedure is more expensive than the actual 2 bytes copy.

There are additional ways to make fcl-image run faster without improving the compiler but as we can see, significant code changes are needed.

 

TinyPortal © 2005-2018