
measuring fpc codegen speed


fcu:
hi
i wanted to see the difference between FPC releases in terms of codegen speed, so i tried measuring PNG image loading time (decoding time) using fcl-image. the PNG is ~5 MB. the thing is, there is no significant difference, only a few milliseconds (the FPC versions tested were 2.6 through 3.3.1), so i am wondering whether the FPC codegen optimizer has actually been improved over the releases!

Martin_fr:
Well... There is work on the optimizer in 3.3.1 (i.e. the current git main branch). Afaik a lot of it is about generating better assembler code. Not sure what, if anything, changed at the high level.
And well, you may or may not benefit from it.

There is a lot more (a hell of a lot more) to the speed of your app than the speed at which the code itself can be executed.

To start with, modern CPUs do quite some part of the optimization themselves. That is, even if you have unoptimized code, your CPU may run it as fast as optimized code in some (many?) cases... Though of course that depends on which optimizations are performed. Better assembler code often makes only a marginal difference.

What can differ (and I don't know where FPC stands on it) is optimizing the logic/program flow, such as moving code from inside a loop to outside the loop. In other words, rewriting your code for you.

But there is more, and I guess your PNG example may partly fall into this: code is not the only factor. Data needs to be optimized too. If there is a huge amount of data to be processed, it needs to be loaded (from memory into the cache). And if the code runs faster (or could run faster) than the time needed to load the data into the cache (and if you potentially have lots of cache misses), then speeding up the code does nothing. The data must be stored in a way that makes maximum use of each cache line.

And lots of other stuff.
A fun bit of background https://www.youtube.com/watch?v=r-TLSBdHe1A

Leledumbo:

--- Quote from: fcu on March 03, 2023, 08:42:52 pm ---so i tried measuring png image loading time ( decoding  time ) using fcl-image

--- End quote ---
Kinda a bad selection: the package's slow implementation will be the bottleneck. No codegen improvement can save an implementation that was intentionally chosen to be slow (for portability reasons). If you want to see the actual improvement, use CPU-bound code. After all, codegen is about what the CPU will execute when the code is run. You can pick the benchmark game codes instead. You also don't mention any optimization switches; the default is -O0, which does not optimize at all (almost a 1:1 mapping) and will not even optimize an x := x + 1 statement:

--- Code: ---# default -O0
# [4] x := x + 1;
	movl	U_$P$PROGRAM_$$_X,%eax
	leal	1(%eax),%eax
	movl	%eax,U_$P$PROGRAM_$$_X

# -O1
# [4] x := x + 1;
	addl	$1,U_$P$PROGRAM_$$_X
--- End code ---
Try at least -O2 (-O3 will be better, but -O4 may cause unwanted side effects); between 2.6 and 3.2 you will see quite some improvement in certain types of code.
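To make that measurable, here is a minimal CPU-bound benchmark sketch (my own example, not from this thread; the program name and iteration count are arbitrary). Compile it twice, e.g. with fpc -O0 bench.pas and fpc -O2 bench.pas, and compare the reported times:

--- Code: Pascal ---program bench;
{$mode objfpc}
uses
  SysUtils;  // for GetTickCount64
var
  i, x: LongWord;
  t0: QWord;
begin
  x := 0;
  t0 := GetTickCount64;
  // The same statement as in the listing above: with -O0 it becomes
  // a load/lea/store sequence, with -O1 and up a single addl.
  for i := 1 to 100000000 do
    x := x + 1;
  WriteLn('x = ', x, ', elapsed: ', GetTickCount64 - t0, ' ms');
end.
--- End code ---
Printing x keeps the loop from being discarded as dead code.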

fcu:
thanks
i chose fcl-image because it's pure Pascal and its PNG decoder does a lot of calculations, and the implementation is still the same, so if the compiler has improved it should do a better job and the speedup should be tangible.

but i am not sure whether the fcl-image package itself was compiled with optimization enabled or not!

@Leledumbo the test was done with the -O3 switch

Martin_fr:
Most of the optimizations I have seen being added in 3.3.1 are for better assembler code.

E.g.:
- replacing a "conditional jump" with a "conditional set value" (which means the CPU won't have to predict the jump)
- changing the order of two statements, to allow the CPU to compute more statements ahead (register renaming, pipeline stalls, ...)

All of those gain time only if the CPU did not find ways to optimize the lesser code on its own.

And the time gain is not that big. Well, if you put a statement that benefits into a loop (1 to 1 million) and nothing else is in that loop, then it's noticeable. In real life, the code that benefits makes up a few percent of your app's code, and if a few percent are sped up just a little, the overall app won't show much of a measurable gain.
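As a hedged illustration of the first bullet (a toy example of my own, not code from the compiler's sources):

--- Code: Pascal ---program condset;
{$mode objfpc}
var
  a, b, m: LongInt;
begin
  a := 3;
  b := 7;
  // Branchy form: naively this compiles to a compare plus a
  // conditional jump, which the CPU has to predict.
  if a > b then
    m := a
  else
    m := b;
  // The optimization described above lets the compiler emit a
  // conditional move/set (CMOVcc/SETcc) for such an "if" instead,
  // so there is no jump left to mispredict.
  WriteLn(m);
end.
--- End code ---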

The other form of *code* optimization is changing the actual code before (or while) compiling it.


--- Code: Pascal ---for a := 1 to 100000 do begin
  b := a * x;
  writeln(b);
end;
--- End code ---
could become

--- Code: Pascal ---b := x;
for a := 1 to 100000 do begin
  writeln(b);
  b := b + x;
end;
--- End code ---
The addition computes faster than the multiplication.
(In the example the slow "writeln" will eat most of the time. In real life this may gain some speed, but it depends... If the CPU was able to do other work in the loop while the multiplication was done, then the benefit may be small(er).)

Such code exists for example when accessing array elements.

I don't know what FPC does with this sort of code... It is possible that there is still some potential for FPC improvements.

Long ago, I did this by hand (among other things): https://gitlab.com/freepascal.org/fpc/source/-/issues/10275
There are rewritten code examples there that do such transformations manually. Of course the readability of the code suffers a lot.
But that particular code gained, IIRC, approx 40% speed.
(Though that was a very old FPC version, and I tricked FPC into doing some register optimization that it may nowadays do without tricks; still, some of those code changes may make a difference on similar code with a current FPC.)

Then there is the stuff the compiler generally won't do for you: the choice of algorithm. Using sorted data and doing a binary search, or even doing hash lookups, can speed up an app by several orders of magnitude.
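A minimal sketch of that point, assuming the data is already sorted (the array contents and function name are my own illustration):

--- Code: Pascal ---program bsearchdemo;
{$mode objfpc}
var
  data: array[0..7] of LongInt = (2, 3, 5, 7, 11, 13, 17, 19);

// Binary search over sorted data: O(log n) probes instead of the
// O(n) scan a linear search would need.
function Find(key: LongInt): LongInt;
var
  lo, hi, mid: LongInt;
begin
  Result := -1;
  lo := 0;
  hi := High(data);
  while lo <= hi do
  begin
    mid := (lo + hi) div 2;
    if data[mid] = key then
      Exit(mid)
    else if data[mid] < key then
      lo := mid + 1
    else
      hi := mid - 1;
  end;
end;

begin
  WriteLn(Find(13));  // prints 5, the index of 13
end.
--- End code ---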


And then again, as I said: data in memory needs to be optimized too, as does the order in which it is accessed. And I don't know if any compiler will do much of that for you.

As an example, google "optimizing matrix multiplication".
If you just do nested loops over the data, you get a lot of accesses to memory addresses far away from each other. That means you completely lose the benefit of holding data in the CPU cache. And any memory operation that cannot be cached is slow. For large data that can be truly significant.
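A sketch of that effect (the matrix size and the i-k-j reordering are my own illustration, not taken from any linked material):

--- Code: Pascal ---program matmul;
{$mode objfpc}
const
  N = 256;
type
  TMat = array[0..N-1, 0..N-1] of Double;
var
  A, B, C: TMat;  // globals are zero-initialized
  i, j, k: Integer;
begin
  for i := 0 to N-1 do
    for j := 0 to N-1 do
    begin
      A[i, j] := i + j;
      B[i, j] := i - j;
    end;
  // The naive i-j-k order would make the inner loop stride through B
  // column by column, touching a new cache line on almost every step.
  // The i-k-j order below walks B[k,*] and C[i,*] row by row, so
  // consecutive iterations reuse the cache lines just loaded.
  for i := 0 to N-1 do
    for k := 0 to N-1 do
      for j := 0 to N-1 do
        C[i, j] := C[i, j] + A[i, k] * B[k, j];
  WriteLn(C[0, 0]);
end.
--- End code ---
Both loop orders compute the same result; only the memory access pattern differs.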

Having said all that, there is room in FPC for more optimizations.
And maybe, or maybe not, there are a few things left that could gain you more than 2 or 3 percent (on very specific code).

But my experience is that today's fpc allows you to write very well performing code already.


