Most of the optimizations I have seen added in 3.3.1 are for better assembler code.
E.g.:
- replacing a "conditional jump" with a "conditional set value" (which means the CPU won't have to predict the jump)
- changing the order of two statements, to allow the CPU to compute further ahead (register renaming, avoiding pipeline stalls, ...)
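As a tiny sketch of the first point (this only shows the source-level equivalent; whether the compiler actually emits a SETcc/CMOV instruction here is up to FPC and the target CPU):

```pascal
program BranchlessDemo;
{ "Conditional jump" vs "conditional set value" at the source level.
  The second form gives the compiler an easy chance to emit a
  branch-free "set" instruction instead of a predicted jump. }

function StepBranch(a, b: Integer): Integer;
begin
  // branching version: the CPU has to predict this jump
  if a < b then
    StepBranch := 1
  else
    StepBranch := 0;
end;

function StepBranchless(a, b: Integer): Integer;
begin
  // branch-free version: Ord() of a Boolean is 0 or 1
  StepBranchless := Ord(a < b);
end;

begin
  WriteLn(StepBranch(3, 5), ' ', StepBranchless(3, 5));  // 1 1
  WriteLn(StepBranch(5, 3), ' ', StepBranchless(5, 3));  // 0 0
end.
```

Both functions compute the same result; the difference only shows up in the generated assembler.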
All of those gain time only if the CPU did not already find ways to optimize the lesser code on its own.
And the time gained is not that big either. If you put a statement that benefits into a loop (1 to 1 million), and nothing else is in that loop, then it's noticeable. In real life the code that benefits makes up a few percent of your app's code, and if a few percent are sped up just a little, the overall app won't show much of a measurable gain.
The other form of *code* optimization is changing the actual code itself while compiling it.
for a := 1 to 100000 do begin b := a * x; writeln(b); end;
could become
b := x; for a := 1 to 100000 do begin writeln(b); b := b + x; end;
The addition computes faster than the multiplication.
(In the example the slow writeln will eat most of the time. In real life this may gain some speed, but it depends: if the CPU was able to do other work in the loop while the multiplication was running, then the benefit may be small(er).)
Such code exists, for example, when accessing array elements.
I don't know what FPC does on this sort of code... It is possible that there is still some potential for FPC improvements.
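To illustrate the array case, here is the same strength reduction done by hand (a sketch of the technique only, not a claim about what FPC actually emits): instead of computing `base + i * SizeOf(element)` for every `Data[i]`, a pointer is simply incremented.

```pascal
program ArrayStrengthReduction;
{ Manual strength reduction for array access: replace the
  per-iteration address computation "base + i * SizeOf(Integer)"
  with a pointer that advances by one element each time. }

type
  PInt = ^Integer;
const
  N = 1000;
var
  Data: array[0..N - 1] of Integer;
  i, SumIndexed, SumPointer: Integer;
  p: PInt;
begin
  for i := 0 to N - 1 do
    Data[i] := i;

  // indexed access: the address of Data[i] is recomputed each iteration
  SumIndexed := 0;
  for i := 0 to N - 1 do
    SumIndexed := SumIndexed + Data[i];

  // pointer walk: one pointer increment per iteration instead
  SumPointer := 0;
  p := @Data[0];
  for i := 0 to N - 1 do
  begin
    SumPointer := SumPointer + p^;
    Inc(p);  // advances by SizeOf(Integer)
  end;

  WriteLn(SumIndexed, ' ', SumPointer);  // both print 499500
end.
```

Whether this is worth it today depends on the compiler: for simple loops a current FPC may already generate equivalent code on its own.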
Long ago, I did this by hand (among other stuff).
https://gitlab.com/freepascal.org/fpc/source/-/issues/10275
There are rewritten code examples there that manually do such implementations. Of course the readability of the code suffers a lot.
But that particular code gained IIRC approx 40% speed.
(Though that was a very old FPC version, and I tricked FPC into doing some register optimization that it may nowadays do without tricks; yet some of those code changes may still make a difference on similar code when using a current FPC.)
Then there is the stuff the compiler generally won't do for you: the choice of algorithm. Using sorted data and doing a binary search, or even doing hash lookups => that can speed up an app by several orders of magnitude.
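A minimal sketch of that kind of algorithmic choice (my own example, not from the issue above): a linear scan touches every element, O(n), while a binary search on sorted data needs only O(log n) steps.

```pascal
program SearchDemo;
{$mode objfpc}
{ Algorithm choice beats micro-optimization: linear scan is O(n),
  binary search on sorted data is O(log n). }

function LinearSearch(const A: array of Integer; Key: Integer): Integer;
var
  i: Integer;
begin
  for i := Low(A) to High(A) do
    if A[i] = Key then
      Exit(i);
  Result := -1;
end;

function BinarySearch(const A: array of Integer; Key: Integer): Integer;
var
  Lo, Hi, Mid: Integer;
begin
  Lo := Low(A);
  Hi := High(A);
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;       // halve the search range each step
    if A[Mid] = Key then
      Exit(Mid)
    else if A[Mid] < Key then
      Lo := Mid + 1
    else
      Hi := Mid - 1;
  end;
  Result := -1;
end;

var
  Sorted: array[0..6] of Integer = (2, 3, 5, 7, 11, 13, 17);
begin
  WriteLn(LinearSearch(Sorted, 11), ' ', BinarySearch(Sorted, 11));  // 4 4
  WriteLn(BinarySearch(Sorted, 6));                                  // -1
end.
```

For 7 elements the difference is irrelevant; for millions of lookups on large data it is the orders-of-magnitude gain mentioned above.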
And then again, as I said: data in memory needs to be optimized too, or the order in which it is accessed. And I don't know if any compiler will do much of that for you.
As an example, google "optimizing matrix multiplication".
If you just do nested loops over the data, you get a lot of accesses to memory addresses far away from each other. That means you completely lose the benefit of holding data in the CPU cache. And any memory operation that cannot be served from the cache is slow. For large data that can be truly significant.
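The simplest version of that trick can be shown with plain loop reordering (a sketch; the real gain depends on matrix and cache sizes): swapping the two inner loops turns the strided walk through B into a sequential one.

```pascal
program MatMulLoopOrder;
{$mode objfpc}
{ Memory-access order matters: in the naive i-j-k order the inner
  loop strides through B column-wise (addresses N*SizeOf(Double)
  apart); the i-k-j order walks B and C row-wise, which is far
  friendlier to the CPU cache. Both orders compute the same product. }

const
  N = 64;
type
  TMat = array[0..N - 1, 0..N - 1] of Double;
var
  A, B, C1, C2: TMat;
  i, j, k: Integer;
begin
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
    begin
      A[i, j] := i + j;
      B[i, j] := i - j;
      C1[i, j] := 0;
      C2[i, j] := 0;
    end;

  // naive order: inner loop jumps N elements through B each step
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
      for k := 0 to N - 1 do
        C1[i, j] := C1[i, j] + A[i, k] * B[k, j];

  // cache-friendly order: inner loop walks B[k, ...] and C2[i, ...] sequentially
  for i := 0 to N - 1 do
    for k := 0 to N - 1 do
      for j := 0 to N - 1 do
        C2[i, j] := C2[i, j] + A[i, k] * B[k, j];

  // same result, very different memory-access pattern
  WriteLn(C1[5, 7] = C2[5, 7]);
end.
```

Serious implementations go further (blocking/tiling to fit cache lines), but even this reordering can be measurable on large matrices.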
Having said all that, there is room in fpc for more optimizations.
And maybe, or maybe not, there are a few things left that may gain you more than 2 or 3 percent (on very specific code).
But my experience is that today's FPC already allows you to write very well performing code.