@Akira: Thanks for submitting this, I guess I would have dragged it on forever...
A note on the excessive use of the inline modifier: it really only makes sense for very short leaf functions. Certainly not for MakeTree, which is used recursively and cannot be inlined in the first place.
The benefit of inlining on modern processors is dubious to me anyway. Sometimes it even makes code slower; this probably has to do with how the CPU optimises the program flow.
@Thaddy
I have a feeling that many more "competition" examples for FPC are less than optimal.
Not sure. I have tinkered with a few, and could not do a lot. An example is fannkuch-redux, where I may get 10% to 20% off. The problem is that the effect of changes is often not predictable; it's trial and error, and the result may be different on different systems.
The core of fannkuch is the flip function. This is another example where you get faster code without inlining. Putting some local variables on the heap rather than the stack also results in a small but distinct improvement, but only on win64; on win32 the effect is exactly the opposite. On Linux, it may again be different (cannot test this).
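For readers who have not looked at the benchmark, here is a rough sketch of what the flip step does, written in C since that is the gcc side of the comparison. This is not the code from the FPC or gcc entries; the name count_flips and its signature are made up for illustration, and the permutation is assumed to hold the values 1..n.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from any benchmark entry.
   perm holds a permutation of 1..n and is modified in place.
   Repeatedly reverse the first k elements, where k is the current
   first element, until the first element is 1. Returns the number
   of reversals (flips) performed. */
int count_flips(int *perm)
{
    int flips = 0;
    int k = perm[0];

    while (k != 1) {
        /* reverse perm[0 .. k-1] by swapping from both ends */
        size_t lo = 0;
        size_t hi = (size_t)k - 1;
        while (lo < hi) {
            int t = perm[lo];
            perm[lo] = perm[hi];
            perm[hi] = t;
            lo++;
            hi--;
        }
        flips++;
        k = perm[0];
    }
    return flips;
}
```

Called on {4, 2, 1, 5, 3} this returns 5. The inner swap loop is tiny and runs a huge number of times, which is presumably why small changes in how the compiler lays it out show up in the timings.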
Thus, even after a lot of manual optimization, we cannot really tell how the code will perform on a different target system. The gcc compiler may simply be better at producing the best code on each specific system. That said, I am talking about really small effects here, 10-20% up or down, not too far from the performance of gcc, and irrelevant for a real-world program.
What's really poor is FPC's performance in the Mandelbrot benchmark: a factor of ten slower than gcc, and also much slower than many others, Rust, Swift, C#, .NET, Java, even LISP and Haskell. And it cannot be improved, at least I did not manage to, simply because FPC lacks support for vector processing and so gives away a factor of 8 in performance.
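To illustrate what vector processing means here, a minimal sketch in C of iterating eight points in lockstep. The function name mandel8 and the lane count of 8 are assumptions chosen to mirror the factor 8 mentioned above; this is not the code of any actual benchmark entry, and whether a compiler turns the inner loop into SIMD instructions depends on the compiler, its flags and the target.

```c
#define LANES 8

/* Hypothetical sketch: iterate z = z*z + c for LANES points at once.
   cr/ci hold the real and imaginary parts of the eight c values.
   Returns a bitmask in which bit l is set if point l has not left
   the circle |z| <= 2 after max_iter iterations. */
unsigned mandel8(const double *cr, const double *ci, int max_iter)
{
    double zr[LANES] = {0.0};
    double zi[LANES] = {0.0};
    unsigned inside = 0xFFu;            /* all eight lanes start inside */

    for (int it = 0; it < max_iter && inside != 0; it++) {
        /* the same arithmetic on every lane: a candidate for SIMD */
        for (int l = 0; l < LANES; l++) {
            double r = zr[l];
            double i = zi[l];
            zr[l] = r * r - i * i + cr[l];
            zi[l] = 2.0 * r * i + ci[l];
            if (zr[l] * zr[l] + zi[l] * zi[l] > 4.0)
                inside &= ~(1u << l);   /* lane l has escaped */
        }
    }
    return inside;
}
```

The point is only that all eight lanes execute identical operations, so a compiler with vector support can do them in a few wide instructions, while a scalar compiler has to do them one after another.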