Author Topic: FPC.cfg or fpc options to improve performance (Win x64)  (Read 1308 times)

HenrikErlandsson

  • New Member
  • *
  • Posts: 33
  • ^ Happy coder :)
    • Coppershade
FPC.cfg or fpc options to improve performance (Win x64)
« on: August 31, 2022, 10:04:37 pm »
I'm experienced in hardware and programming, but I don't know which fpc command line compiler options / fpc.cfg options improve execution speed for modern CPUs.

I've installed fpc 3.2.2 with the latest Lazarus for Win x64. (x86_64-win64)

When using fpc in PowerShell and Measure-Command, I've tried to rename fpc.cfg and set my own options, such as -g- and -g+ but I've not come to a good conclusion about typical flags to set for a "Release" mode. (In Lazarus previously, there was a Release mode that you didn't have to create yourself, and which could show you some example options. In Delphi 7, it was relatively easy to experiment with the options and set them for a Release version.)

I'm also looking for a good example where you can get the execution time down from the default fpc / fpc.cfg options shipped with Lazarus. (I'm thinking of a single-unit program without a GUI that does not do much console output, i.e. not lots of writeln.)

The reason for posting was this video: https://www.youtube.com/watch?v=NLlWpmrlbPo

In the video, it's likely that his gcc also builds for target x86_64, even though it's not explicitly shared. I can confirm a higher time for fpc also on a later and faster CPU.

Is the example a bad one? I think I would like something that uses the FPU, cache, stack, and memory allocation. This one is limited to integers, and the conclusion that I've drawn is that the example measures loop overhead (the stuff in between the code) much more than it does the code inside the loops.

Here is the code, with only "sum := 0" to make it closer to the C code, and unnecessary begin/ends removed. It did not affect the execution time.

Code: Pascal
program numbers;

var
  num, i, j, sum: integer;

begin
  sum := 0;
  num := 20000;
  for j := num downto 1 do
  begin
    for i := (num - 1) downto 1 do
      if num mod i = 0 then sum := sum + i;

    if num = sum then writeln('num: ', num, ' / sum: ', sum);

    sum := 0;
    num := num - 1;
  end;
end.

Most of all, I'm after some typical commandline or fpc.cfg for Release (and Debug).

I have searched for this topic here, but some results were quite old and the IDE has changed.

A big concern is that I can't find a document that says what each compiler setting defaults to. The settings I know are relevant are of course the debugger options and range checking etc., but I don't know what their default values are.
« Last Edit: August 31, 2022, 10:17:44 pm by HenrikErlandsson »
Pushed on stack: 6502 / Z80 / Amiga / ARM assembler, Pascal, Delphi, Lingo, Obj-C, Lua, FPC, C# + web front-/backend.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9874
  • Debugger - SynEdit - and more
    • wiki
« Reply #1 on: August 31, 2022, 10:34:58 pm »
For release, you really don't need much, unless you have switched debugging stuff on somewhere (in some config).

You want to go with  -O3  -Si

If you need it, -g- should turn off most of the debugging stuff (but none of it is on by default...)
The same goes for -Sa- -CR- -Cr- -Ci- -Co- -Ct-
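
As a rough sketch, the same switches can also be set per source file with compiler directives. The mapping in the comments is how I read the FPC documentation, so treat it as an assumption and check it against your compiler version; the program name and the Square function are made up for illustration.

```pascal
program ReleaseDirectives;
{$mode objfpc}
{$OPTIMIZATION LEVEL3}   // roughly -O3
{$INLINE ON}             // roughly -Si (honour "inline" on routines)
{$ASSERTIONS OFF}        // roughly -Sa-
{$OBJECTCHECKS OFF}      // roughly -CR-
{$RANGECHECKS OFF}       // roughly -Cr-
{$IOCHECKS OFF}          // roughly -Ci-
{$OVERFLOWCHECKS OFF}    // roughly -Co-
{$S-}                    // roughly -Ct- (stack checking)

// Hypothetical helper, only here so the directives have something to act on.
function Square(x: LongInt): LongInt; inline;
begin
  Result := x * x;
end;

begin
  WriteLn(Square(12));
end.
```

Directives in the source override what comes from fpc.cfg or the command line for that file, which can be handy when only one unit is speed critical.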

For debugging you go with
-CRriot -Sa -O- -gh

and then alternating with one of the following 5 options
-gt-
-gt- -gt
-gt- -gtt
-gt- -gttt
-gt- -gtttt

depending on the debugger:
lldb: -gw
gdb, maybe lldb: -gw  -godwarfsets 
fpdebug:   -gw3 



If you measure speed (comparing between those settings, comparing to other compilers, or even comparing the same settings after changes to the code), be careful.

Someone from the fpc team recently mailed a link to some very interesting research...

The side effects of a change can cause way bigger speed gains/losses than the change itself.

For example:
You go from -O1 to -O2 and maybe that makes it 5% faster (if side effects are eliminated).
But the side effects may cause a 10% speed change, which at random could be a gain or a loss.
So yes, the same code with -O2 can be slower....

Side effects would be that because the length of machine code changes, other machine code will start at a changed address.

It's down to how modern CPUs (CISC, like Intel) work.
One example: the exact same code may run 10% (or more) faster or slower if the entire code gets moved just a few bytes to a different start address.


So measurements can be misleading.

HenrikErlandsson

« Reply #2 on: September 01, 2022, 02:08:24 am »
Thx Martin_fr,

The example looks simple, even bare bones. I hoped it would be simple enough to find any issue with the generated code. -O3 -Si reduces execution time consistently by 38% (compared to no options at all; using the Lazarus-provided fpc.cfg, I presume).

Other simple examples, like Rosetta Pi calculating 999 decimals execute very fast, in a few ms.

I'm wondering if it has something to do with multithreading? A common problem for all executables in all languages is that too much work is performed on the main thread, which normally uses only the first CPU core.

Many years ago, Pascal was, say, 20% slower than C. Not 200% slower as in the video. It's very strange.

Is alignment really a thing for 64-bit Intel CPUs? Is there a directive or similar that can be used to "align code"? I've not come across this on any CPUs I've coded Assembler for.

Martin_fr

« Reply #3 on: September 01, 2022, 02:43:55 am »
Quote
I'm wondering if it has something to do with multithreading? A common problem for all executables in all languages is that too much work is performed on the main thread, which normally uses only the first CPU core.
The Free Pascal compiler does not add threading to your app if you haven't done so yourself.
If you want threaded execution, you need to write code that uses threads (e.g. TThread).

Quote
A lot of years ago, Pascal was say, 20% slower than C. Not 200% slower as in the video. It's very strange.
At the risk of seeming picky: it is not about the language, it is about the compiler.

Of course when speaking of the past you could always imply "even the compilers that produced the fastest code for that language"...


Quote
Is alignment really a thing for 64-bit Intel CPUs? Is there a directive or similar that can be used to "align code"? I've not come across this on any CPUs I've coded Assembler for.
https://www.youtube.com/watch?v=r-TLSBdHe1A
If you are in a hurry start at minute 10.

Martin_fr

« Reply #4 on: September 01, 2022, 03:06:31 am »
And just to add to the alignment point: it was pointed out to me when I ran into it myself.
I was comparing 4 versions of code doing the same job, and the alignment made a significant difference (explaining the really weird results that I got and couldn't explain while I had no idea about it).

Martin_fr

« Reply #5 on: September 01, 2022, 03:22:00 am »
There are other optimizations that you can use if you need to speed up some code. But they do not come as command line options.

Code: Pascal
procedure Foo(const a: TFoo; const b: string);
The compiler may pass the data by reference if it thinks that is beneficial. And for an AnsiString the compiler will omit the ref-counting.

However, remember this syntax means: you promised the compiler a and b will not change.
The compiler will reject the obvious. But...
Say you pass "GlobalVarA" => and the code (or any code in nested subroutines or callbacks) changes GlobalVarA => then you broke the promise. And then the app can randomly fail. It may run well 100 times, and then fail the 101st time.
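
A minimal sketch of that broken promise, assuming objfpc mode with long strings. RiskyEcho and SafeEcho are made-up names; the risky call is left commented out because its behaviour is undefined and may differ from run to run.

```pascal
program ConstPromise;
{$mode objfpc}{$H+}
uses SysUtils;

var
  GlobalVarA: string;

// Risky: "a" is const, but the body changes the variable that was passed in.
function RiskyEcho(const a: string): string;
begin
  GlobalVarA := 'changed';  // may free the buffer that "a" still points to
  Result := a;              // undefined: old text, garbage, or a crash
end;

// Safer: a by-value parameter holds its own reference to the string data.
function SafeEcho(a: string): string;
begin
  GlobalVarA := 'changed';
  Result := a;              // still the text the caller passed in
end;

begin
  // Built at run time, so the string is heap-allocated and ref-counted.
  GlobalVarA := 'value ' + IntToStr(1);
  WriteLn(SafeEcho(GlobalVarA));
  // WriteLn(RiskyEcho(GlobalVarA));  // breaks the promise; behaviour is undefined
end.
```

The by-value version pays for the extra ref-counting, which is exactly the cost that const avoids, so it is a trade between speed and safety.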


Code: Pascal
{$ImplicitExceptions off}
May save a bit of time, usually when you deal with ansistrings or dynamic arrays.

But, if you use exceptions, then it will leak memory (and if that builds up, really slow down things and crash eventually)

Basically, if you deal with managed types (ansistring, dyn array, interface, ...) then the compiler inserts a hidden "try finally" to make sure the mem is freed.

The directive means there is no "try finally". Just code to free the memory. Works fine, if the procedure is exited normally.
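
A small sketch of where that hidden frame would normally sit. Padded is a made-up function, and the comments describe the behaviour as explained above rather than something verified against the generated assembly.

```pascal
program ImplicitFrames;
{$mode objfpc}{$H+}
{$ImplicitExceptions off}  // no hidden try/finally around managed locals

// Hypothetical example routine with one managed local variable.
function Padded(const s: string): string;
var
  tmp: string;  // managed type: normally protected by an implicit try/finally
begin
  tmp := s + '!';
  // With the directive off, an exception escaping from here would skip
  // the cleanup of tmp, so its memory would leak.
  Result := tmp;
  // On a normal exit tmp is still released as usual.
end;

begin
  WriteLn(Padded('ok'));
end.
```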


google "freepascal whole program optimization".
It can get you another small increase.  (e.g. de-virtualization)


But the most gains will be in choosing a good algorithm. (Big O / thread pools / ...)

And if you have huge data, organizing it to reduce cache misses. (I.e. optimize the layout for the order in which you access it)
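
To make the cache point concrete, here is a rough sketch that walks the same matrix in two orders. It assumes the usual Pascal layout where the last index is contiguous in memory; the names and the size N are made up, and no timing numbers are claimed here, so measure on your own machine.

```pascal
program CacheOrder;
{$mode objfpc}

const
  N = 1024;

type
  TMatrix = array[0..N - 1, 0..N - 1] of LongInt;

var
  M: TMatrix;  // global, so the 4 MB do not land on the stack
  r, c: LongInt;

// Inner loop runs over the last index: memory is read sequentially.
function SumRowMajor(const A: TMatrix): Int64;
var
  row, col: LongInt;
begin
  Result := 0;
  for row := 0 to N - 1 do
    for col := 0 to N - 1 do
      Result := Result + A[row, col];
end;

// Inner loop runs over the first index: each read jumps N elements ahead.
function SumColumnMajor(const A: TMatrix): Int64;
var
  row, col: LongInt;
begin
  Result := 0;
  for col := 0 to N - 1 do
    for row := 0 to N - 1 do
      Result := Result + A[row, col];
end;

begin
  for r := 0 to N - 1 do
    for c := 0 to N - 1 do
      M[r, c] := r + c;
  // Same result either way; only the access pattern differs.
  WriteLn(SumRowMajor(M) = SumColumnMajor(M));
end.
```

Timing each function separately should show whether the access order matters for a data set of this size on a given CPU.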


« Last Edit: September 01, 2022, 03:52:54 am by Martin_fr »

Martin_fr

« Reply #6 on: September 01, 2022, 03:31:35 am »
And I forgot: you can compile for a minimum supported CPU (google it, or see the Lazarus IDE project settings).

I.e. require CoreI, or something. Then the compiler can make use of newer instructions. Of course older PCs will not run this.

IIRC you can also enable AVX (of course it depends on whether your code benefits...).

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
« Reply #7 on: September 01, 2022, 12:13:52 pm »
Comparing percentages for small, single-source math problems does not always make for the most useful benchmark. Problems specified in one function in one single file can sometimes be calculated at compile time by advanced compilers.

Also, it is not all calculation; possibly the difference measures default runtime I/O more than calculation.

Proper benchmarking is an art, and the video is a very low-quality one, even more so because there is zero analysis of the result, just a number. It would have been easy to eliminate the I/O from the benchmark and/or look at the generated code.

HenrikErlandsson

« Reply #8 on: September 01, 2022, 11:25:58 pm »
@Martin_fr, interesting. But the 2.8% and 40% are still quite far from the +200% difference, and the example should end up being small enough for cache and paging to not matter as much. The executable size is quite big: with default options, a few lines of code start at 45K. I say this not to change the subject (there is already a thread on that), but because maybe the size causes paging in the middle of the inner loop, or something.

@marcov, I agree it's not the most useful example, it doesn't use the stack or memory operations much, for example. I would prefer something larger and involving some common techniques like a sort algorithm with doubly linked lists or binary trees.

I've already played a little bit and can't make any substantial changes since it's so simple. And removing the writeln changes nothing, after all it only prints 4 lines. But it could be fun to play and find out, or read how, to optimize a better example for speed (specifically for fpc).

Martin_fr

« Reply #9 on: September 02, 2022, 03:24:03 am »
Well, here is another way to approach speed.
valgrind --tool=callgrind  yourapp
kcachegrind callgrind.....

Both are Linux only.

It's a profiler. Old style, yes. But as long as you are aware of side effects (so if you ever have better code that runs slower, you won't go crazy ... hopefully), profiling is still a good tool. (IMHO)
And mind, yourapp is going to run in slow-motion while being profiled.




I wouldn't worry about code size and swapping....
In your innermost loop (if it is time critical), you would probably have a small amount of code, and few or no subroutine calls (or they are inlined).



Also fpc main branch had some work on better asm generation, might save a few bytes of code, and gain a little bit of speed.

marcov

« Reply #10 on: September 02, 2022, 01:04:11 pm »
You could also try to put the code in a function and not run it from main.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
« Reply #11 on: September 02, 2022, 05:44:18 pm »
A question to the original poster: did you do the test yourself on your computer?

Peeking at the assembly generated by FPC 3.2.2 for Windows x86_64, it doesn't seem that better assembler generation could cut the execution time for this little program from 2.31 secs for FPC to 0.741 for C++, unless C++ can do parallelization. Does it?

The various compiler optimisation levels have nearly zero effect on timing result using GetTickCount64 (maybe 5% at most and ~10% for 3.2.2 for i386 target).

Quote
You could also try to put the code in a function and not run it from main.
I tried it; it doesn't change anything significantly.
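
For reference, a minimal sketch of that variant: the loops moved into routines and timed with GetTickCount64. The names SumOfProperDivisors and FindPerfect are made up here, and LongInt is used instead of integer as an assumption, so that the variable size does not depend on the compiler mode.

```pascal
program numbers_func;
{$mode objfpc}
// Compile for example with: fpc -O3 -Si numbers_func.pas
uses SysUtils;

// Inner loop of the original program: add up all divisors of n below n.
function SumOfProperDivisors(n: LongInt): LongInt;
var
  i: LongInt;
begin
  Result := 0;
  for i := n - 1 downto 1 do
    if n mod i = 0 then
      Result := Result + i;
end;

// Outer loop: report every number that equals the sum of its divisors.
procedure FindPerfect(limit: LongInt);
var
  n: LongInt;
begin
  for n := limit downto 1 do
    if SumOfProperDivisors(n) = n then
      WriteLn('num: ', n, ' / sum: ', n);
end;

var
  t0: QWord;
begin
  t0 := GetTickCount64;
  FindPerfect(20000);
  WriteLn('elapsed ms: ', GetTickCount64 - t0);
end.
```

It prints the four perfect numbers below 20000 (8128, 496, 28 and 6) followed by the elapsed time, which varies per machine and per run.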

HenrikErlandsson

« Reply #12 on: September 02, 2022, 08:39:30 pm »
@BrunoK, the test in the video is on Debian. For my tests I've been running it in PowerShell on Win7x64 and measuring ticks with {Measure-Command .\my.exe | Out-Default}

Yesterday I tested out all the compiler options in Lazarus and how they affected speed and size. I set up a project and Release config and can replicate the results from commandline.

I was toying with the idea that his gcc might compile for i386 and that is somehow faster. As I see it, I just follow the instructions on the Wiki and set the platform and target CPU family? And the "hidden/built-in" math/output units will be recompiled for i386 as well?

 
