Lazarus

Free Pascal => General => Topic started by: Nitorami on March 18, 2019, 06:24:04 pm

Title: Alioth Benchmark game: fannkuch-redux
Post by: Nitorami on March 18, 2019, 06:24:04 pm
Just for fun, I tried to optimize the "fannkuch-redux" program in the Alioth Benchmark.

Under win64, the performance is now some 35% better than with the current version here
https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/fpascal-gpp.html.

Could someone please test this under Linux, using 64 bit Lazarus and argument 12 ?


Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Zvoni on March 18, 2019, 08:30:30 pm
Can't find unit mtprocs
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Nitorami on March 18, 2019, 08:43:01 pm
mtprocs is a unit for lightweight threads and comes with Lazarus. Just add its path to the project settings (..components\multithreadprocs)

Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Zvoni on March 18, 2019, 09:22:20 pm
Oh, you gotta be kidding me!
Create a Benchmark, and then you get such messages:

Threading has been used before cthreads was initialized.
Make cthreads one of the first units in your uses clause.
Runtime error 211 at $0000000000412763
  $0000000000412763
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Nitorami on March 18, 2019, 09:33:34 pm
That is why I am asking someone to test it under Linux. Windows does not need the cthreads unit.

Can you please move mtprocs to the end of the uses clause and try again ?
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: howardpc on March 18, 2019, 10:06:37 pm
Linux output here:
Code:
3968050
Pfannkuchen(12) = 65
Time : 7.288
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Zvoni on March 18, 2019, 10:35:02 pm
Quote from: Nitorami
That is why I am asking someone to test it under Linux. Windows does not need the cthreads unit.
Can you please move mtprocs to the end of the uses clause and try again ?

I am under 64Bit Linux!
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Leledumbo on March 19, 2019, 02:23:40 am
Modified header:
Code: Pascal
program fannkuch_10;

{$mode objfpc}{$H+}
uses {$IFDEF UNIX} cthreads, {$ENDIF} SysUtils, mtprocs;

Output (compiled with -CX -XXs -O4 and run with parameter 12):
Code:
3968050
Pfannkuchen(12) = 65
Time : 9.239
64-bit Manjaro Linux, kernel 4.20.15, ASUS ROG GL503VD i7-7700HQ 16GB DDR4 2400MHz.

After repeating the run a couple of times, the average is about 9 seconds. The fastest version (C) runs in 5.15 seconds on my machine, so the Pascal version is about 75% slower now. My guess is that mtprocs is well behind the OpenMP implementation, on top of the compiler's code generation itself.
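
For anyone who wants to check the header fix in isolation, here is a minimal sketch of a complete program built on it. It is not the benchmark code: the summing task, SumChunk, NumChunks and ChunkSize are made up for illustration. The DoParallel call follows the form I recall from the multithreadprocs examples, so treat the exact signature as an assumption and check it against the unit source.

```pascal
program mtprocs_header_sketch;

{$mode objfpc}{$H+}

// cthreads must come first on Unix so the thread manager is installed
// before any other unit starts a thread.
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  SysUtils, MTProcs;

const
  NumChunks = 4;     // hypothetical number of parallel work items
  ChunkSize = 250;   // hypothetical size of each work item

var
  PartialSums: array[0..NumChunks - 1] of Int64;

// Called once per index by the thread pool; each call writes only its own slot,
// so no locking is needed.
procedure SumChunk(Index: PtrInt; Data: Pointer; Item: TMultiThreadProcItem);
var
  i: PtrInt;
  s: Int64;
begin
  s := 0;
  for i := Index * ChunkSize + 1 to (Index + 1) * ChunkSize do
    s := s + i;
  PartialSums[Index] := s;
end;

var
  k: Integer;
  Total: Int64;
begin
  // Assumed signature: DoParallel(Proc, StartIndex, EndIndex, Data)
  ProcThreadPool.DoParallel(@SumChunk, 0, NumChunks - 1, nil);

  Total := 0;
  for k := 0 to NumChunks - 1 do
    Total := Total + PartialSums[k];
  WriteLn('Sum of 1..', NumChunks * ChunkSize, ' = ', Total);
end.
```

If cthreads is left out of the uses clause on Linux, this should fail with the same "Threading has been used before cthreads was initialized" message reported earlier in the thread.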
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: Nitorami on March 19, 2019, 06:40:55 pm
Thank you !

I'm a bit disappointed; based on tests under windows I had expected to come a bit closer to C performance.

I don't think this is related to mtprocs, which I only used because it gave me a simple way to try different thread scheduling. Apart from that, the performance with mtprocs is exactly the same as when using fpc's basic thread support instead.

A few days ago, Akira1346 managed to bring fpc to the top of the scores in the Binary Trees benchmark, using the PasMP multiprocessing library. I might try it, but I doubt it would help for fannkuch, which merely runs a few static threads over the entire program runtime.

So I guess the reason for the lower performance is ultimately fpc's code generation / optimization as such. Probably a bit more on the conservative side than C.
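
To illustrate what "a few static threads" with fpc's basic thread support can look like, here is a minimal sketch using TThread from the Classes unit. It is not the code from the attached program: TChunkWorker, NumWorkers and the summing workload are hypothetical stand-ins for the real permutation work.

```pascal
program static_threads_sketch;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  Classes, SysUtils;

const
  NumWorkers = 4;      // hypothetical fixed thread count
  TotalItems = 1000;   // hypothetical amount of work

type
  // Each worker owns a fixed range for its whole lifetime (static scheduling).
  TChunkWorker = class(TThread)
  private
    FFirst, FLast: Integer;
    FSum: Int64;
  protected
    procedure Execute; override;
  public
    constructor Create(AFirst, ALast: Integer);
    property Sum: Int64 read FSum;
  end;

constructor TChunkWorker.Create(AFirst, ALast: Integer);
begin
  // Set the range before the thread is started.
  FFirst := AFirst;
  FLast := ALast;
  inherited Create(False);
end;

procedure TChunkWorker.Execute;
var
  i: Integer;
begin
  FSum := 0;
  for i := FFirst to FLast do
    FSum := FSum + i;
end;

var
  Workers: array[0..NumWorkers - 1] of TChunkWorker;
  w, ChunkSize: Integer;
  Total: Int64;
begin
  ChunkSize := TotalItems div NumWorkers;
  for w := 0 to NumWorkers - 1 do
    Workers[w] := TChunkWorker.Create(w * ChunkSize + 1, (w + 1) * ChunkSize);

  // Join every worker, then combine the per-thread results.
  Total := 0;
  for w := 0 to NumWorkers - 1 do
  begin
    Workers[w].WaitFor;
    Total := Total + Workers[w].Sum;
    Workers[w].Free;
  end;
  WriteLn('Total = ', Total);
end.
```

With only a handful of long-running threads like this, thread start-up and join costs are paid once, which is why swapping the threading library would not be expected to change the timing much.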
Title: Re: Alioth Benchmark game: fannkuch-redux
Post by: asdf121 on March 19, 2019, 07:31:36 pm
Quote from: Nitorami
A few days ago, Akira1346 managed to bring fpc to the top of score in the Binary Trees benchmark, using the PasMP multiprocessing library. I might try it, but doubt it would help for fannkuch, which merely runs a few static threads over the entire program runtime.

I don't think this will help, because PasMP provides locking methods that are useful when several threads work concurrently on the same object, and that does not seem to be the case in your example.
It may even be faster to use fewer threads instead of generating too many.

Quote from: Nitorami
So I guess the reason for the lower performance is ultimately fpc's code generation / optimization as such. Probably a bit more on the conservative side than C.

I guess it's because OpenMP generates vectorized asm code and other optimizations for the for-loops that FPC does not support, or at least not FPC 3.0.4, the version used in the benchmarks.

Doing this kind of tuning is not trivial, though. You could use https://www.godbolt.org/ to try several different ways of writing the code and look at the generated asm. Sometimes better code gets generated if you use a 'repeat..until' loop instead of a 'while' loop (not sure, just an example). Another thing you should think about is cache usage.
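
As a starting point for that kind of comparison, here is a small sketch with the same loop written both ways. SumWhile and SumRepeat are hypothetical names and the summing task is only a placeholder; whether either form produces better asm depends on the compiler version and optimization flags, so this makes no claim about which one wins.

```pascal
program loop_forms_sketch;

{$mode objfpc}{$H+}

const
  N = 8;

type
  TData = array[0..N - 1] of LongInt;

// 'while' form: the condition is tested before every iteration.
function SumWhile(const a: TData; count: LongInt): LongInt;
var
  i: LongInt;
begin
  Result := 0;
  i := 0;
  while i < count do
  begin
    Result := Result + a[i];
    Inc(i);
  end;
end;

// 'repeat..until' form: the condition is tested after the body,
// so the empty case needs an explicit guard up front.
function SumRepeat(const a: TData; count: LongInt): LongInt;
var
  i: LongInt;
begin
  Result := 0;
  if count <= 0 then
    Exit;
  i := 0;
  repeat
    Result := Result + a[i];
    Inc(i);
  until i >= count;
end;

var
  d: TData = (1, 2, 3, 4, 5, 6, 7, 8);
begin
  // Both functions must return the same value; only the generated code may differ.
  WriteLn(SumWhile(d, N), ' ', SumRepeat(d, N));
end.
```

Compiling both functions with the same flags as the benchmark (for example -O4) and comparing the two inner loops in the asm output shows whether the rewrite changes anything for a given FPC version.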