Author Topic: Alioth Benchmark game: fannkuch-redux  (Read 1662 times)

Nitorami

  • Sr. Member
  • ****
  • Posts: 368
Alioth Benchmark game: fannkuch-redux
« on: March 18, 2019, 06:24:04 pm »
Just for fun, I tried to optimize the "fannkuch-redux" program in the Alioth Benchmark.

Under win64, the performance is now some 35% better than with the current version here
https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/fpascal-gpp.html.

Could someone please test this under Linux, using 64-bit Lazarus and argument 12?



Zvoni

  • Sr. Member
  • ****
  • Posts: 299
Re: Alioth Benchmark game: fannkuch-redux
« Reply #1 on: March 18, 2019, 08:30:30 pm »
Can't find unit mtprocs
One System to rule them all, One IDE to find them,
One Code to bring them all, and to the Framework bind them,
in the Land of Redmond, where the Windows lie
---------------------------------------------------------------------
People call me crazy, because i'm jumping out of perfectly fine aircrafts

Nitorami

  • Sr. Member
  • ****
  • Posts: 368
Re: Alioth Benchmark game: fannkuch-redux
« Reply #2 on: March 18, 2019, 08:43:01 pm »
mtprocs is a unit for lightweight threads and comes with Lazarus. Just add its path to the project settings (..components\multithreadprocs).


Zvoni

  • Sr. Member
  • ****
  • Posts: 299
Re: Alioth Benchmark game: fannkuch-redux
« Reply #3 on: March 18, 2019, 09:22:20 pm »
Oh, you gotta be kidding me!
Create a Benchmark, and then you get such messages:

Threading has been used before cthreads was initialized.
Make cthreads one of the first units in your uses clause.
Runtime error 211 at $0000000000412763
  $0000000000412763

Nitorami

  • Sr. Member
  • ****
  • Posts: 368
Re: Alioth Benchmark game: fannkuch-redux
« Reply #4 on: March 18, 2019, 09:33:34 pm »
That is why I am asking someone to test it under Linux. Windows does not use the cthreads unit.

Can you please move mtprocs to the end of the uses clause and try again?

howardpc

  • Hero Member
  • *****
  • Posts: 3201
Re: Alioth Benchmark game: fannkuch-redux
« Reply #5 on: March 18, 2019, 10:06:37 pm »
Linux output here:
Code:
3968050
Pfannkuchen(12) = 65
Time : 7.288

Zvoni

  • Sr. Member
  • ****
  • Posts: 299
Re: Alioth Benchmark game: fannkuch-redux
« Reply #6 on: March 18, 2019, 10:35:02 pm »
Quote from: Nitorami on March 18, 2019, 09:33:34 pm
That is why I am asking someone to test it under Linux. Windows does not use the cthreads unit.

Can you please move mtprocs to the end of the uses clause and try again?

I am under 64-bit Linux!

Leledumbo

  • Hero Member
  • *****
  • Posts: 8114
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Alioth Benchmark game: fannkuch-redux
« Reply #7 on: March 19, 2019, 02:23:40 am »
Modified header:
Code: Pascal
program fannkuch_10;

{$mode objfpc}{$H+}
uses {$IFDEF UNIX} cthreads,  {$ENDIF}  SysUtils, mtprocs;
Output (compiled with -CX -XXs -O4 and run with parameter 12):
Code:
3968050
Pfannkuchen(12) = 65
Time : 9.239
64-bit Manjaro Linux, kernel 4.20.15, ASUS ROG GL503VD i7-7700HQ 16GB DDR4 2400MHz.

After repeating the run a couple of times, the average is about 9 seconds. The fastest version (C) runs in 5.15 seconds on my machine, so the Pascal version is about 75% slower now. My guess is that mtprocs is way behind the OpenMP implementation, in addition to the compiler's code generation itself.
« Last Edit: March 19, 2019, 02:29:04 am by Leledumbo »

Nitorami

  • Sr. Member
  • ****
  • Posts: 368
Re: Alioth Benchmark game: fannkuch-redux
« Reply #8 on: March 19, 2019, 06:40:55 pm »
Thank you !

I'm a bit disappointed; based on tests under Windows I had expected to come a bit closer to C performance.

I don't think this is related to mtprocs, which I only used because it gave me a simple way to try different thread scheduling. Apart from that, the performance with mtprocs is exactly the same as when using fpc's basic thread support instead.
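For readers who have not used mtprocs, here is a minimal sketch of the kind of static work split described above. It is not the benchmark program: the names SumChunk, ChunkCount and Partial are made up for illustration, and the callback signature and the ProcThreadPool.DoParallel call are written as I understand the mtprocs unit shipped with Lazarus, so check them against your installed version.

```pascal
program mtprocs_sketch;

{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX} cthreads, {$ENDIF}  // must come before other units on Unix-like systems
  SysUtils, MTProcs;

const
  ChunkCount = 4;          // hypothetical: number of independent work chunks
  ChunkSize  = 1000000;    // hypothetical: amount of work per chunk

var
  // One result slot per chunk, so the workers never write to the same
  // variable and no locking is required.
  Partial: array[0..ChunkCount - 1] of Int64;

// Callback in the form mtprocs expects; Index selects the chunk to process.
procedure SumChunk(Index: PtrInt; Data: Pointer; Item: TMultiThreadProcItem);
var
  i: PtrInt;
  s: Int64;
begin
  s := 0;
  for i := Index * ChunkSize + 1 to (Index + 1) * ChunkSize do
    s := s + i;
  Partial[Index] := s;
end;

var
  k: Integer;
  Total: Int64;
begin
  // Runs SumChunk for Index = 0 .. ChunkCount - 1 and returns when all are done.
  ProcThreadPool.DoParallel(@SumChunk, 0, ChunkCount - 1, nil);

  // Combine the per-chunk results in the main thread.
  Total := 0;
  for k := 0 to ChunkCount - 1 do
    Total := Total + Partial[k];
  WriteLn('Sum = ', Total);
end.
```

Because each chunk is independent and runs for the whole duration of the program, the scheduling library has very little to do here, which fits the observation that swapping mtprocs for plain threads does not change the timing.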

A few days ago, Akira1346 managed to bring fpc to the top of the score table in the Binary Trees benchmark, using the PasMP multiprocessing library. I might try it, but doubt it would help for fannkuch, which merely runs a few static threads over the entire program runtime.

So I guess the reason for the lower performance is ultimately fpc's code generation / optimization as such. It is probably a bit more on the conservative side than C.

asdf121

  • New Member
  • *
  • Posts: 35
Re: Alioth Benchmark game: fannkuch-redux
« Reply #9 on: March 19, 2019, 07:31:36 pm »
Quote from: Nitorami on March 19, 2019, 06:40:55 pm
A few days ago, Akira1346 managed to bring fpc to the top of the score table in the Binary Trees benchmark, using the PasMP multiprocessing library. I might try it, but doubt it would help for fannkuch, which merely runs a few static threads over the entire program runtime.

I don't think this will help, because PasMP provides locking methods which are useful when several threads work concurrently on the same object. That does not seem to be the case in your example.
Maybe it's even faster to not create too many threads and use fewer instead.

Quote from: Nitorami on March 19, 2019, 06:40:55 pm
So I guess the reason for the lower performance is ultimately fpc's code generation / optimization as such. Probably a bit more on the conservative side than C.

I guess it's because OpenMP generates vectorized asm code and other optimizations for the for-loops, which are not supported in FPC, or at least not in FPC 3.0.4, the version used in the benchmarks.

But doing such stuff is not trivial. You might use https://www.godbolt.org/ to try several different ways of writing it and look at the generated asm code. Sometimes better code gets generated if you use a 'repeat..until' instead of a 'while' loop (not sure, just an example). Another thing you should think about is cache usage, etc.
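To make the loop-form comparison concrete, here is a small sketch with the same computation written once with 'while' and once with 'repeat..until'. The function names SumWhile and SumRepeat are made up for illustration, and nothing here claims that one form is faster; the point is only to have two equivalent bodies whose generated asm can be compared, either in the Compiler Explorer or locally with fpc's -al option, which keeps the assembler file with source lines.

```pascal
program loop_forms;

{$mode objfpc}{$H+}

// Test-first loop: safe for n = 0, the condition is checked before each pass.
function SumWhile(p: PInteger; n: PtrInt): Int64;
begin
  Result := 0;
  while n > 0 do
  begin
    Result := Result + p^;
    Inc(p);   // advances by one Integer element
    Dec(n);
  end;
end;

// Test-last loop: the body always runs once, so the caller must pass n >= 1.
function SumRepeat(p: PInteger; n: PtrInt): Int64;
begin
  Result := 0;
  repeat
    Result := Result + p^;
    Inc(p);
    Dec(n);
  until n <= 0;
end;

const
  Data: array[0..4] of Integer = (3, 1, 4, 1, 5);

begin
  // Both calls should print the same sum for any non-empty input.
  WriteLn(SumWhile(@Data[0], Length(Data)));
  WriteLn(SumRepeat(@Data[0], Length(Data)));
end.
```

Note that the two forms are only interchangeable when the loop is known to run at least once, which is the case for the inner flip loop of a fannkuch-style program only if that precondition is checked beforehand.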