Lazarus

Programming => General => Topic started by: julkas on June 23, 2019, 03:38:18 pm

Title: [CLOSED-Possible bug in RTL PopCnt] - Critical question about RTL performance
Post by: julkas on June 23, 2019, 03:38:18 pm
Same code on same platform - Intel Xeon E3-1220 v5 (FPC 3.0.0 - I can't change compiler!)
Using RTL PopCnt - 15s.
Using ASM GetBitCount - 2.3s.
Please advice!
Thanks.
Code: Pascal  [Select]
  1. {$ASMMODE INTEL}
  2. function GetBitCount(num: QWORD): integer; assembler; nostackframe;
  3. asm
  4.   POPCNT rax, num
  5. end;
  6.  
P.S. Same algo in pure CLANG using __builtin_popcountll - 1.9s
Title: Re: Critical question about RTL performance
Post by: Laksen on June 23, 2019, 03:59:17 pm
Show us your code please  :)
Title: Re: Critical question about RTL performance
Post by: Thaddy on June 23, 2019, 03:59:58 pm
Popcnrt is already there, no need for asm. https://www.freepascal.org/docs-html/rtl/system/popcnt.html Docs are there to help you..... There is no slowdown except in -O-
Title: Re: Critical question about RTL performance
Post by: julkas on June 23, 2019, 04:06:38 pm
Popcnrt is already there, no need for asm. https://www.freepascal.org/docs-html/rtl/system/popcnt.html Docs are there to help you..... There is no slowdown.
Yes, exactly. If I use PopCnt RTL function - my algo runs in 15s.
With ASM - 2.3s.
What do you mean - There is no slowdown?
Title: Re: Critical question about RTL performance
Post by: julkas on June 23, 2019, 04:24:14 pm
Show us your code please  :)
One line, one place change PopCnt to GetBitCount.
Title: Re: Critical question about RTL performance
Post by: ccrause on June 23, 2019, 04:29:55 pm
On my laptop (Celeron N3050 with popcnt) the compiler by default fall back to the generic fpc_popcnt_xxx function.  If I specify -CpCOREI when compiling the test the popcnt instruction is generated.  It seems that the compiler classify my CPU as ATHLON64.
Title: Re: Critical question about RTL performance
Post by: Thaddy on June 23, 2019, 04:39:01 pm
specify -CfAVX or -CfAVX2 and -CpCOREAVX2
FPC is conservative so the default is AMD64
Title: Re: Critical question about RTL performance
Post by: julkas on June 23, 2019, 05:49:43 pm
I can't change, modify compiler - linker options (are predefined). Can it be done in-line (in source code)?
Thanks.
Title: Re: Critical question about RTL performance
Post by: Thaddy on June 23, 2019, 06:05:25 pm
No. You need the compiler options. Why can't you change them?
In Lazarus you can add it to the compile options.
(Although such tests should be done with the raw compiler!!!)
Title: Re: Critical question about RTL performance
Post by: julkas on June 23, 2019, 06:12:46 pm
No. You need the compiler options. Why can't you change them?
In Lazarus you can add it to the compile options.
(Although such tests should be done with the raw compiler!!!)
Because I am talking about online judge.
P.S. My code runs in 7s with soft PopCount bellow. What's going on?
Code: Pascal  [Select]
  1. function PopCount(x: QWORD): SizeInt; inline;
  2. begin
  3.   Result := 0;
  4.   while x <> 0 do
  5.   begin
  6.     Inc(Result);
  7.     x := x and (x - 1);
  8.   end;
  9. end;
Title: Re: [CLOSED] Critical question about RTL performance
Post by: ASerge on June 24, 2019, 03:51:21 pm
Same code on same platform - Intel Xeon E3-1220 v5 (FPC 3.0.0 - I can't change compiler!)
Using RTL PopCnt - 15s.
Using ASM GetBitCount - 2.3s.
Please advice!
Thanks.
Code: Pascal  [Select]
  1. {$ASMMODE INTEL}
  2. function GetBitCount(num: QWORD): integer; assembler; nostackframe;
  3. asm
  4.   POPCNT rax, num
  5. end;
  6.  
I think you should open a bug that the compiler does not inline the popcnt command for processors that have this capability.
And your function better write as:
Code: Pascal  [Select]
  1. {$ASMMODE INTEL}
  2. function GetBitCount(const Num: QWord): QWord; assembler; nostackframe;
  3. asm
  4.   popcnt @Result, Num
  5. end;
  6.  
Title: Re: [CONTINUE] Critical question about RTL performance
Post by: julkas on June 24, 2019, 07:21:07 pm
BsfQWORD, BsrQWORD are working as expected at full speed (same environment). Something going wrong with PopCnt.
BTW PopCnt is overloaded, Bsf, Bsr family - not.
Here is original call to RTL PopCnt - time ~ 15s.
Code: Pascal  [Select]
  1. Inc(c, PopCnt(a.bs[i] and b.bs[i]));

I made little change, same time
Code: Pascal  [Select]
  1. var
  2.   t: QWORD;
  3. ...
  4.   t := a.bs[i] and b.bs[i];
  5.   t := PopCnt(t);
  6.   Inc(c, t);

And now black magic - another change. Time ~ 6.5s.
Code: Pascal  [Select]
  1. var
  2.   t: QWORD;
  3. ...
  4.   t := a.bs[i] and b.bs[i];
  5.   if t <> 0 then t := PopCnt(t);
  6.   Inc(c, t);

Thanks @ccrause , @ASerge.
Title: Re: [CLOSED] Critical question about RTL performance
Post by: PascalDragon on June 25, 2019, 09:46:13 am
I think you should open a bug that the compiler does not inline the popcnt command for processors that have this capability.
There is no bug. The default CPU type for x86_64 is Athon64 and those don't have POPCNT thus the generic code will be used. If the CPU type is set to something newer then the compiler will happily insert the instruction.
Title: Re: [CLOSED] Critical question about RTL performance
Post by: julkas on June 25, 2019, 10:26:13 am
I think you should open a bug that the compiler does not inline the popcnt command for processors that have this capability.
There is no bug. The default CPU type for x86_64 is Athon64 and those don't have POPCNT thus the generic code will be used. If the CPU type is set to something newer then the compiler will happily insert the instruction.
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
Title: Re: [CLOSED] Critical question about RTL performance
Post by: Thaddy on June 25, 2019, 10:33:46 am
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
The comparison test is a single cycle on most cpu's , popcnt takes at least three, so testing for zero optimizes a little here.
But as PascalDragon ( and I too ) already wrote: set the CPU to something less conservative.
Title: Re: [CLOSED] Critical question about RTL performance
Post by: julkas on June 25, 2019, 10:41:21 am
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
The comparison test is a single cycle on most cpu's , popcnt takes at least three, so testing for zero optimizes a little here.
Yes - a little, but not from 15s to 6.5s. It's impossible.
Title: Re: [CLOSED] Critical question about RTL performance
Post by: MathMan on June 25, 2019, 01:50:13 pm
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
The comparison test is a single cycle on most cpu's , popcnt takes at least three, so testing for zero optimizes a little here.
Yes - a little, but not from 15s to 6.5s. It's impossible.

Hi julkas - from my perspective it is very well possible, as the difference simply relates to the number of 0s detected. You seem to calculate the hamming distance over two arrays. If the arrays are held on heap and are not preset with random values prior to calculation then the run-time will be erratic, depending on the values found in the uninitialized heap. If your code uses variant arrays which get sized but again not preset prior to calculation then the array content is explicitely 0. There are other, more theoretical, explanations on the runtime difference.
Title: Re: [CLOSED] Critical question about RTL performance
Post by: julkas on June 25, 2019, 02:48:24 pm
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
The comparison test is a single cycle on most cpu's , popcnt takes at least three, so testing for zero optimizes a little here.
Yes - a little, but not from 15s to 6.5s. It's impossible.

Hi julkas - from my perspective it is very well possible, as the difference simply relates to the number of 0s detected. You seem to calculate the hamming distance over two arrays. If the arrays are held on heap and are not preset with random values prior to calculation then the run-time will be erratic, depending on the values found in the uninitialized heap. If your code uses variant arrays which get sized but again not preset prior to calculation then the array content is explicitely 0. There are other, more theoretical, explanations on the runtime difference.
@MathMan thank you for reply.
But, I have only one line of code with RTL PopCnt - time is 15s
Code: Pascal  [Select]
  1. Inc(c, PopCnt(a.bs[i] and b.bs[i]));
I replace call to RTL PopCnt with my own pure Pascal (NOT ASM) function PopCount - now time is 7s.
Code: Pascal  [Select]
  1. function PopCount(x: QWORD): SizeInt; inline;
  2. begin
  3.   Result := 0;
  4.   while x <> 0 do
  5.   begin
  6.     Inc(Result);
  7.     x := x and (x - 1);
  8.   end;
  9. end;
  10. ...
  11. Inc(c, PopCount(a.bs[i] and b.bs[i]));
Is it normal behavoir ? And finally, RTL PopCnt with if t <> 0 then t := PopCnt(t); test runs as my PopCount version (without if test).
If test excludes zero value call (PopCnt(0) = 0)!
Thanks.
Title: Re: [CONTINUE-Possible bug in RTL PopCnt] - Critical question about RTL performance
Post by: MathMan on June 25, 2019, 03:03:05 pm
Hi julkas

No you do not only have one line of code (you popcnt function) which is relevant for your measurement. You must have some surrounding code where you initialize the the arrays a & b, do the popcnt loop over the arrays etc.

Without having seen this I can not comment on whether this is "normal" behaviour or not. Pls. show the complete, compilable program of your popcnt test.

BTW - this was indeed the first question to you at the beginning of this thread.

MathMan
Title: Re: [CONTINUE-Possible bug in RTL PopCnt] - Critical question about RTL performance
Post by: julkas on June 26, 2019, 10:16:25 am
Hi julkas

No you do not only have one line of code (you popcnt function) which is relevant for your measurement. You must have some surrounding code where you initialize the the arrays a & b, do the popcnt loop over the arrays etc.

Without having seen this I can not comment on whether this is "normal" behaviour or not. Pls. show the complete, compilable program of your popcnt test.

BTW - this was indeed the first question to you at the beginning of this thread.

MathMan
Hi, @MathMan!
Link to the problem - https://www.spoj.com/problems/ADAFUROW/.

Regards.



Title: Re: [CONTINUE-Possible bug in RTL PopCnt] - Critical question about RTL performance
Post by: MathMan on June 26, 2019, 12:59:57 pm
Hi, @MathMan!
Link to the problem - https://www.spoj.com/problems/ADAFUROW/.

Regards.

Well, thanks. Unfortunately this doesn't help me to help you with your problem. Again - I can only judge if there is a real issue or everythings behaves as expected if you show me your compilable sources.

Regards,
MathMan
Title: Re: [CONTINUE-Possible bug in RTL PopCnt] - Critical question about RTL performance
Post by: julkas on June 26, 2019, 03:53:44 pm
Hi, @MathMan!
Link to the problem - https://www.spoj.com/problems/ADAFUROW/.

Regards.

Well, thanks. Unfortunately this doesn't help me to help you with your problem. Again - I can only judge if there is a real issue or everythings behaves as expected if you show me your compilable sources.

Regards,
MathMan

Thanks @MathMan. I solved this problem in C and Pascal. Link is just for reference.
So, you know my approach and you can try solve it online or generate your own test cases (random or not) for this problem and check PopCnt performance (if you want).