Recent

Author Topic: [CLOSED-Possible bug in RTL PopCnt] - Critical question about RTL performance  (Read 5212 times)

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Same code on same platform - Intel Xeon E3-1220 v5 (FPC 3.0.0 - I can't change compiler!)
Using RTL PopCnt - 15s.
Using ASM GetBitCount - 2.3s.
Please advice!
Thanks.
Code: Pascal  [Select]
  1. {$ASMMODE INTEL}
  2. function GetBitCount(num: QWORD): integer; assembler; nostackframe;
  3. asm
  4.   POPCNT rax, num
  5. end;
  6.  
P.S. Same algo in pure CLANG using __builtin_popcountll - 1.9s
« Last Edit: June 26, 2019, 03:38:56 pm by julkas »
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

Laksen

  • Hero Member
  • *****
  • Posts: 621
    • J-Software
Re: Critical question about RTL performance
« Reply #1 on: June 23, 2019, 03:59:17 pm »
Show us your code please  :)

Thaddy

  • Hero Member
  • *****
  • Posts: 9187
Re: Critical question about RTL performance
« Reply #2 on: June 23, 2019, 03:59:58 pm »
Popcnrt is already there, no need for asm. https://www.freepascal.org/docs-html/rtl/system/popcnt.html Docs are there to help you..... There is no slowdown except in -O-
« Last Edit: June 23, 2019, 04:01:40 pm by Thaddy »
also related to equus asinus.

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Critical question about RTL performance
« Reply #3 on: June 23, 2019, 04:06:38 pm »
Popcnrt is already there, no need for asm. https://www.freepascal.org/docs-html/rtl/system/popcnt.html Docs are there to help you..... There is no slowdown.
Yes, exactly. If I use PopCnt RTL function - my algo runs in 15s.
With ASM - 2.3s.
What do you mean - There is no slowdown?
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Critical question about RTL performance
« Reply #4 on: June 23, 2019, 04:24:14 pm »
Show us your code please  :)
One line, one place change PopCnt to GetBitCount.
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

ccrause

  • Full Member
  • ***
  • Posts: 185
Re: Critical question about RTL performance
« Reply #5 on: June 23, 2019, 04:29:55 pm »
On my laptop (Celeron N3050 with popcnt) the compiler by default fall back to the generic fpc_popcnt_xxx function.  If I specify -CpCOREI when compiling the test the popcnt instruction is generated.  It seems that the compiler classify my CPU as ATHLON64.

Thaddy

  • Hero Member
  • *****
  • Posts: 9187
Re: Critical question about RTL performance
« Reply #6 on: June 23, 2019, 04:39:01 pm »
specify -CfAVX or -CfAVX2 and -CpCOREAVX2
FPC is conservative so the default is AMD64
« Last Edit: June 23, 2019, 04:42:25 pm by Thaddy »
also related to equus asinus.

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Critical question about RTL performance
« Reply #7 on: June 23, 2019, 05:49:43 pm »
I can't change, modify compiler - linker options (are predefined). Can it be done in-line (in source code)?
Thanks.
« Last Edit: June 23, 2019, 05:57:55 pm by julkas »
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

Thaddy

  • Hero Member
  • *****
  • Posts: 9187
Re: Critical question about RTL performance
« Reply #8 on: June 23, 2019, 06:05:25 pm »
No. You need the compiler options. Why can't you change them?
In Lazarus you can add it to the compile options.
(Although such tests should be done with the raw compiler!!!)
« Last Edit: June 23, 2019, 06:07:04 pm by Thaddy »
also related to equus asinus.

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Critical question about RTL performance
« Reply #9 on: June 23, 2019, 06:12:46 pm »
No. You need the compiler options. Why can't you change them?
In Lazarus you can add it to the compile options.
(Although such tests should be done with the raw compiler!!!)
Because I am talking about online judge.
P.S. My code runs in 7s with soft PopCount bellow. What's going on?
Code: Pascal  [Select]
  1. function PopCount(x: QWORD): SizeInt; inline;
  2. begin
  3.   Result := 0;
  4.   while x <> 0 do
  5.   begin
  6.     Inc(Result);
  7.     x := x and (x - 1);
  8.   end;
  9. end;
« Last Edit: June 24, 2019, 11:29:03 am by julkas »
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

ASerge

  • Hero Member
  • *****
  • Posts: 1411
Re: [CLOSED] Critical question about RTL performance
« Reply #10 on: June 24, 2019, 03:51:21 pm »
Same code on same platform - Intel Xeon E3-1220 v5 (FPC 3.0.0 - I can't change compiler!)
Using RTL PopCnt - 15s.
Using ASM GetBitCount - 2.3s.
Please advice!
Thanks.
Code: Pascal  [Select]
  1. {$ASMMODE INTEL}
  2. function GetBitCount(num: QWORD): integer; assembler; nostackframe;
  3. asm
  4.   POPCNT rax, num
  5. end;
  6.  
I think you should open a bug that the compiler does not inline the popcnt command for processors that have this capability.
And your function better write as:
Code: Pascal  [Select]
  1. {$ASMMODE INTEL}
  2. function GetBitCount(const Num: QWord): QWord; assembler; nostackframe;
  3. asm
  4.   popcnt @Result, Num
  5. end;
  6.  

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: [CONTINUE] Critical question about RTL performance
« Reply #11 on: June 24, 2019, 07:21:07 pm »
BsfQWORD, BsrQWORD are working as expected at full speed (same environment). Something going wrong with PopCnt.
BTW PopCnt is overloaded, Bsf, Bsr family - not.
Here is original call to RTL PopCnt - time ~ 15s.
Code: Pascal  [Select]
  1. Inc(c, PopCnt(a.bs[i] and b.bs[i]));

I made little change, same time
Code: Pascal  [Select]
  1. var
  2.   t: QWORD;
  3. ...
  4.   t := a.bs[i] and b.bs[i];
  5.   t := PopCnt(t);
  6.   Inc(c, t);

And now black magic - another change. Time ~ 6.5s.
Code: Pascal  [Select]
  1. var
  2.   t: QWORD;
  3. ...
  4.   t := a.bs[i] and b.bs[i];
  5.   if t <> 0 then t := PopCnt(t);
  6.   Inc(c, t);

Thanks @ccrause , @ASerge.
« Last Edit: June 25, 2019, 10:16:49 am by julkas »
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

PascalDragon

  • Hero Member
  • *****
  • Posts: 674
  • Compiler Developer
Re: [CLOSED] Critical question about RTL performance
« Reply #12 on: June 25, 2019, 09:46:13 am »
I think you should open a bug that the compiler does not inline the popcnt command for processors that have this capability.
There is no bug. The default CPU type for x86_64 is Athon64 and those don't have POPCNT thus the generic code will be used. If the CPU type is set to something newer then the compiler will happily insert the instruction.

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: [CLOSED] Critical question about RTL performance
« Reply #13 on: June 25, 2019, 10:26:13 am »
I think you should open a bug that the compiler does not inline the popcnt command for processors that have this capability.
There is no bug. The default CPU type for x86_64 is Athon64 and those don't have POPCNT thus the generic code will be used. If the CPU type is set to something newer then the compiler will happily insert the instruction.
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

Thaddy

  • Hero Member
  • *****
  • Posts: 9187
Re: [CLOSED] Critical question about RTL performance
« Reply #14 on: June 25, 2019, 10:33:46 am »
May be. But this does not explain strange behavior with if t <> 0 then t := PopCnt(t);
The comparison test is a single cycle on most cpu's , popcnt takes at least three, so testing for zero optimizes a little here.
But as PascalDragon ( and I too ) already wrote: set the CPU to something less conservative.
also related to equus asinus.