Free Pascal => General => Topic started by: Okoba on May 14, 2022, 08:21:56 am

Title: Optimizing the counter code
Post by: Okoba on May 14, 2022, 08:21:56 am

Hello,

I have some simple code that checks the values in an array.
The first test does everything in one go, but I want to update the counter, skip some values, and report status.
If I update "I" inside the function itself, everything is fine, but if I call another function to do the check, even an inline one, FPC does not optimize the code because of the "var" parameter.
I would like to know whether this is expected behaviour from FPC or a missed optimization.

This is a very simplified version of the problem. In the real code, I need to change the value of I depending on the results, so I cannot do it in the main function and need to call another function for the sake of maintainability.

Tested on Trunk FPC for Win64.

Code: Pascal

program project1;
 
uses
  Classes,
  SysUtils;
 
  function Test1(const Value: array of Integer): Int64;
  var
    I: Integer;
  begin
    Result := 0;
 
    I := 0;
    repeat
      repeat
        if Odd(Value[I]) then
          Result += 1;
        I += 1;
      until I = Length(Value);
    until True;
  end;
 
  function Check(var I: Integer): Boolean; inline;
  begin
    Result := True;
  end;
 
  function Test2(const Value: array of Integer): Int64;
  var
    I: Integer;
  begin
    Result := 0;
 
    I := 0;
    repeat
      repeat
        if Odd(Value[I]) then
          Result += 1;
        I += 1;
      until I = Length(Value);
    until Check(I);
  end;
 
  function Test3(const Value: array of Integer): Int64;
  var
    I, TempI: Integer;
  begin
    Result := 0;
 
    I := 0;
    repeat
      TempI := I;
      repeat
        if Odd(Value[TempI]) then
          Result += 1;
        TempI += 1;
      until TempI = Length(Value);
      I := TempI;
    until Check(I);
  end;
 
var
  Value: array of Integer;
  Tick: QWord;
begin
  SetLength(Value, 1000000000);
 
  Tick := GetTickCount64;
  Test1(Value);
  WriteLn(GetTickCount64 - Tick);
 
  Tick := GetTickCount64;
  Test2(Value);
  WriteLn(GetTickCount64 - Tick);
 
  Tick := GetTickCount64;
  Test3(Value);
  WriteLn(GetTickCount64 - Tick);
 
  ReadLn;
end.
 

In this test, Test2 is two times slower because FPC does not use a register for I. In Test3, I introduced a temporary variable to trick FPC into using a register.

Title: Re: Optimizing the counter code
Post by: jamie on May 14, 2022, 04:04:23 pm

You could use "If Value and 1 <> 0 Then" instead of the intrinsic ODD function.

In a simple test using that logic with the older FPC 3.0.4, the compiler generates a TEST instruction after the AND instruction.

That extra instruction isn't needed, because the AND instruction already sets the flags that the following branch can use.

With FPC 3.2.0, a simple AND test like that no longer generates the extra TEST instruction, but the intrinsic ODD still does.

Code: ASM

000000010002C9A0 488d6424d8               lea    -0x28(%rsp),%rsp
unit1.pas:36                              If V and 1 <> 0 THen beep;
000000010002C9A5 83e001                   and    $0x1,%eax
000000010002C9A8 85c0                     test   %eax,%eax
000000010002C9AA 7405                     je     0x10002c9b1 <BUTTON1CLICK+17>
 

Notice the unneeded TEST instruction.

With 3.2.0 this still happens with the ODD function.
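
For reference, the two forms discussed above can be compared with a minimal sketch like the following. The program name and loop range are purely illustrative. Note that in Pascal `and` binds more tightly than `<>`, so `V and 1 <> 0` is parsed as `(V and 1) <> 0`; the parentheses are added here only for readability.

```pascal
program OddVsAnd;
{$mode objfpc}

var
  V: Integer;
begin
  // Both columns should agree for every value, including negative ones,
  // because Odd() and the bit test both look at the lowest bit.
  for V := -3 to 3 do
    WriteLn(V:3, '  Odd(V) = ', Odd(V), '  (V and 1) <> 0 = ', (V and 1) <> 0);
end.
```

Which form produces the shorter instruction sequence depends on the compiler version, as described above, so the generated assembly has to be checked for the FPC version actually in use.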

Title: Re: Optimizing the counter code
Post by: Okoba on May 15, 2022, 07:09:29 am

Yes, checking for oddness can be done faster, but the problem is the speed of the loop, not Odd. Checking for odd values is just something to do in the test code.

Title: Re: Optimizing the counter code
Post by: 440bx on May 15, 2022, 08:18:41 am

Quote from: Okoba on May 14, 2022, 08:21:56 am

but if I call another function to do the check, even an inline one, FPC does not optimize the code because of the "var" parameter.
I would like to know whether this is expected behaviour from FPC or a missed optimization.

Inspection of the generated assembly code shows that the function call is not the culprit.

The problem is related to register allocation. For whatever reason, the compiler decided not to use a register for I, but it did use one for TempI. The additional memory accesses seem to be the reason Test2 is slower.

Here is the annotated assembly code:

Code: ASM

.section .text.n_p$project1_$$_test3$array_of_longint$$int64,"x"   .section .text.n_p$project1_$$_test2$array_of_longint$$int64,"x"
        .balign 16,0x90                                                 .balign 16,0x90
.globl  P$PROJECT1_$$_TEST3$array_of_LONGINT$$INT64                .globl       P$PROJECT1_$$_TEST2$array_of_LONGINT$$INT64
P$PROJECT1_$$_TEST3$array_of_LONGINT$$INT64:                       P$PROJECT1_$$_TEST2$array_of_LONGINT$$INT64:
.Lc13:                                                             .Lc8:
.seh_proc P$PROJECT1_$$_TEST3$array_of_LONGINT$$INT64              .seh_proc P$PROJECT1_$$_TEST2$array_of_LONGINT$$INT64
.Ll22:                                                             .Ll13:
# [49] begin                                                       # [33] begin
        pushq   %rbp                                                    pushq   %rbp
.seh_pushreg %rbp                                                  .seh_pushreg %rbp
.Lc15:                                                             .Lc10:
.Lc16:                                                             .Lc11:
        movq    %rsp,%rbp                                               movq    %rsp,%rbp
.Lc17:                                                             .Lc12:
        leaq    -16(%rsp),%rsp                                          leaq    -16(%rsp),%rsp
.seh_stackalloc 16                                                 .seh_stackalloc 16
# Var Value located in register rcx                                # Var Value located in register rcx
# Var $highVALUE located in register rdx                           # Var $highVALUE located in register rdx
# Var $result located in register rax                              # Var $result located in register rax
# Var TempI located in register r9
.seh_endprologue                                                   .seh_endprologue
# Var I located at rbp-8, size=OS_S64                              # Var I located at rbp-8, size=OS_S64
# Var $result located in register rax                              # Var $result located in register rax
.Ll23:                                                             .Ll14:
# [50] Result := 0;                                                # [34] Result := 0;
        movq    $0,%rax                                                 movq    $0,%rax
.Ll24:                                                             .Ll15:
# [52] I := 0;                                                     # [36] I := 0;
        movq    $0,-8(%rbp)                                             movq    $0,-8(%rbp)
.Ll25:                                                             .Lj40:
# [54] TempI := I;                                                 .Ll16:
        movq    -8(%rbp),%r9  { move I to r9 }
.Lj66:
.Ll26:
# [56] if Odd(Value[TempI]) then                                   # [39] if Odd(Value[I]) then
                                                                        movq    -8(%rbp),%r8       { move I to r8 }
        leaq    (%rcx,%r9,4),%r8                                        movl    (%rcx,%r8,4),%r8d
        movl    (%r8),%r8d
        andl    $1,%r8d                                                 andl    $1,%r8d
        testb   %r8b,%r8b                                               testb   %r8b,%r8b
        je      .Lj70                                                   je      .Lj44
.Ll27:                                                             .Ll17:
# [57] Result += 1;                                                # [40] Result += 1;
        leaq    1(%rax),%r8                                             leaq    1(%rax),%r8
        movq    %r8,%rax                                                movq    %r8,%rax
.Lj70:                                                             .Lj44:
.Ll28:                                                             .Ll18:
# [58] TempI += 1;                                                 # [41] I += 1;
                                                                        movq    -8(%rbp),%r8       { additional move }
        leaq    1(%r9),%r8                                              leaq    1(%r8),%r8
        movq    %r8,%r9                                                 movq    %r8,-8(%rbp)       { slower - 1 }
.Ll29:                                                             .Ll19:
# [59] until TempI = Length(Value);                                # [42] until I = Length(Value);
        leaq    1(%rdx),%r8                                             leaq    1(%rdx),%r8
        cmpq    %r9,%r8                                                 cmpq    -8(%rbp),%r8       { slower }
        jne     .Lj66                                                   jne     .Lj40
.Ll30:                                                             .Ll20:
# [60] I := TempI;
        movq    %r9,-8(%rbp)   { match 1 }
.Ll31:
# [62] end;                                                        # [44] end;
        leaq    (%rbp),%rsp                                             leaq    (%rbp),%rsp  { no call to Check since it does nothing }
        popq    %rbp                                                    popq    %rbp
        ret                                                             ret
.seh_endproc                                                       .seh_endproc
.Lc14:
.Lt5:
.Ll32:
 
 

Note: in the program I changed the local variables from Integer to ptrint to see if that would make a difference. The assembly code shown above is the result of optimization -O4.

Notice that the compiler completely eliminated the call to the Check function. Not only did it inline the call, it removed it entirely, as it should, since Check always returns TRUE and therefore has no effect on the "repeat" statement that calls it.

HTH.

Title: Re: Optimizing the counter code
Post by: jamie on May 15, 2022, 12:11:45 pm

Hmm

Don't you think it's strange that the compiler generates a redundant TEST instruction following an AND instruction?

The AND instruction already sets the flags, so why is the result being tested again?

If you write the AND directly in your own code, newer versions drop the TEST and only the AND remains, but if you use the ODD function the redundant TEST instruction is added back.

I know the Intel processor leaves a lot to be desired when it comes to setting flags after an operation. It's not like the old 6502 and related processors, where almost every instruction set a flag afterwards so a simple branch could be used immediately, which saves space and processing time.

Title: Re: Optimizing the counter code
Post by: Okoba on May 15, 2022, 02:27:50 pm

I think it is an optimization opportunity (especially as I tested with gcc and it optimizes both cases), but I welcome more input.

Title: Re: Optimizing the counter code
Post by: BrunoK on May 15, 2022, 04:53:04 pm

FPC 3.2.2 Win_x86_64

I have the same time discrepancy using -O1 -OoREGVAR.

Something seems wrong with register allocation in Test2: "I" is accessed at its temporary location on the stack.
In Test1 and Test3, "I" and "TempI" respectively are register cached/allocated for the repeat loop.

Given the big difference in timings (more than 1:2), it might be a good idea to have this point checked by someone qualified (and possibly dismissed, if something wrong can be pointed out in the code).

Title: Re: Optimizing the counter code
Post by: Okoba on May 15, 2022, 05:55:12 pm

In this case it is 2X slower, but in my real use case (which I simplified for this forum post to show the point), it slows down the code 4X.
So it seems to be a real, clear optimization opportunity.

Title: Re: Optimizing the counter code
Post by: PascalDragon on May 16, 2022, 02:09:48 pm

The problem is likely the following (not verified): because of the var parameter of the Check function, the compiler needs to be able to take the address of the I variable, and it sets a flag that prevents the variable from residing in a register. The code of the Check function is then inlined and the need for taking the address is gone, but the flag that prohibits the register optimization is not reset, so worse code is generated.

Please report a bug.
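
Until that is fixed, one possible way to sidestep the address-taken flag is to avoid the var parameter altogether. The following is only a sketch under that assumption: `NextIndex` and `CountOdd` are hypothetical names that do not appear in the original program, and whether FPC then keeps I in a register has not been verified here and should be confirmed in the generated assembly.

```pascal
program NoVarParam;
{$mode objfpc}

// Hypothetical replacement for Check: takes I by value and returns the
// index to continue from, or -1 when the scan is finished.
function NextIndex(I, Count: Integer): Integer; inline;
begin
  if I >= Count then
    Result := -1
  else
    Result := I;
end;

function CountOdd(const Value: array of Integer): Int64;
var
  I: Integer;
begin
  Result := 0;
  if Length(Value) = 0 then
    Exit;

  I := 0;
  repeat
    repeat
      if Odd(Value[I]) then
        Inc(Result);
      Inc(I);
    until I = Length(Value);
    // I is passed by value, so its address is never taken.
    I := NextIndex(I, Length(Value));
  until I < 0;
end;

var
  Data: array of Integer;
  K: Integer;
begin
  SetLength(Data, 10);
  for K := 0 to High(Data) do
    Data[K] := K;
  WriteLn(CountOdd(Data)); // the odd values in 0..9 are 1, 3, 5, 7, 9
end.
```

The design trade-off is that the helper can return only one value, so any extra status it needs to report has to be encoded in the result (here, a negative index means "done") or returned in a record.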

Title: Re: Optimizing the counter code
Post by: Okoba on May 17, 2022, 03:29:00 pm

Done: https://gitlab.com/freepascal.org/fpc/source/-/issues/39725

Title: Re: Optimizing the counter code
Post by: BrunoK on May 17, 2022, 05:55:45 pm

A commented program that tries to demonstrate the problem. Compile it with -O1 -OoREGVAR.

Code: Pascal

program pgmLoopReg220517C;
 
{ https://forum.lazarus.freepascal.org/index.php/topic,59340.0.html }
 
uses
  Classes, SysUtils;
 
{ FPC 3.2.2 Win_x86_64 (also in Win_i386)
 
  Compiling using
  -O0 -OoREGVAR (no optimization + optimize for register variables) optimizes
                effectively for both Test3 and Test4 register allocation, except "I"
                is not register cached for both methods.
                Interestingly, in this case Test4 is much faster than when
                compiled with -O1 -OoREGVAR but generated code is not the same
                for I := I + 1 (and most of the rest of the code).
  -O1 -OoREGVAR Same register allocations for Test3 and Test4 but I := I + 1
                generates "incl   -0x4(%rbp)" that seems incomprehensibly and
                dramatically slow.
                Test4 takes >2 times Test3 time for a basically simple
                (and common looking) loop.
 
  Removing the var in Check() makes "I" fully register cached.
}
 
  function Check(var I: integer): boolean;
  begin
    Result := True;
  end;
 
  function Test3(const Value: array of integer): integer;
  var
    I, TempI: integer;
  begin
    Result := 0;
 
    I := 0;
    repeat
      TempI := I;
      repeat
        Inc(Result);
        if Value[I] <> I then // Do something apparently useful
          Exit;
        TempI += 1;
        // I := I + 1; // Slower
        I := TempI; // Faster
      until I = Length(Value);
    until Check(I); // until True;
  end;
 
  function Test4(const Value: array of integer): integer;
  var
    I, TempI: integer;
    lTmpInt : integer = 1;
  begin
    Result := 0;
 
    I := 0;
    repeat
      TempI := I;
      repeat
        Inc(Result);
        if Value[I] <> I then // Do something apparently useful
          Exit;
        TempI += 1;
        I := I + 1;    // Slower
        // I := TempI; // Faster
      until I = Length(Value);
    until Check(I); // until True; makes it as fast as test3
  end;
 
var
  Value: array of integer;
  Tick: QWord;
  vRes: integer;
  i: integer;
begin
  SetLength(Value, 100000000{0});
  for i := 0 to Length(Value) - 1 do // Init array elements
    Value[i] := i;
 
  Tick := GetTickCount64;
  vRes := Test3(Value);
  WriteLn('Faster Test3 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
 
  Tick := GetTickCount64;
  vRes := Test4(Value);
  WriteLn('Slow   Test4 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
 
  ReadLn;
end.

Test3 and Test4 side by side:

Code: ASM

{ Test3 }                                 { Test4 }
begin                                     begin
push   %rbp                               push   %rbp
mov    %rsp,%rbp                          mov    %rsp,%rbp
lea    -0x50(%rsp),%rsp                   lea    -0x50(%rsp),%rsp
mov    %rbx,-0x28(%rbp)                   mov    %rbx,-0x28(%rbp)
mov    %rdi,-0x20(%rbp)                   mov    %rdi,-0x20(%rbp)
mov    %rsi,-0x18(%rbp)                   mov    %rsi,-0x18(%rbp)
mov    %r12,-0x10(%rbp)                   mov    %r12,-0x10(%rbp)
mov    %rcx,%rbx                          mov    %rcx,%rbx
mov    %rdx,%rsi                          mov    %rdx,%rsi
Result := 0;                              Result := 0;
xor    %r12d,%r12d                        xor    %r12d,%r12d
I := 0;                                   I := 0;
movl   $0x0,-0x4(%rbp)                    movl   $0x0,-0x4(%rbp)
TempI := I;                               TempI := I;
mov    -0x4(%rbp),%edi                    mov    -0x4(%rbp),%edi
Inc(Result);                              Inc(Result);
inc    %r12d                              inc    %r12d
if Value[I] <> I then                     if Value[I] <> I then
movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
mov    (%rbx,%rax,4),%eax                 mov    (%rbx,%rax,4),%eax
cmp    -0x4(%rbp),%eax                    cmp    -0x4(%rbp),%eax
jne    0x10000178a <TEST3+90>             jne    0x10000180a <TEST4+90>
TempI += 1;                               TempI += 1;
inc    %edi                               inc    %edi
I := TempI; // Faster                     I := I + 1; // Slower
mov    %edi,-0x4(%rbp)                    incl   -0x4(%rbp)
until I = Length(Value);                  until I = Length(Value);
movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
lea    0x1(%rsi),%rdx                     lea    0x1(%rsi),%rdx
cmp    %rdx,%rax                          cmp    %rdx,%rax
jne    0x10000175c <TEST3+44>             jne    0x1000017dc <TEST4+44>
until Check(I); // until True;            until Check(I); // until True; makes it as fast as test3
lea    -0x4(%rbp),%rcx                    lea    -0x4(%rbp),%rcx
callq  0x100001720 <CHECK>                callq  0x100001720 <CHECK>
test   %al,%al                            test   %al,%al
je     0x100001759 <TEST3+41>             je     0x1000017d9 <TEST4+41>
end;                                      end;
mov    %r12d,%eax                         mov    %r12d,%eax
mov    -0x28(%rbp),%rbx                   mov    -0x28(%rbp),%rbx
mov    -0x20(%rbp),%rdi                   mov    -0x20(%rbp),%rdi
mov    -0x18(%rbp),%rsi                   mov    -0x18(%rbp),%rsi
mov    -0x10(%rbp),%r12                   mov    -0x10(%rbp),%r12
lea    0x0(%rbp),%rsp                     lea    0x0(%rbp),%rsp
pop    %rbp                               pop    %rbp
retq                                      retq
 

Title: Re: Optimizing the counter code
Post by: BrunoK on May 17, 2022, 06:06:24 pm

As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

Title: Re: Optimizing the counter code
Post by: PascalDragon on May 18, 2022, 09:07:24 am

Quote from: BrunoK on May 17, 2022, 06:06:24 pm

As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.

Title: Re: Optimizing the counter code
Post by: BrunoK on May 18, 2022, 11:01:33 am

Quote from: PascalDragon on May 18, 2022, 09:07:24 am

It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.

I was so much expecting push's and pop's that I didn't think about moving to and restoring from extended stack space ... (sigh)

Title: Re: Optimizing the counter code
Post by: MathMan on May 18, 2022, 12:58:45 pm

@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only handled gracefully since xxxLake), as it generates a stall in the instruction pipeline. Test3 instead does a write/read to -0x4(%rbp), which is handled OK by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Cheers,
MathMan

Title: Re: Optimizing the counter code
Post by: BrunoK on May 18, 2022, 03:39:06 pm

@MathMan
Replacing

Code: Pascal

        I := I + 1;

with

Code: Pascal

        TempI := I;
        Inc(TempI);
        I := TempI;

makes Test4 nearly as fast as Test3.

It's as if %rax (or %eax) would 'overheat' from being used :-)

Title: Re: Optimizing the counter code
Post by: PascalDragon on May 19, 2022, 08:52:17 am

Quote from: MathMan on May 18, 2022, 12:58:45 pm

@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only handled gracefully since xxxLake), as it generates a stall in the instruction pipeline. Test3 instead does a write/read to -0x4(%rbp), which is handled OK by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

Title: Re: Optimizing the counter code
Post by: MathMan on May 19, 2022, 12:24:52 pm

Quote from: PascalDragon on May 19, 2022, 08:52:17 am

Quote from: MathMan on May 18, 2022, 12:58:45 pm
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only handled gracefully since xxxLake), as it generates a stall in the instruction pipeline. Test3 instead does a write/read to -0x4(%rbp), which is handled OK by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

Title: Re: Optimizing the counter code
Post by: PascalDragon on May 19, 2022, 01:45:16 pm

Quote from: MathMan on May 19, 2022, 12:24:52 pm

I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

Please note that Okoba already reported the initial problem here (https://gitlab.com/freepascal.org/fpc/source/-/issues/39725). So you might want to add your observations there instead of adding a completely new bug report.

Title: Re: Optimizing the counter code
Post by: BrunoK on May 19, 2022, 04:05:15 pm

A simplified project that keeps what I think is relevant.
It has three slightly different Test routines that show alternatives.

Timings Win64 :

Code:

                   ------------ -O1 -OoREGVAR------------------ 
                 Ticks  Iterations                       Timings
 Slow   Test2 :    219    99999999                    223.007 ms
Fastest Test3 :     31    99999999                     31.585 ms
Faster  Test4 :     63    99999999                     60.238 ms

Title: Re: Optimizing the counter code
Post by: marcov on May 19, 2022, 05:24:31 pm

I tried with FPC 3.2.2 and trunk for win32. In both cases test2 and test4 are about equal, while test3 is about 3.5 times faster on an old Ivy Bridge (core i7 3770), which is not a -lake.

Code:

3.3.1:
     Slow   Test2 :     15     9999999                     18,255 ms
    Fastest Test3 :      0     9999999                      5,339 ms
    Faster  Test4 :     16     9999999                     18,452 ms

Code:

3.2.2:
     Slow   Test2 :     15     9999999                     18,297 ms
    Fastest Test3 :      0     9999999                      5,390 ms
    Faster  Test4 :     16     9999999                     18,520 ms

Title: Re: Optimizing the counter code
Post by: Okoba on May 19, 2022, 05:51:33 pm

Results on Trunk FPC and Lazarus, on i9-9900K

Quote

-O0 -OoREGVAR
Slow Test2 : 16 9999999 16.528 ms
Fastest Test3 : 0 9999999 8.424 ms
Faster Test4 : 15 9999999 15.142 ms

-O1 -OoREGVAR
Slow Test2 : 15 9999999 18.844 ms
Fastest Test3 : 0 9999999 4.232 ms
Faster Test4 : 16 9999999 15.830 ms

-O2 -OoREGVAR
Slow Test2 : 15 9999999 18.834 ms
Fastest Test3 : 0 9999999 4.430 ms
Faster Test4 : 16 9999999 16.207 ms

-O3 -OoREGVAR
Slow Test2 : 16 9999999 18.702 ms
Fastest Test3 : 0 9999999 4.311 ms
Faster Test4 : 15 9999999 15.903 ms

-O4 -OoREGVAR
Slow Test2 : 16 9999999 18.435 ms
Fastest Test3 : 0 9999999 4.333 ms
Faster Test4 : 15 9999999 15.913 ms

Lazarus Default Release Mode -OoREGVAR
Slow Test2 : 16 9999999 18.525 ms
Fastest Test3 : 0 9999999 2.695 ms
Faster Test4 : 15 9999999 16.014 ms

Lazarus Default Release Mode
Slow Test2 : 16 9999999 19.037 ms
Fastest Test3 : 0 9999999 2.696 ms
Faster Test4 : 15 9999999 16.096 ms

Title: Re: Optimizing the counter code
Post by: BrunoK on May 19, 2022, 06:37:48 pm

My system is
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 2.42 GHz

These benchmarking issues are always very frustrating (it was easier in the dual-core era).

I downloaded and retested the application I posted on this forum, and Test4 is still much faster than Test2.

Was there a bug in the application I published?

Title: Re: Optimizing the counter code
Post by: MathMan on May 20, 2022, 11:55:02 am

@marcov:
Your test made me unsure whether I had been tricked by a false memory about the issue in general, so I re-read the official Intel assembler optimization guide. The issue does exist (as I remembered): it is mentioned as the "dense RMW issue" for the Sandy Bridge architecture (and earlier). The point is that the Loop Stream Detector has issues with the number of micro-ops generated by these instruction types in dense loops.

It was however lifted (if I understood the docs correctly, they are a bit vague in that respect) from Haswell onwards at the latest, maybe even with Ivy Bridge.

@Okoba:
Thanks for testing again. What puzzles me is that Lazarus default release mode generates faster code than e.g. -O4 -OoREGVAR. Did you compile with e.g. range checks enabled when not in Lazarus default release mode?

@BrunoK:
Could it be that you are testing on a laptop? If so, benchmarks can vary a lot unless they are done with extreme precaution, such as fixing the clock, binding to a specific core, executing excessive initial warm-up code, etc.

Takeaway: I'll do some experiments on my systems (one Nehalem, one Skylake) and see if I can pin this down to a reproducible case. If so, I'll add this to Okoba's case in the bug tracker. Otherwise I'll do some heavy backpedalling to save face ;)

Cheers,
MathMan
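
Part of the precautions mentioned above (a warm-up pass and repeated runs) can be sketched in Pascal itself. This is only an illustration: `BestOf`, `TTestFunc` and the stand-in workload `CountOdd` are hypothetical names, not code from the project discussed in this thread, and fixing the clock or binding to a core has to be done outside the program.

```pascal
program BenchSketch;
{$mode objfpc}

uses
  SysUtils;

type
  TTestFunc = function(const Value: array of Integer): Int64;

// Stand-in workload; in this thread it would be Test2, Test3 or Test4.
function CountOdd(const Value: array of Integer): Int64;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(Value) do
    if Odd(Value[I]) then
      Inc(Result);
end;

// Runs F once untimed as a warm-up, then Runs more times, and returns
// the smallest elapsed time in milliseconds.
function BestOf(F: TTestFunc; const Value: array of Integer;
  Runs: Integer): QWord;
var
  R: Integer;
  Start, Elapsed: QWord;
begin
  F(Value); // warm-up: touches the data before any timing starts
  Result := High(QWord);
  for R := 1 to Runs do
  begin
    Start := GetTickCount64;
    F(Value);
    Elapsed := GetTickCount64 - Start;
    if Elapsed < Result then
      Result := Elapsed;
  end;
end;

var
  Data: array of Integer;
  K: Integer;
begin
  SetLength(Data, 10000000);
  for K := 0 to High(Data) do
    Data[K] := K;
  WriteLn('CountOdd, best of 5: ', BestOf(@CountOdd, Data, 5), ' ms');
end.
```

Taking the minimum over several runs is one common choice for filtering out interference from other processes; reporting the median or the full spread would be equally reasonable.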

Title: Re: Optimizing the counter code
Post by: Okoba on May 20, 2022, 02:55:36 pm

@MathMan, I did not change anything with the release mode except removing -OoREGVAR from the project. $RangeChecks seems off.

Quote

Lazarus Default Release Mode $RangeChecks ON
Slow Test2 : 16 9999999 18.504 ms
Fastest Test3 : 0 9999999 2.749 ms
Faster Test4 : 15 9999999 16.129 ms

Lazarus Default Release Mode $RangeChecks OFF
Slow Test2 : 15 9999999 18.579 ms
Fastest Test3 : 0 9999999 2.703 ms
Faster Test4 : 16 9999999 16.148 ms

Title: Re: Optimizing the counter code
Post by: bytebites on May 21, 2022, 11:35:30 am

AMD Ryzen 9 3900X says (FPC 3.3.1):

Quote

O1
Slow Test2 : 21 9999999 21.020 ms
Fastest Test3 : 4 9999999 3.708 ms
Faster Test4 : 19 9999999 18.765 ms
O3
Slow Test2 : 10 9999999 9.786 ms
Fastest Test3 : 4 9999999 3.496 ms
Faster Test4 : 4 9999999 4.301 ms

Title: Re: Optimizing the counter code
Post by: BrunoK on May 21, 2022, 03:51:55 pm

A summary of the optimization levels can be found on the wiki page:
https://wiki.freepascal.org/Optimization#Optimization_Switches
Note that anything compiled with O2 and above includes the REGVAR option by default, unless it is disabled.

AFAIK $RangeChecks is not relevant, since the array was removed to concentrate on the loop and register optimization.

Given the different results for different processors, there is probably not much room for a general optimization strategy covering the complete palette of i386 / x86_64 machines. My laptop, an Intel i5-1135G7 @ 2.40GHz, shows a different ratio between the tests than my desktop, an i3-6100 CPU @ 3.70GHz.

bytebites's AMD Ryzen 9 3900X at O3 shows the same ratio as my laptop, but other testers have ratios in the same range as my desktop.

Note that the initial code involved a var parameter in the Check(I) function that prevented I from being register cached. The function was inlined, and apparently the compiler managed to eliminate that code at O4.

My opinion is that with modern processors, a speed improvement in program code becomes significant only when there is at least a 15 % difference across multiple runs and different call orders of the methods.
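
As a rough illustration of that rule of thumb, the check could be written as below. The function names are hypothetical, and measuring the difference relative to the slower timing is an assumption made for this sketch; the post does not say which timing the 15 % refers to. The sample values are the -O4 timings (in ms) posted earlier in this thread.

```pascal
program SignificanceSketch;
{$mode objfpc}

// Relative difference between two timings, measured against the slower
// one (an assumption of this sketch).
function RelativeDiff(A, B: Double): Double;
var
  Slow, Fast: Double;
begin
  if A > B then
  begin
    Slow := A;
    Fast := B;
  end
  else
  begin
    Slow := B;
    Fast := A;
  end;
  if Slow = 0 then
    Exit(0);
  Result := (Slow - Fast) / Slow;
end;

function IsSignificant(A, B: Double; Threshold: Double = 0.15): Boolean;
begin
  Result := RelativeDiff(A, B) >= Threshold;
end;

begin
  WriteLn('Test2 vs Test3: ', IsSignificant(18.435, 4.333));
  WriteLn('Test2 vs Test4: ', IsSignificant(18.435, 15.913));
end.
```

Under this definition the Test2/Test4 gap from that run is roughly 13.7 % and falls below the threshold, while the Test2/Test3 gap is far above it. A single pair of timings is of course not enough; the same comparison would have to hold across several runs and call orders.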