# Lazarus

## Free Pascal => General => Topic started by: Okoba on May 14, 2022, 08:21:56 am

Title: Optimizing the counter code
Post by: Okoba on May 14, 2022, 08:21:56 am
Hello,

I have a simple code checking for a value.
In the first test, it does everything in one go, but I want to update the counter an skip some values and report status.
If I do the update of "I" in the function, everything is fine, but If I call a function to check even if inline, because of "var" parameter, FPC does not optimize the code.
I like to know if this is a proper response of FPC or it is an optimization missed.

This is a very simplified case of the problem, in the real work, I need to change Value of I depending the results. So I can not do it in the main function, and need to call another function for code maintainability.

Tested on Trunk FPC for Win64.

Code: Pascal  [Select][+][-]
1. program project1;
2.
3. uses
4.   Classes,
5.   SysUtils;
6.
7.   function Test1(const Value: array of Integer): Int64;
8.   var
9.     I: Integer;
10.   begin
11.     Result := 0;
12.
13.     I := 0;
14.     repeat
15.       repeat
16.         if Odd(Value[I]) then
17.           Result += 1;
18.         I += 1;
19.       until I = Length(Value);
20.     until True;
21.   end;
22.
23.   function Check(var I: Integer): Boolean; inline;
24.   begin
25.     Result := True;
26.   end;
27.
28.   function Test2(const Value: array of Integer): Int64;
29.   var
30.     I: Integer;
31.   begin
32.     Result := 0;
33.
34.     I := 0;
35.     repeat
36.       repeat
37.         if Odd(Value[I]) then
38.           Result += 1;
39.         I += 1;
40.       until I = Length(Value);
41.     until Check(I);
42.   end;
43.
44.   function Test3(const Value: array of Integer): Int64;
45.   var
46.     I, TempI: Integer;
47.   begin
48.     Result := 0;
49.
50.     I := 0;
51.     repeat
52.       TempI := I;
53.       repeat
54.         if Odd(Value[TempI]) then
55.           Result += 1;
56.         TempI += 1;
57.       until TempI = Length(Value);
58.       I := TempI;
59.     until Check(I);
60.   end;
61.
62. var
63.   Value: array of Integer;
64.   Tick: QWord;
65. begin
66.   SetLength(Value, 1000000000);
67.
68.   Tick := GetTickCount64;
69.   Test1(Value);
70.   WriteLn(GetTickCount64 - Tick);
71.
72.   Tick := GetTickCount64;
73.   Test2(Value);
74.   WriteLn(GetTickCount64 - Tick);
75.
76.   Tick := GetTickCount64;
77.   Test3(Value);
78.   WriteLn(GetTickCount64 - Tick);
79.
81. end.
82.

In this test, Test2 is two times slower as FPC does not use register for I. In Test3, I made a Temp variant to tricked FPC to use register.
Title: Re: Optimizing the counter code
Post by: jamie on May 14, 2022, 04:04:23 pm
You could use the "If Value and 1 <> 0 Then" instead of using the intrinsic ODD function.

for a simple test, using the logic as indicated above and using older fpc 3.0.4 it generates a TEST asm after the AND instruction?

This is kind of extra code that isn't needed because the AND instruction already sets the flags that can be used in the branch statement following.

with fpc 3.2.0 I see that doing a simple AND logic like that does not generate that extra TEST asm but the intrinsic ODD still does.

Code: Pascal  [Select][+][-]
1. 000000010002C9A0 488d6424d8               lea    -0x28(%rsp),%rsp
2. unit1.pas:36                              If V and 1 <> 0 THen beep;
3. 000000010002C9A5 83e001                   and    \$0x1,%eax
4. 000000010002C9A8 85c0                     test   %eax,%eax
5. 000000010002C9AA 7405                     je     0x10002c9b1 <BUTTON1CLICK+17>
6.

Notice the unneeded TEST asm instruction.

with 3.2.0 this still happens with the ODD function.

Title: Re: Optimizing the counter code
Post by: Okoba on May 15, 2022, 07:09:29 am
Yes checking the oddness can be done faster. But the problem is loop speed not odd. Checking odd is just a thing to do in the test code.
Title: Re: Optimizing the counter code
Post by: 440bx on May 15, 2022, 08:18:41 am
but If I call a function to check even if inline, because of "var" parameter, FPC does not optimize the code.
I like to know if this is a proper response of FPC or it is an optimization missed.
Inspection of the generated assembly code shows that the function call is not the culprit.

The problem is related to register allocation.  For whatever reason, the compiler decided not to use a register for I but, it used a register for TempI.  The additional memory accesses seem to be the reason for Test2 being slower.

Here is the annotated assembly code:
Code: ASM  [Select][+][-]
1. .section .text.n_p\$project1_\$\$_test3\$array_of_longint\$\$int64,"x"   .section .text.n_p\$project1_\$\$_test2\$array_of_longint\$\$int64,"x"
2.         .balign 16,0x90                                                 .balign 16,0x90
3. .globl  P\$PROJECT1_\$\$_TEST3\$array_of_LONGINT\$\$INT64                .globl       P\$PROJECT1_\$\$_TEST2\$array_of_LONGINT\$\$INT64
4. P\$PROJECT1_\$\$_TEST3\$array_of_LONGINT\$\$INT64:                       P\$PROJECT1_\$\$_TEST2\$array_of_LONGINT\$\$INT64:
5. .Lc13:                                                             .Lc8:
6. .seh_proc P\$PROJECT1_\$\$_TEST3\$array_of_LONGINT\$\$INT64              .seh_proc P\$PROJECT1_\$\$_TEST2\$array_of_LONGINT\$\$INT64
7. .Ll22:                                                             .Ll13:
8. # [49] begin                                                       # [33] begin
9.         pushq   %rbp                                                    pushq   %rbp
10. .seh_pushreg %rbp                                                  .seh_pushreg %rbp
11. .Lc15:                                                             .Lc10:
12. .Lc16:                                                             .Lc11:
13.         movq    %rsp,%rbp                                               movq    %rsp,%rbp
14. .Lc17:                                                             .Lc12:
15.         leaq    -16(%rsp),%rsp                                          leaq    -16(%rsp),%rsp
16. .seh_stackalloc 16                                                 .seh_stackalloc 16
17. # Var Value located in register rcx                                # Var Value located in register rcx
18. # Var \$highVALUE located in register rdx                           # Var \$highVALUE located in register rdx
19. # Var \$result located in register rax                              # Var \$result located in register rax
20. # Var TempI located in register r9
21. .seh_endprologue                                                   .seh_endprologue
22. # Var I located at rbp-8, size=OS_S64                              # Var I located at rbp-8, size=OS_S64
23. # Var \$result located in register rax                              # Var \$result located in register rax
24. .Ll23:                                                             .Ll14:
25. # [50] Result := 0;                                                # [34] Result := 0;
26.         movq    \$0,%rax                                                 movq    \$0,%rax
27. .Ll24:                                                             .Ll15:
28. # [52] I := 0;                                                     # [36] I := 0;
29.         movq    \$0,-8(%rbp)                                             movq    \$0,-8(%rbp)
30. .Ll25:                                                             .Lj40:
31. # [54] TempI := I;                                                 .Ll16:
32.         movq    -8(%rbp),%r9  { move I to r9 }
33. .Lj66:
34. .Ll26:
35. # [56] if Odd(Value[TempI]) then                                   # [39] if Odd(Value[I]) then
36.                                                                         movq    -8(%rbp),%r8       { move I to r8 }
37.         leaq    (%rcx,%r9,4),%r8                                        movl    (%rcx,%r8,4),%r8d
38.         movl    (%r8),%r8d
39.         andl    \$1,%r8d                                                 andl    \$1,%r8d
40.         testb   %r8b,%r8b                                               testb   %r8b,%r8b
41.         je      .Lj70                                                   je      .Lj44
42. .Ll27:                                                             .Ll17:
43. # [57] Result += 1;                                                # [40] Result += 1;
44.         leaq    1(%rax),%r8                                             leaq    1(%rax),%r8
45.         movq    %r8,%rax                                                movq    %r8,%rax
46. .Lj70:                                                             .Lj44:
47. .Ll28:                                                             .Ll18:
48. # [58] TempI += 1;                                                 # [41] I += 1;
49.                                                                         movq    -8(%rbp),%r8       { additonal move }
50.         leaq    1(%r9),%r8                                              leaq    1(%r8),%r8
51.         movq    %r8,%r9                                                 movq    %r8,-8(%rbp)       { slower - 1 }
52. .Ll29:                                                             .Ll19:
53. # [59] until TempI = Length(Value);                                # [42] until I = Length(Value);
54.         leaq    1(%rdx),%r8                                             leaq    1(%rdx),%r8
55.         cmpq    %r9,%r8                                                 cmpq    -8(%rbp),%r8       { slower }
56.         jne     .Lj66                                                   jne     .Lj40
57. .Ll30:                                                             .Ll20:
58. # [60] I := TempI;
59.         movq    %r9,-8(%rbp)   { match 1 }
60. .Ll31:
61. # [62] end;                                                        # [44] end;
62.         leaq    (%rbp),%rsp                                             leaq    (%rbp),%rsp  { no call to Check since it does nothing }
63.         popq    %rbp                                                    popq    %rbp
64.         ret                                                             ret
65. .seh_endproc                                                       .seh_endproc
66. .Lc14:
67. .Lt5:
68. .Ll32:
69.
70.
Note: in the program I changed the local variables from Integer to ptrint to see if that would make a difference.  The assembly code shown above is the result of optimization -O4.

Notice that the compiler completely eliminated the call to the Check function.  Not only it inlined it, it got rid of it, as it should have since it always returns TRUE therefore it has no effect on the "repeat" statement that calls it.

HTH.

Title: Re: Optimizing the counter code
Post by: jamie on May 15, 2022, 12:11:45 pm
Hmm

You don't think it's strange the compiler is generating redundant TEST asm instruction following a AND instruction ?

the AND asm instruction already sets up the flags in the register so why is it being tested again ?

if you perform the AND inline with the user code with newer versions you'll see the TEST asm is removed and only the AND is needed but use the ODD function and it then adds the redundant TEST asm instruction.

I know the intel processor leaves a lot to be desired when it comes to flag settings after an operation. It's not like the old 6502 or related processors where almost every ASM instruction would set a flag after the fact so a simple branch could be used immediately, this saves space and processing time.
Title: Re: Optimizing the counter code
Post by: Okoba on May 15, 2022, 02:27:50 pm
I think it is an optimization opportunity (especially as I tested with gcc and it optimizes the both case), but I welcome more input.
Title: Re: Optimizing the counter code
Post by: BrunoK on May 15, 2022, 04:53:04 pm
FPC 3.2.2 Win_x86_64

I have the same time discrepancy using -O1 -OoREGVAR.

Something seems not correct in evaluating register allocation in Test2, "I" is being accessed on in stacks temporary allocation position.
In Test1 and Test3, "i" and "TempI" are register cached/allocated for the repeat loop.

Given the big difference in timings (1:>2) it might be a good idea that point to be checked (and maybe discarded if something wrong can be pointed in the code) by someone qualified.

Title: Re: Optimizing the counter code
Post by: Okoba on May 15, 2022, 05:55:12 pm
in this case this is 2X slower. But in my real case use (that I simplified for this forum post to show the point), it slows down the code 4X.
As it seems a real clear optimization point.
Title: Re: Optimizing the counter code
Post by: PascalDragon on May 16, 2022, 02:09:48 pm
The problem is likely the following (not verified): Due to the var parameter of the Check function the compiler needs to be able to take the address of the I variable. This flag prohibits that the variable can reside in a register. The code of the Check-function is then inlined and the need for taking the address of the variable is gone, but the flag that prohibits the register optimization has not been reset and thus worse code is generated.

Title: Re: Optimizing the counter code
Post by: Okoba on May 17, 2022, 03:29:00 pm
Done: https://gitlab.com/freepascal.org/fpc/source/-/issues/39725
Title: Re: Optimizing the counter code
Post by: BrunoK on May 17, 2022, 05:55:45 pm
Program with comments that tries to put in evidence the problem. To be compiled with -O1 -OoREGVAR
Code: Pascal  [Select][+][-]
1. program pgmLoopReg220517C;
2.
3. { https://forum.lazarus.freepascal.org/index.php/topic,59340.0.html?PHPSESSID=l1vtb3loamsjbth1c8l3lm80a7 }
4.
5. uses
6.   Classes, SysUtils;
7.
8. { FPC 3.2.2 Win_x86_64 (also in Win_i386)
9.
10.   Compiling using
11.   -O0 -OoREGVAR (no optimization + optimize for register variables) optimizes
12.                 effectively for both Test3 and Test4 register allocation, except "I"
13.                 is not register cached for both methods.
14.                 Interestingly, in this case Test4 is much faster than when
15.                 compiled with -O1 -OoREGVAR but generated code is not the same
16.                 for I := I + 1 (and most of the rest of the code).
17.   -O1 -OoREGVAR Same register allocations for Test3 and Test4 but I := I + 1
18.                 generates "incl   -0x4(%rbp)" that seems ununderstandably and
19.                 dramaticaly slow.
20.                 Test4 takes >2 times Test3 time for a basically simple
21.                 (and common looking) loop.
22.
23.   Removing the var in Check() makes "I" fully register cached.
24. }
25.
26.   function Check(var I: integer): boolean;
27.   begin
28.     Result := True;
29.   end;
30.
31.   function Test3(const Value: array of integer): integer;
32.   var
33.     I, TempI: integer;
34.   begin
35.     Result := 0;
36.
37.     I := 0;
38.     repeat
39.       TempI := I;
40.       repeat
41.         Inc(Result);
42.         if Value[I] <> I then // Do something apparently useful
43.           Exit;
44.         TempI += 1;
45.         // I := I + 1; // Slower
46.         I := TempI; // Faster
47.       until I = Length(Value);
48.     until Check(I); // until True;
49.   end;
50.
51.   function Test4(const Value: array of integer): integer;
52.   var
53.     I, TempI: integer;
54.     lTmpInt : integer = 1;
55.   begin
56.     Result := 0;
57.
58.     I := 0;
59.     repeat
60.       TempI := I;
61.       repeat
62.         Inc(Result);
63.         if Value[I] <> I then // Do something apparently useful
64.           Exit;
65.         TempI += 1;
66.         I := I + 1;    // Slower
67.         // I := TempI; // Faster
68.       until I = Length(Value);
69.     until Check(I); // until True; makes it as fast as test3
70.   end;
71.
72. var
73.   Value: array of integer;
74.   Tick: QWord;
75.   vRes: integer;
76.   i: integer;
77. begin
78.   SetLength(Value, 100000000{0});
79.   for i := 0 to Length(Value) - 1 do // Init array elements
80.     Value[i] := i;
81.
82.   Tick := GetTickCount64;
83.   vRes := Test3(Value);
84.   WriteLn('Faster Test3 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
85.
86.   Tick := GetTickCount64;
87.   vRes := Test4(Value);
88.   WriteLn('Slow   Test4 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
89.
91. end.

Side by side Test3 and Test4
Code: Pascal  [Select][+][-]
1. { Test3 }                                 { Test4 }
2. begin                                     begin
3. push   %rbp                               push   %rbp
4. mov    %rsp,%rbp                          mov    %rsp,%rbp
5. lea    -0x50(%rsp),%rsp                   lea    -0x50(%rsp),%rsp
6. mov    %rbx,-0x28(%rbp)                   mov    %rbx,-0x28(%rbp)
7. mov    %rdi,-0x20(%rbp)                   mov    %rdi,-0x20(%rbp)
8. mov    %rsi,-0x18(%rbp)                   mov    %rsi,-0x18(%rbp)
9. mov    %r12,-0x10(%rbp)                   mov    %r12,-0x10(%rbp)
10. mov    %rcx,%rbx                          mov    %rcx,%rbx
11. mov    %rdx,%rsi                          mov    %rdx,%rsi
12. Result := 0;                              Result := 0;
13. xor    %r12d,%r12d                        xor    %r12d,%r12d
14. I := 0;                                   I := 0;
15. movl   \$0x0,-0x4(%rbp)                    movl   \$0x0,-0x4(%rbp)
16. TempI := I;                               TempI := I;
17. mov    -0x4(%rbp),%edi                    mov    -0x4(%rbp),%edi
18. Inc(Result);                              Inc(Result);
19. inc    %r12d                              inc    %r12d
20. if Value[I] <> I then                     if Value[I] <> I then
21. movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
22. mov    (%rbx,%rax,4),%eax                 mov    (%rbx,%rax,4),%eax
23. cmp    -0x4(%rbp),%eax                    cmp    -0x4(%rbp),%eax
24. jne    0x10000178a <TEST3+90>             jne    0x10000180a <TEST4+90>
25. TempI += 1;                               TempI += 1;
26. inc    %edi                               inc    %edi
27. I := TempI; // Faster                     I := I + 1; // Slower
28. mov    %edi,-0x4(%rbp)                    incl   -0x4(%rbp)
29. until I = Length(Value);                  until I = Length(Value);
30. movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
31. lea    0x1(%rsi),%rdx                     lea    0x1(%rsi),%rdx
32. cmp    %rdx,%rax                          cmp    %rdx,%rax
33. jne    0x10000175c <TEST3+44>             jne    0x1000017dc <TEST4+44>
34. until Check(I); // until True;            until Check(I); // until True; makes it as fast as test3
35. lea    -0x4(%rbp),%rcx                    lea    -0x4(%rbp),%rcx
36. callq  0x100001720 <CHECK>                callq  0x100001720 <CHECK>
37. test   %al,%al                            test   %al,%al
38. je     0x100001759 <TEST3+41>             je     0x1000017d9 <TEST4+41>
39. end;                                      end;
40. mov    %r12d,%eax                         mov    %r12d,%eax
41. mov    -0x28(%rbp),%rbx                   mov    -0x28(%rbp),%rbx
42. mov    -0x20(%rbp),%rdi                   mov    -0x20(%rbp),%rdi
43. mov    -0x18(%rbp),%rsi                   mov    -0x18(%rbp),%rsi
44. mov    -0x10(%rbp),%r12                   mov    -0x10(%rbp),%r12
45. lea    0x0(%rbp),%rsp                     lea    0x0(%rbp),%rsp
46. pop    %rbp                               pop    %rbp
47. retq                                      retq
48.
Title: Re: Optimizing the counter code
Post by: BrunoK on May 17, 2022, 06:06:24 pm
As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.
Title: Re: Optimizing the counter code
Post by: PascalDragon on May 18, 2022, 09:07:24 am
As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.
Title: Re: Optimizing the counter code
Post by: BrunoK on May 18, 2022, 11:01:33 am
It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.
I was so much expecting push's and pop's that I didn't think about moving to and restoring from extended stack space ...  (sigh)
Title: Re: Optimizing the counter code
Post by: MathMan on May 18, 2022, 12:58:45 pm
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Cheers,
MathMan
Title: Re: Optimizing the counter code
Post by: BrunoK on May 18, 2022, 03:39:06 pm
@MathMan
Replacing
Code: Pascal  [Select][+][-]
1.         I := I + 1;
by
Code: Pascal  [Select][+][-]
1.         TempI := I;
2.         Inc(TempI);
3.         I := TempI;
makes Test4 nearly as fast as Test3.

Its like using %rax (or %eax) would be 'overheating' of being used :-)
Title: Re: Optimizing the counter code
Post by: PascalDragon on May 19, 2022, 08:52:17 am
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.
Title: Re: Optimizing the counter code
Post by: MathMan on May 19, 2022, 12:24:52 pm
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)
Title: Re: Optimizing the counter code
Post by: PascalDragon on May 19, 2022, 01:45:16 pm
I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

Title: Re: Optimizing the counter code
Post by: BrunoK on May 19, 2022, 04:05:15 pm
Simplified project that keeps what I think being relevant.
3 slightly different Test routines that show alternatives.

Timings Win64 :
Code: Pascal  [Select][+][-]
1.                    ------------ -O1 -OoREGVAR------------------
2.                  Ticks  Iterations                       Timings
3.  Slow   Test2 :    219    99999999                    223.007 ms
4. Fastest Test3 :     31    99999999                     31.585 ms
5. Faster  Test4 :     63    99999999                     60.238 ms
Title: Re: Optimizing the counter code
Post by: marcov on May 19, 2022, 05:24:31 pm
I tried with FPC 3.2.2 and trunk for win32. In both cases test2 and test4 are about equal, while test3 is about 3.5 times faster on an old Ivy Bridge (core i7 3770), which is not a -lake.

Code: [Select]
`3.3.1:     Slow   Test2 :     15     9999999                     18,255 ms    Fastest Test3 :      0     9999999                      5,339 ms    Faster  Test4 :     16     9999999                     18,452 ms`
Code: [Select]
`3.2.2:     Slow   Test2 :     15     9999999                     18,297 ms    Fastest Test3 :      0     9999999                      5,390 ms    Faster  Test4 :     16     9999999                     18,520 ms`
Title: Re: Optimizing the counter code
Post by: Okoba on May 19, 2022, 05:51:33 pm
Results on Trunk FPC and Lazarus, on i9-9900K
Quote
-O0 -OoREGVAR
Slow   Test2 :     16     9999999                     16.528 ms
Fastest Test3 :      0     9999999                      8.424 ms
Faster  Test4 :     15     9999999                     15.142 ms

-O1 -OoREGVAR
Slow   Test2 :     15     9999999                     18.844 ms
Fastest Test3 :      0     9999999                      4.232 ms
Faster  Test4 :     16     9999999                     15.830 ms

-O2 -OoREGVAR
Slow   Test2 :     15     9999999                     18.834 ms
Fastest Test3 :      0     9999999                      4.430 ms
Faster  Test4 :     16     9999999                     16.207 ms

-O3 -OoREGVAR
Slow   Test2 :     16     9999999                     18.702 ms
Fastest Test3 :      0     9999999                      4.311 ms
Faster  Test4 :     15     9999999                     15.903 ms

-O4 -OoREGVAR
Slow   Test2 :     16     9999999                     18.435 ms
Fastest Test3 :      0     9999999                      4.333 ms
Faster  Test4 :     15     9999999                     15.913 ms

Lazarus Default Release Mode -OoREGVAR
Slow   Test2 :     16     9999999                     18.525 ms
Fastest Test3 :      0     9999999                      2.695 ms
Faster  Test4 :     15     9999999                     16.014 ms

Lazarus Default Release Mode
Slow   Test2 :     16     9999999                     19.037 ms
Fastest Test3 :      0     9999999                      2.696 ms
Faster  Test4 :     15     9999999                     16.096 ms
Title: Re: Optimizing the counter code
Post by: BrunoK on May 19, 2022, 06:37:48 pm
My system is
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz   2.42 GHz

Those benchmarking issues are (always, it was easier in the dual core epoch)  very frustrating.

I downloaded and retested the application I posted on this forum and Test4 is still much faster than Test2.

Was there a bug in the application I published ?

Title: Re: Optimizing the counter code
Post by: MathMan on May 20, 2022, 11:55:02 am
@marcov:
Your test made me unsure, whether I was tricked by false memory about the issue in general. So I re-read the official Intel assembler optimization guide. Indeed the issue exists (as I remembered) - it is mentioned as "dense RMW issue" for Sandy-Bridge architecture (and former). The point is that the Loop-Streming-Detector has issues with the amount of micro-ops generated by these instruction types in dense loops.

It was however lifted (if I understood the docs correct, they are a bit vague in that respect) at least with Haswell ongoing, maybe even with Ivy Bridge.

@Okoba:
Thanks for testing again. What is puzzling me is that Lazarus default release mode is generating faster code, than e.g. -O4 -OoREGVAR. Did you compile with e.g. Range Checks enabled when not in Lazarus default release mode?

@BrunoK:
Could it be that you are testing on a Laptop? If so benchmarks can vary a lot, if not done with extreme pre-caution like fixing the clock, binding to a specific core, executing initial excessive warm-up code etc.

Takeaway - I'll do some experiments on my systems (one Nehalem, one Skylake) and see if I can pinpoint this down to a reproducible case. If so I'll add this to Okobas case in the bug tracker. Otherwise I'll do some heavy backpaddling to save face  ;)

Cheers,
MathMan
Title: Re: Optimizing the counter code
Post by: Okoba on May 20, 2022, 02:55:36 pm
@MathMan, I did not change anything with the release mode except removing -OoREGVAR from the project. \$RangeChecks seems off.

Quote
Lazarus Default Release Mode \$RangeChecks ON
Slow   Test2 :     16     9999999                     18.504 ms
Fastest Test3 :      0     9999999                      2.749 ms
Faster  Test4 :     15     9999999                     16.129 ms

Lazarus Default Release Mode \$RangeChecks OFF
Slow   Test2 :     15     9999999                     18.579 ms
Fastest Test3 :      0     9999999                      2.703 ms
Faster  Test4 :     16     9999999                     16.148 ms
Title: Re: Optimizing the counter code
Post by: bytebites on May 21, 2022, 11:35:30 am
Amd Rysen 9 3900X says (Fpc 3.3.1):
Quote
O1
Slow   Test2 :     21     9999999                     21.020 ms
Fastest Test3 :      4     9999999                      3.708 ms
Faster  Test4 :     19     9999999                     18.765 ms
O3
Slow   Test2 :     10     9999999                      9.786 ms
Fastest Test3 :      4     9999999                      3.496 ms
Faster  Test4 :      4     9999999                      4.301 ms
Title: Re: Optimizing the counter code
Post by: BrunoK on May 21, 2022, 03:51:55 pm
A synthesis of what the optimization levels are can be found in the wiki page :
https://wiki.freepascal.org/Optimization#Optimization_Switches
Note that anything compiled with O2 and above include, if not disabled, the REGVAR option by default.

AFAIK \$RangeChecks is not relevant since the array was removed to concentrate on the loop and register optimization.

Given the different results for different processors there is probably NOT MUCH ROOM for a general optimization strategy for the complete palette of i386 / x86_64 machines. My laptop that is an Intel i5-1135G7 @ 2.40GHz does have different ratio between tests from a desktop that is an i3-6100 CPU @ 3.70GHz 3.70 GHz.

bytebites’s Amd Rysen 9 3900X in O3 shows the same ratio as my laptop but other testers have ratios in the same range as my desktop.

One must notice that the initial code involved a var parameter in the check(i) function that prevented I to be register cached. The function was inline'd and apparently the compiler managed to eliminate that code in mode O4.

My opinion is that with modern processors, speed improvement in program code becomes significant only when there is at least a 15 % difference across multiple runs and methods order call.