### Bookstore

 Computer Math and Games in Pascal (preview) Lazarus Handbook

### Author Topic: Optimizing the counter code  (Read 1375 times)

#### Okoba

• Sr. Member
• Posts: 301
##### Optimizing the counter code
« on: May 14, 2022, 08:21:56 am »
Hello,

I have a simple code checking for a value.
In the first test, it does everything in one go, but I want to update the counter an skip some values and report status.
If I do the update of "I" in the function, everything is fine, but If I call a function to check even if inline, because of "var" parameter, FPC does not optimize the code.
I like to know if this is a proper response of FPC or it is an optimization missed.

This is a very simplified case of the problem, in the real work, I need to change Value of I depending the results. So I can not do it in the main function, and need to call another function for code maintainability.

Tested on Trunk FPC for Win64.

Code: Pascal  [Select][+][-]
1. program project1;
2.
3. uses
4.   Classes,
5.   SysUtils;
6.
7.   function Test1(const Value: array of Integer): Int64;
8.   var
9.     I: Integer;
10.   begin
11.     Result := 0;
12.
13.     I := 0;
14.     repeat
15.       repeat
16.         if Odd(Value[I]) then
17.           Result += 1;
18.         I += 1;
19.       until I = Length(Value);
20.     until True;
21.   end;
22.
23.   function Check(var I: Integer): Boolean; inline;
24.   begin
25.     Result := True;
26.   end;
27.
28.   function Test2(const Value: array of Integer): Int64;
29.   var
30.     I: Integer;
31.   begin
32.     Result := 0;
33.
34.     I := 0;
35.     repeat
36.       repeat
37.         if Odd(Value[I]) then
38.           Result += 1;
39.         I += 1;
40.       until I = Length(Value);
41.     until Check(I);
42.   end;
43.
44.   function Test3(const Value: array of Integer): Int64;
45.   var
46.     I, TempI: Integer;
47.   begin
48.     Result := 0;
49.
50.     I := 0;
51.     repeat
52.       TempI := I;
53.       repeat
54.         if Odd(Value[TempI]) then
55.           Result += 1;
56.         TempI += 1;
57.       until TempI = Length(Value);
58.       I := TempI;
59.     until Check(I);
60.   end;
61.
62. var
63.   Value: array of Integer;
64.   Tick: QWord;
65. begin
66.   SetLength(Value, 1000000000);
67.
68.   Tick := GetTickCount64;
69.   Test1(Value);
70.   WriteLn(GetTickCount64 - Tick);
71.
72.   Tick := GetTickCount64;
73.   Test2(Value);
74.   WriteLn(GetTickCount64 - Tick);
75.
76.   Tick := GetTickCount64;
77.   Test3(Value);
78.   WriteLn(GetTickCount64 - Tick);
79.
81. end.
82.

In this test, Test2 is two times slower as FPC does not use register for I. In Test3, I made a Temp variant to tricked FPC to use register.

#### jamie

• Hero Member
• Posts: 4582
##### Re: Optimizing the counter code
« Reply #1 on: May 14, 2022, 04:04:23 pm »
You could use the "If Value and 1 <> 0 Then" instead of using the intrinsic ODD function.

for a simple test, using the logic as indicated above and using older fpc 3.0.4 it generates a TEST asm after the AND instruction?

This is kind of extra code that isn't needed because the AND instruction already sets the flags that can be used in the branch statement following.

with fpc 3.2.0 I see that doing a simple AND logic like that does not generate that extra TEST asm but the intrinsic ODD still does.

Code: Pascal  [Select][+][-]
1. 000000010002C9A0 488d6424d8               lea    -0x28(%rsp),%rsp
2. unit1.pas:36                              If V and 1 <> 0 THen beep;
3. 000000010002C9A5 83e001                   and    \$0x1,%eax
4. 000000010002C9A8 85c0                     test   %eax,%eax
5. 000000010002C9AA 7405                     je     0x10002c9b1 <BUTTON1CLICK+17>
6.

Notice the unneeded TEST asm instruction.

with 3.2.0 this still happens with the ODD function.

The only true wisdom is knowing you know nothing

#### Okoba

• Sr. Member
• Posts: 301
##### Re: Optimizing the counter code
« Reply #2 on: May 15, 2022, 07:09:29 am »
Yes checking the oddness can be done faster. But the problem is loop speed not odd. Checking odd is just a thing to do in the test code.

#### 440bx

• Hero Member
• Posts: 2755
##### Re: Optimizing the counter code
« Reply #3 on: May 15, 2022, 08:18:41 am »
but If I call a function to check even if inline, because of "var" parameter, FPC does not optimize the code.
I like to know if this is a proper response of FPC or it is an optimization missed.
Inspection of the generated assembly code shows that the function call is not the culprit.

The problem is related to register allocation.  For whatever reason, the compiler decided not to use a register for I but, it used a register for TempI.  The additional memory accesses seem to be the reason for Test2 being slower.

Here is the annotated assembly code:
Code: ASM  [Select][+][-]
1. .section .text.n_p\$project1_\$\$_test3\$array_of_longint\$\$int64,"x"   .section .text.n_p\$project1_\$\$_test2\$array_of_longint\$\$int64,"x"
2.         .balign 16,0x90                                                 .balign 16,0x90
3. .globl  P\$PROJECT1_\$\$_TEST3\$array_of_LONGINT\$\$INT64                .globl       P\$PROJECT1_\$\$_TEST2\$array_of_LONGINT\$\$INT64
4. P\$PROJECT1_\$\$_TEST3\$array_of_LONGINT\$\$INT64:                       P\$PROJECT1_\$\$_TEST2\$array_of_LONGINT\$\$INT64:
5. .Lc13:                                                             .Lc8:
6. .seh_proc P\$PROJECT1_\$\$_TEST3\$array_of_LONGINT\$\$INT64              .seh_proc P\$PROJECT1_\$\$_TEST2\$array_of_LONGINT\$\$INT64
7. .Ll22:                                                             .Ll13:
8. # [49] begin                                                       # [33] begin
9.         pushq   %rbp                                                    pushq   %rbp
10. .seh_pushreg %rbp                                                  .seh_pushreg %rbp
11. .Lc15:                                                             .Lc10:
12. .Lc16:                                                             .Lc11:
13.         movq    %rsp,%rbp                                               movq    %rsp,%rbp
14. .Lc17:                                                             .Lc12:
15.         leaq    -16(%rsp),%rsp                                          leaq    -16(%rsp),%rsp
16. .seh_stackalloc 16                                                 .seh_stackalloc 16
17. # Var Value located in register rcx                                # Var Value located in register rcx
18. # Var \$highVALUE located in register rdx                           # Var \$highVALUE located in register rdx
19. # Var \$result located in register rax                              # Var \$result located in register rax
20. # Var TempI located in register r9
21. .seh_endprologue                                                   .seh_endprologue
22. # Var I located at rbp-8, size=OS_S64                              # Var I located at rbp-8, size=OS_S64
23. # Var \$result located in register rax                              # Var \$result located in register rax
24. .Ll23:                                                             .Ll14:
25. # [50] Result := 0;                                                # [34] Result := 0;
26.         movq    \$0,%rax                                                 movq    \$0,%rax
27. .Ll24:                                                             .Ll15:
28. # [52] I := 0;                                                     # [36] I := 0;
29.         movq    \$0,-8(%rbp)                                             movq    \$0,-8(%rbp)
30. .Ll25:                                                             .Lj40:
31. # [54] TempI := I;                                                 .Ll16:
32.         movq    -8(%rbp),%r9  { move I to r9 }
33. .Lj66:
34. .Ll26:
35. # [56] if Odd(Value[TempI]) then                                   # [39] if Odd(Value[I]) then
36.                                                                         movq    -8(%rbp),%r8       { move I to r8 }
37.         leaq    (%rcx,%r9,4),%r8                                        movl    (%rcx,%r8,4),%r8d
38.         movl    (%r8),%r8d
39.         andl    \$1,%r8d                                                 andl    \$1,%r8d
40.         testb   %r8b,%r8b                                               testb   %r8b,%r8b
41.         je      .Lj70                                                   je      .Lj44
42. .Ll27:                                                             .Ll17:
43. # [57] Result += 1;                                                # [40] Result += 1;
44.         leaq    1(%rax),%r8                                             leaq    1(%rax),%r8
45.         movq    %r8,%rax                                                movq    %r8,%rax
46. .Lj70:                                                             .Lj44:
47. .Ll28:                                                             .Ll18:
48. # [58] TempI += 1;                                                 # [41] I += 1;
49.                                                                         movq    -8(%rbp),%r8       { additonal move }
50.         leaq    1(%r9),%r8                                              leaq    1(%r8),%r8
51.         movq    %r8,%r9                                                 movq    %r8,-8(%rbp)       { slower - 1 }
52. .Ll29:                                                             .Ll19:
53. # [59] until TempI = Length(Value);                                # [42] until I = Length(Value);
54.         leaq    1(%rdx),%r8                                             leaq    1(%rdx),%r8
55.         cmpq    %r9,%r8                                                 cmpq    -8(%rbp),%r8       { slower }
56.         jne     .Lj66                                                   jne     .Lj40
57. .Ll30:                                                             .Ll20:
58. # [60] I := TempI;
59.         movq    %r9,-8(%rbp)   { match 1 }
60. .Ll31:
61. # [62] end;                                                        # [44] end;
62.         leaq    (%rbp),%rsp                                             leaq    (%rbp),%rsp  { no call to Check since it does nothing }
63.         popq    %rbp                                                    popq    %rbp
64.         ret                                                             ret
65. .seh_endproc                                                       .seh_endproc
66. .Lc14:
67. .Lt5:
68. .Ll32:
69.
70.
Note: in the program I changed the local variables from Integer to ptrint to see if that would make a difference.  The assembly code shown above is the result of optimization -O4.

Notice that the compiler completely eliminated the call to the Check function.  Not only it inlined it, it got rid of it, as it should have since it always returns TRUE therefore it has no effect on the "repeat" statement that calls it.

HTH.

FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### jamie

• Hero Member
• Posts: 4582
##### Re: Optimizing the counter code
« Reply #4 on: May 15, 2022, 12:11:45 pm »
Hmm

You don't think it's strange the compiler is generating redundant TEST asm instruction following a AND instruction ?

the AND asm instruction already sets up the flags in the register so why is it being tested again ?

if you perform the AND inline with the user code with newer versions you'll see the TEST asm is removed and only the AND is needed but use the ODD function and it then adds the redundant TEST asm instruction.

I know the intel processor leaves a lot to be desired when it comes to flag settings after an operation. It's not like the old 6502 or related processors where almost every ASM instruction would set a flag after the fact so a simple branch could be used immediately, this saves space and processing time.
The only true wisdom is knowing you know nothing

#### Okoba

• Sr. Member
• Posts: 301
##### Re: Optimizing the counter code
« Reply #5 on: May 15, 2022, 02:27:50 pm »
I think it is an optimization opportunity (especially as I tested with gcc and it optimizes the both case), but I welcome more input.

#### BrunoK

• Sr. Member
• Posts: 288
• Retired programmer
##### Re: Optimizing the counter code
« Reply #6 on: May 15, 2022, 04:53:04 pm »
FPC 3.2.2 Win_x86_64

I have the same time discrepancy using -O1 -OoREGVAR.

Something seems not correct in evaluating register allocation in Test2, "I" is being accessed on in stacks temporary allocation position.
In Test1 and Test3, "i" and "TempI" are register cached/allocated for the repeat loop.

Given the big difference in timings (1:>2) it might be a good idea that point to be checked (and maybe discarded if something wrong can be pointed in the code) by someone qualified.

#### Okoba

• Sr. Member
• Posts: 301
##### Re: Optimizing the counter code
« Reply #7 on: May 15, 2022, 05:55:12 pm »
in this case this is 2X slower. But in my real case use (that I simplified for this forum post to show the point), it slows down the code 4X.
As it seems a real clear optimization point.

#### PascalDragon

• Hero Member
• Posts: 4008
• Compiler Developer
##### Re: Optimizing the counter code
« Reply #8 on: May 16, 2022, 02:09:48 pm »
The problem is likely the following (not verified): Due to the var parameter of the Check function the compiler needs to be able to take the address of the I variable. This flag prohibits that the variable can reside in a register. The code of the Check-function is then inlined and the need for taking the address of the variable is gone, but the flag that prohibits the register optimization has not been reset and thus worse code is generated.

#### Okoba

• Sr. Member
• Posts: 301
##### Re: Optimizing the counter code
« Reply #9 on: May 17, 2022, 03:29:00 pm »

#### BrunoK

• Sr. Member
• Posts: 288
• Retired programmer
##### Re: Optimizing the counter code
« Reply #10 on: May 17, 2022, 05:55:45 pm »
Program with comments that tries to put in evidence the problem. To be compiled with -O1 -OoREGVAR
Code: Pascal  [Select][+][-]
1. program pgmLoopReg220517C;
2.
3. { https://forum.lazarus.freepascal.org/index.php/topic,59340.0.html?PHPSESSID=l1vtb3loamsjbth1c8l3lm80a7 }
4.
5. uses
6.   Classes, SysUtils;
7.
8. { FPC 3.2.2 Win_x86_64 (also in Win_i386)
9.
10.   Compiling using
11.   -O0 -OoREGVAR (no optimization + optimize for register variables) optimizes
12.                 effectively for both Test3 and Test4 register allocation, except "I"
13.                 is not register cached for both methods.
14.                 Interestingly, in this case Test4 is much faster than when
15.                 compiled with -O1 -OoREGVAR but generated code is not the same
16.                 for I := I + 1 (and most of the rest of the code).
17.   -O1 -OoREGVAR Same register allocations for Test3 and Test4 but I := I + 1
18.                 generates "incl   -0x4(%rbp)" that seems ununderstandably and
19.                 dramaticaly slow.
20.                 Test4 takes >2 times Test3 time for a basically simple
21.                 (and common looking) loop.
22.
23.   Removing the var in Check() makes "I" fully register cached.
24. }
25.
26.   function Check(var I: integer): boolean;
27.   begin
28.     Result := True;
29.   end;
30.
31.   function Test3(const Value: array of integer): integer;
32.   var
33.     I, TempI: integer;
34.   begin
35.     Result := 0;
36.
37.     I := 0;
38.     repeat
39.       TempI := I;
40.       repeat
41.         Inc(Result);
42.         if Value[I] <> I then // Do something apparently useful
43.           Exit;
44.         TempI += 1;
45.         // I := I + 1; // Slower
46.         I := TempI; // Faster
47.       until I = Length(Value);
48.     until Check(I); // until True;
49.   end;
50.
51.   function Test4(const Value: array of integer): integer;
52.   var
53.     I, TempI: integer;
54.     lTmpInt : integer = 1;
55.   begin
56.     Result := 0;
57.
58.     I := 0;
59.     repeat
60.       TempI := I;
61.       repeat
62.         Inc(Result);
63.         if Value[I] <> I then // Do something apparently useful
64.           Exit;
65.         TempI += 1;
66.         I := I + 1;    // Slower
67.         // I := TempI; // Faster
68.       until I = Length(Value);
69.     until Check(I); // until True; makes it as fast as test3
70.   end;
71.
72. var
73.   Value: array of integer;
74.   Tick: QWord;
75.   vRes: integer;
76.   i: integer;
77. begin
78.   SetLength(Value, 100000000{0});
79.   for i := 0 to Length(Value) - 1 do // Init array elements
80.     Value[i] := i;
81.
82.   Tick := GetTickCount64;
83.   vRes := Test3(Value);
84.   WriteLn('Faster Test3 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
85.
86.   Tick := GetTickCount64;
87.   vRes := Test4(Value);
88.   WriteLn('Slow   Test4 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
89.
91. end.

Side by side Test3 and Test4
Code: Pascal  [Select][+][-]
1. { Test3 }                                 { Test4 }
2. begin                                     begin
3. push   %rbp                               push   %rbp
4. mov    %rsp,%rbp                          mov    %rsp,%rbp
5. lea    -0x50(%rsp),%rsp                   lea    -0x50(%rsp),%rsp
6. mov    %rbx,-0x28(%rbp)                   mov    %rbx,-0x28(%rbp)
7. mov    %rdi,-0x20(%rbp)                   mov    %rdi,-0x20(%rbp)
8. mov    %rsi,-0x18(%rbp)                   mov    %rsi,-0x18(%rbp)
9. mov    %r12,-0x10(%rbp)                   mov    %r12,-0x10(%rbp)
10. mov    %rcx,%rbx                          mov    %rcx,%rbx
11. mov    %rdx,%rsi                          mov    %rdx,%rsi
12. Result := 0;                              Result := 0;
13. xor    %r12d,%r12d                        xor    %r12d,%r12d
14. I := 0;                                   I := 0;
15. movl   \$0x0,-0x4(%rbp)                    movl   \$0x0,-0x4(%rbp)
16. TempI := I;                               TempI := I;
17. mov    -0x4(%rbp),%edi                    mov    -0x4(%rbp),%edi
18. Inc(Result);                              Inc(Result);
19. inc    %r12d                              inc    %r12d
20. if Value[I] <> I then                     if Value[I] <> I then
21. movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
22. mov    (%rbx,%rax,4),%eax                 mov    (%rbx,%rax,4),%eax
23. cmp    -0x4(%rbp),%eax                    cmp    -0x4(%rbp),%eax
24. jne    0x10000178a <TEST3+90>             jne    0x10000180a <TEST4+90>
25. TempI += 1;                               TempI += 1;
26. inc    %edi                               inc    %edi
27. I := TempI; // Faster                     I := I + 1; // Slower
28. mov    %edi,-0x4(%rbp)                    incl   -0x4(%rbp)
29. until I = Length(Value);                  until I = Length(Value);
30. movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
31. lea    0x1(%rsi),%rdx                     lea    0x1(%rsi),%rdx
32. cmp    %rdx,%rax                          cmp    %rdx,%rax
33. jne    0x10000175c <TEST3+44>             jne    0x1000017dc <TEST4+44>
34. until Check(I); // until True;            until Check(I); // until True; makes it as fast as test3
35. lea    -0x4(%rbp),%rcx                    lea    -0x4(%rbp),%rcx
36. callq  0x100001720 <CHECK>                callq  0x100001720 <CHECK>
37. test   %al,%al                            test   %al,%al
38. je     0x100001759 <TEST3+41>             je     0x1000017d9 <TEST4+41>
39. end;                                      end;
40. mov    %r12d,%eax                         mov    %r12d,%eax
41. mov    -0x28(%rbp),%rbx                   mov    -0x28(%rbp),%rbx
42. mov    -0x20(%rbp),%rdi                   mov    -0x20(%rbp),%rdi
43. mov    -0x18(%rbp),%rsi                   mov    -0x18(%rbp),%rsi
44. mov    -0x10(%rbp),%r12                   mov    -0x10(%rbp),%r12
45. lea    0x0(%rbp),%rsp                     lea    0x0(%rbp),%rsp
46. pop    %rbp                               pop    %rbp
47. retq                                      retq
48.

#### BrunoK

• Sr. Member
• Posts: 288
• Retired programmer
##### Re: Optimizing the counter code
« Reply #11 on: May 17, 2022, 06:06:24 pm »
As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

#### PascalDragon

• Hero Member
• Posts: 4008
• Compiler Developer
##### Re: Optimizing the counter code
« Reply #12 on: May 18, 2022, 09:07:24 am »
As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.

#### BrunoK

• Sr. Member
• Posts: 288
• Retired programmer
##### Re: Optimizing the counter code
« Reply #13 on: May 18, 2022, 11:01:33 am »
It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.
I was so much expecting push's and pop's that I didn't think about moving to and restoring from extended stack space ...  (sigh)

#### MathMan

• Sr. Member
• Posts: 256
##### Re: Optimizing the counter code
« Reply #14 on: May 18, 2022, 12:58:45 pm »
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Cheers,
MathMan