Recent

Author Topic: Optimizing the counter code  (Read 3704 times)

Okoba

  • Hero Member
  • *****
  • Posts: 528
Optimizing the counter code
« on: May 14, 2022, 08:21:56 am »
Hello,

I have a simple code checking for a value.
In the first test, it does everything in one go, but I want to update the counter an skip some values and report status.
If I do the update of "I" in the function, everything is fine, but If I call a function to check even if inline, because of "var" parameter, FPC does not optimize the code.
I like to know if this is a proper response of FPC or it is an optimization missed.

This is a very simplified case of the problem, in the real work, I need to change Value of I depending the results. So I can not do it in the main function, and need to call another function for code maintainability.

Tested on Trunk FPC for Win64.

Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. uses
  4.   Classes,
  5.   SysUtils;
  6.  
  7.   function Test1(const Value: array of Integer): Int64;
  8.   var
  9.     I: Integer;
  10.   begin
  11.     Result := 0;
  12.  
  13.     I := 0;
  14.     repeat
  15.       repeat
  16.         if Odd(Value[I]) then
  17.           Result += 1;
  18.         I += 1;
  19.       until I = Length(Value);
  20.     until True;
  21.   end;
  22.  
  23.   function Check(var I: Integer): Boolean; inline;
  24.   begin
  25.     Result := True;
  26.   end;
  27.  
  28.   function Test2(const Value: array of Integer): Int64;
  29.   var
  30.     I: Integer;
  31.   begin
  32.     Result := 0;
  33.  
  34.     I := 0;
  35.     repeat
  36.       repeat
  37.         if Odd(Value[I]) then
  38.           Result += 1;
  39.         I += 1;
  40.       until I = Length(Value);
  41.     until Check(I);
  42.   end;
  43.  
  44.   function Test3(const Value: array of Integer): Int64;
  45.   var
  46.     I, TempI: Integer;
  47.   begin
  48.     Result := 0;
  49.  
  50.     I := 0;
  51.     repeat
  52.       TempI := I;
  53.       repeat
  54.         if Odd(Value[TempI]) then
  55.           Result += 1;
  56.         TempI += 1;
  57.       until TempI = Length(Value);
  58.       I := TempI;
  59.     until Check(I);
  60.   end;
  61.  
  62. var
  63.   Value: array of Integer;
  64.   Tick: QWord;
  65. begin
  66.   SetLength(Value, 1000000000);
  67.  
  68.   Tick := GetTickCount64;
  69.   Test1(Value);
  70.   WriteLn(GetTickCount64 - Tick);
  71.  
  72.   Tick := GetTickCount64;
  73.   Test2(Value);
  74.   WriteLn(GetTickCount64 - Tick);
  75.  
  76.   Tick := GetTickCount64;
  77.   Test3(Value);
  78.   WriteLn(GetTickCount64 - Tick);
  79.  
  80.   ReadLn;
  81. end.
  82.  

In this test, Test2 is two times slower as FPC does not use register for I. In Test3, I made a Temp variant to tricked FPC to use register.

jamie

  • Hero Member
  • *****
  • Posts: 6077
Re: Optimizing the counter code
« Reply #1 on: May 14, 2022, 04:04:23 pm »
You could use the "If Value and 1 <> 0 Then" instead of using the intrinsic ODD function.

for a simple test, using the logic as indicated above and using older fpc 3.0.4 it generates a TEST asm after the AND instruction?

 This is kind of extra code that isn't needed because the AND instruction already sets the flags that can be used in the branch statement following.

with fpc 3.2.0 I see that doing a simple AND logic like that does not generate that extra TEST asm but the intrinsic ODD still does.

Code: Pascal  [Select][+][-]
  1. 000000010002C9A0 488d6424d8               lea    -0x28(%rsp),%rsp
  2. unit1.pas:36                              If V and 1 <> 0 THen beep;
  3. 000000010002C9A5 83e001                   and    $0x1,%eax
  4. 000000010002C9A8 85c0                     test   %eax,%eax
  5. 000000010002C9AA 7405                     je     0x10002c9b1 <BUTTON1CLICK+17>
  6.  

Notice the unneeded TEST asm instruction.

with 3.2.0 this still happens with the ODD function.

The only true wisdom is knowing you know nothing

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Optimizing the counter code
« Reply #2 on: May 15, 2022, 07:09:29 am »
Yes checking the oddness can be done faster. But the problem is loop speed not odd. Checking odd is just a thing to do in the test code.

440bx

  • Hero Member
  • *****
  • Posts: 3921
Re: Optimizing the counter code
« Reply #3 on: May 15, 2022, 08:18:41 am »
but If I call a function to check even if inline, because of "var" parameter, FPC does not optimize the code.
I like to know if this is a proper response of FPC or it is an optimization missed.
Inspection of the generated assembly code shows that the function call is not the culprit.

The problem is related to register allocation.  For whatever reason, the compiler decided not to use a register for I but, it used a register for TempI.  The additional memory accesses seem to be the reason for Test2 being slower. 

Here is the annotated assembly code:
Code: ASM  [Select][+][-]
  1. .section .text.n_p$project1_$$_test3$array_of_longint$$int64,"x"   .section .text.n_p$project1_$$_test2$array_of_longint$$int64,"x"
  2.         .balign 16,0x90                                                 .balign 16,0x90
  3. .globl  P$PROJECT1_$$_TEST3$array_of_LONGINT$$INT64                .globl       P$PROJECT1_$$_TEST2$array_of_LONGINT$$INT64
  4. P$PROJECT1_$$_TEST3$array_of_LONGINT$$INT64:                       P$PROJECT1_$$_TEST2$array_of_LONGINT$$INT64:
  5. .Lc13:                                                             .Lc8:
  6. .seh_proc P$PROJECT1_$$_TEST3$array_of_LONGINT$$INT64              .seh_proc P$PROJECT1_$$_TEST2$array_of_LONGINT$$INT64
  7. .Ll22:                                                             .Ll13:
  8. # [49] begin                                                       # [33] begin
  9.         pushq   %rbp                                                    pushq   %rbp
  10. .seh_pushreg %rbp                                                  .seh_pushreg %rbp
  11. .Lc15:                                                             .Lc10:
  12. .Lc16:                                                             .Lc11:
  13.         movq    %rsp,%rbp                                               movq    %rsp,%rbp
  14. .Lc17:                                                             .Lc12:
  15.         leaq    -16(%rsp),%rsp                                          leaq    -16(%rsp),%rsp
  16. .seh_stackalloc 16                                                 .seh_stackalloc 16
  17. # Var Value located in register rcx                                # Var Value located in register rcx
  18. # Var $highVALUE located in register rdx                           # Var $highVALUE located in register rdx
  19. # Var $result located in register rax                              # Var $result located in register rax
  20. # Var TempI located in register r9
  21. .seh_endprologue                                                   .seh_endprologue
  22. # Var I located at rbp-8, size=OS_S64                              # Var I located at rbp-8, size=OS_S64
  23. # Var $result located in register rax                              # Var $result located in register rax
  24. .Ll23:                                                             .Ll14:
  25. # [50] Result := 0;                                                # [34] Result := 0;
  26.         movq    $0,%rax                                                 movq    $0,%rax
  27. .Ll24:                                                             .Ll15:
  28. # [52] I := 0;                                                     # [36] I := 0;
  29.         movq    $0,-8(%rbp)                                             movq    $0,-8(%rbp)
  30. .Ll25:                                                             .Lj40:
  31. # [54] TempI := I;                                                 .Ll16:
  32.         movq    -8(%rbp),%r9  { move I to r9 }
  33. .Lj66:
  34. .Ll26:
  35. # [56] if Odd(Value[TempI]) then                                   # [39] if Odd(Value[I]) then
  36.                                                                         movq    -8(%rbp),%r8       { move I to r8 }
  37.         leaq    (%rcx,%r9,4),%r8                                        movl    (%rcx,%r8,4),%r8d
  38.         movl    (%r8),%r8d
  39.         andl    $1,%r8d                                                 andl    $1,%r8d
  40.         testb   %r8b,%r8b                                               testb   %r8b,%r8b
  41.         je      .Lj70                                                   je      .Lj44
  42. .Ll27:                                                             .Ll17:
  43. # [57] Result += 1;                                                # [40] Result += 1;
  44.         leaq    1(%rax),%r8                                             leaq    1(%rax),%r8
  45.         movq    %r8,%rax                                                movq    %r8,%rax
  46. .Lj70:                                                             .Lj44:
  47. .Ll28:                                                             .Ll18:
  48. # [58] TempI += 1;                                                 # [41] I += 1;
  49.                                                                         movq    -8(%rbp),%r8       { additonal move }
  50.         leaq    1(%r9),%r8                                              leaq    1(%r8),%r8
  51.         movq    %r8,%r9                                                 movq    %r8,-8(%rbp)       { slower - 1 }
  52. .Ll29:                                                             .Ll19:
  53. # [59] until TempI = Length(Value);                                # [42] until I = Length(Value);
  54.         leaq    1(%rdx),%r8                                             leaq    1(%rdx),%r8
  55.         cmpq    %r9,%r8                                                 cmpq    -8(%rbp),%r8       { slower }
  56.         jne     .Lj66                                                   jne     .Lj40
  57. .Ll30:                                                             .Ll20:
  58. # [60] I := TempI;
  59.         movq    %r9,-8(%rbp)   { match 1 }
  60. .Ll31:
  61. # [62] end;                                                        # [44] end;
  62.         leaq    (%rbp),%rsp                                             leaq    (%rbp),%rsp  { no call to Check since it does nothing }
  63.         popq    %rbp                                                    popq    %rbp
  64.         ret                                                             ret
  65. .seh_endproc                                                       .seh_endproc
  66. .Lc14:
  67. .Lt5:
  68. .Ll32:
  69.  
  70.  
Note: in the program I changed the local variables from Integer to ptrint to see if that would make a difference.  The assembly code shown above is the result of optimization -O4.

Notice that the compiler completely eliminated the call to the Check function.  Not only it inlined it, it got rid of it, as it should have since it always returns TRUE therefore it has no effect on the "repeat" statement that calls it.

HTH.

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

jamie

  • Hero Member
  • *****
  • Posts: 6077
Re: Optimizing the counter code
« Reply #4 on: May 15, 2022, 12:11:45 pm »
Hmm

 You don't think it's strange the compiler is generating redundant TEST asm instruction following a AND instruction ?

 the AND asm instruction already sets up the flags in the register so why is it being tested again ?

 if you perform the AND inline with the user code with newer versions you'll see the TEST asm is removed and only the AND is needed but use the ODD function and it then adds the redundant TEST asm instruction.

 I know the intel processor leaves a lot to be desired when it comes to flag settings after an operation. It's not like the old 6502 or related processors where almost every ASM instruction would set a flag after the fact so a simple branch could be used immediately, this saves space and processing time.
The only true wisdom is knowing you know nothing

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Optimizing the counter code
« Reply #5 on: May 15, 2022, 02:27:50 pm »
I think it is an optimization opportunity (especially as I tested with gcc and it optimizes the both case), but I welcome more input.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Optimizing the counter code
« Reply #6 on: May 15, 2022, 04:53:04 pm »
FPC 3.2.2 Win_x86_64

I have the same time discrepancy using -O1 -OoREGVAR.

Something seems not correct in evaluating register allocation in Test2, "I" is being accessed on in stacks temporary allocation position.
In Test1 and Test3, "i" and "TempI" are register cached/allocated for the repeat loop.

Given the big difference in timings (1:>2) it might be a good idea that point to be checked (and maybe discarded if something wrong can be pointed in the code) by someone qualified.


Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Optimizing the counter code
« Reply #7 on: May 15, 2022, 05:55:12 pm »
in this case this is 2X slower. But in my real case use (that I simplified for this forum post to show the point), it slows down the code 4X.
As it seems a real clear optimization point.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Re: Optimizing the counter code
« Reply #8 on: May 16, 2022, 02:09:48 pm »
The problem is likely the following (not verified): Due to the var parameter of the Check function the compiler needs to be able to take the address of the I variable. This flag prohibits that the variable can reside in a register. The code of the Check-function is then inlined and the need for taking the address of the variable is gone, but the flag that prohibits the register optimization has not been reset and thus worse code is generated.

Please report a bug.

Okoba

  • Hero Member
  • *****
  • Posts: 528

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Optimizing the counter code
« Reply #10 on: May 17, 2022, 05:55:45 pm »
Program with comments that tries to put in evidence the problem. To be compiled with -O1 -OoREGVAR
Code: Pascal  [Select][+][-]
  1. program pgmLoopReg220517C;
  2.  
  3. { https://forum.lazarus.freepascal.org/index.php/topic,59340.0.html?PHPSESSID=l1vtb3loamsjbth1c8l3lm80a7 }
  4.  
  5. uses
  6.   Classes, SysUtils;
  7.  
  8. { FPC 3.2.2 Win_x86_64 (also in Win_i386)
  9.  
  10.   Compiling using
  11.   -O0 -OoREGVAR (no optimization + optimize for register variables) optimizes
  12.                 effectively for both Test3 and Test4 register allocation, except "I"
  13.                 is not register cached for both methods.
  14.                 Interestingly, in this case Test4 is much faster than when
  15.                 compiled with -O1 -OoREGVAR but generated code is not the same
  16.                 for I := I + 1 (and most of the rest of the code).
  17.   -O1 -OoREGVAR Same register allocations for Test3 and Test4 but I := I + 1
  18.                 generates "incl   -0x4(%rbp)" that seems ununderstandably and
  19.                 dramaticaly slow.
  20.                 Test4 takes >2 times Test3 time for a basically simple
  21.                 (and common looking) loop.
  22.  
  23.   Removing the var in Check() makes "I" fully register cached.
  24. }
  25.  
  26.   function Check(var I: integer): boolean;
  27.   begin
  28.     Result := True;
  29.   end;
  30.  
  31.   function Test3(const Value: array of integer): integer;
  32.   var
  33.     I, TempI: integer;
  34.   begin
  35.     Result := 0;
  36.  
  37.     I := 0;
  38.     repeat
  39.       TempI := I;
  40.       repeat
  41.         Inc(Result);
  42.         if Value[I] <> I then // Do something apparently useful
  43.           Exit;
  44.         TempI += 1;
  45.         // I := I + 1; // Slower
  46.         I := TempI; // Faster
  47.       until I = Length(Value);
  48.     until Check(I); // until True;
  49.   end;
  50.  
  51.   function Test4(const Value: array of integer): integer;
  52.   var
  53.     I, TempI: integer;
  54.     lTmpInt : integer = 1;
  55.   begin
  56.     Result := 0;
  57.  
  58.     I := 0;
  59.     repeat
  60.       TempI := I;
  61.       repeat
  62.         Inc(Result);
  63.         if Value[I] <> I then // Do something apparently useful
  64.           Exit;
  65.         TempI += 1;
  66.         I := I + 1;    // Slower
  67.         // I := TempI; // Faster
  68.       until I = Length(Value);
  69.     until Check(I); // until True; makes it as fast as test3
  70.   end;
  71.  
  72. var
  73.   Value: array of integer;
  74.   Tick: QWord;
  75.   vRes: integer;
  76.   i: integer;
  77. begin
  78.   SetLength(Value, 100000000{0});
  79.   for i := 0 to Length(Value) - 1 do // Init array elements
  80.     Value[i] := i;
  81.  
  82.   Tick := GetTickCount64;
  83.   vRes := Test3(Value);
  84.   WriteLn('Faster Test3 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
  85.  
  86.   Tick := GetTickCount64;
  87.   vRes := Test4(Value);
  88.   WriteLn('Slow   Test4 : ': 20, GetTickCount64 - Tick: 6, vRes: 12);
  89.  
  90.   ReadLn;
  91. end.

Side by side Test3 and Test4
Code: Pascal  [Select][+][-]
  1. { Test3 }                                 { Test4 }
  2. begin                                     begin
  3. push   %rbp                               push   %rbp
  4. mov    %rsp,%rbp                          mov    %rsp,%rbp
  5. lea    -0x50(%rsp),%rsp                   lea    -0x50(%rsp),%rsp
  6. mov    %rbx,-0x28(%rbp)                   mov    %rbx,-0x28(%rbp)
  7. mov    %rdi,-0x20(%rbp)                   mov    %rdi,-0x20(%rbp)
  8. mov    %rsi,-0x18(%rbp)                   mov    %rsi,-0x18(%rbp)
  9. mov    %r12,-0x10(%rbp)                   mov    %r12,-0x10(%rbp)
  10. mov    %rcx,%rbx                          mov    %rcx,%rbx
  11. mov    %rdx,%rsi                          mov    %rdx,%rsi
  12. Result := 0;                              Result := 0;
  13. xor    %r12d,%r12d                        xor    %r12d,%r12d
  14. I := 0;                                   I := 0;
  15. movl   $0x0,-0x4(%rbp)                    movl   $0x0,-0x4(%rbp)
  16. TempI := I;                               TempI := I;
  17. mov    -0x4(%rbp),%edi                    mov    -0x4(%rbp),%edi
  18. Inc(Result);                              Inc(Result);
  19. inc    %r12d                              inc    %r12d
  20. if Value[I] <> I then                     if Value[I] <> I then
  21. movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
  22. mov    (%rbx,%rax,4),%eax                 mov    (%rbx,%rax,4),%eax
  23. cmp    -0x4(%rbp),%eax                    cmp    -0x4(%rbp),%eax
  24. jne    0x10000178a <TEST3+90>             jne    0x10000180a <TEST4+90>
  25. TempI += 1;                               TempI += 1;
  26. inc    %edi                               inc    %edi
  27. I := TempI; // Faster                     I := I + 1; // Slower
  28. mov    %edi,-0x4(%rbp)                    incl   -0x4(%rbp)
  29. until I = Length(Value);                  until I = Length(Value);
  30. movslq -0x4(%rbp),%rax                    movslq -0x4(%rbp),%rax
  31. lea    0x1(%rsi),%rdx                     lea    0x1(%rsi),%rdx
  32. cmp    %rdx,%rax                          cmp    %rdx,%rax
  33. jne    0x10000175c <TEST3+44>             jne    0x1000017dc <TEST4+44>
  34. until Check(I); // until True;            until Check(I); // until True; makes it as fast as test3
  35. lea    -0x4(%rbp),%rcx                    lea    -0x4(%rbp),%rcx
  36. callq  0x100001720 <CHECK>                callq  0x100001720 <CHECK>
  37. test   %al,%al                            test   %al,%al
  38. je     0x100001759 <TEST3+41>             je     0x1000017d9 <TEST4+41>
  39. end;                                      end;
  40. mov    %r12d,%eax                         mov    %r12d,%eax
  41. mov    -0x28(%rbp),%rbx                   mov    -0x28(%rbp),%rbx
  42. mov    -0x20(%rbp),%rdi                   mov    -0x20(%rbp),%rdi
  43. mov    -0x18(%rbp),%rsi                   mov    -0x18(%rbp),%rsi
  44. mov    -0x10(%rbp),%r12                   mov    -0x10(%rbp),%r12
  45. lea    0x0(%rbp),%rsp                     lea    0x0(%rbp),%rsp
  46. pop    %rbp                               pop    %rbp
  47. retq                                      retq
  48.  

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Optimizing the counter code
« Reply #11 on: May 17, 2022, 06:06:24 pm »
As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Re: Optimizing the counter code
« Reply #12 on: May 18, 2022, 09:07:24 am »
As for optimization, putting the register caches back to the stack prior to exiting the procedure does not seem very useful, maybe I do not grasp the reason for doing it.

It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Optimizing the counter code
« Reply #13 on: May 18, 2022, 11:01:33 am »
It's AT&T assembler, so the movement is left-to-right (not right-to-left like in Intel assembler), thus it does not put the “register caches back to the stack”, but it restores the non-volatile registers that were modified.
I was so much expecting push's and pop's that I didn't think about moving to and restoring from extended stack space ...  (sigh)

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Optimizing the counter code
« Reply #14 on: May 18, 2022, 12:58:45 pm »
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Cheers,
MathMan

 

TinyPortal © 2005-2018