Recent

Author Topic: Optimizing the counter code  (Read 2098 times)

BrunoK

  • Sr. Member
  • ****
  • Posts: 298
  • Retired programmer
Re: Optimizing the counter code
« Reply #15 on: May 18, 2022, 03:39:06 pm »
@MathMan
Replacing
Code: Pascal  [Select][+][-]
  1.         I := I + 1;
by
Code: Pascal  [Select][+][-]
  1.         TempI := I;
  2.         Inc(TempI);
  3.         I := TempI;
makes Test4 nearly as fast as Test3.

Its like using %rax (or %eax) would be 'overheating' of being used :-)

PascalDragon

  • Hero Member
  • *****
  • Posts: 4123
  • Compiler Developer
Re: Optimizing the counter code
« Reply #16 on: May 19, 2022, 08:52:17 am »
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

MathMan

  • Sr. Member
  • ****
  • Posts: 262
Re: Optimizing the counter code
« Reply #17 on: May 19, 2022, 12:24:52 pm »
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

PascalDragon

  • Hero Member
  • *****
  • Posts: 4123
  • Compiler Developer
Re: Optimizing the counter code
« Reply #18 on: May 19, 2022, 01:45:16 pm »
I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

Please note that Okoba already reported the initial problem here. So you might want to add your observations there instead of adding a completely new bug report.

BrunoK

  • Sr. Member
  • ****
  • Posts: 298
  • Retired programmer
Re: Optimizing the counter code
« Reply #19 on: May 19, 2022, 04:05:15 pm »
Simplified project that keeps what I think being relevant.
3 slightly different Test routines that show alternatives.

Timings Win64 :
Code: Pascal  [Select][+][-]
  1.                    ------------ -O1 -OoREGVAR------------------
  2.                  Ticks  Iterations                       Timings
  3.  Slow   Test2 :    219    99999999                    223.007 ms
  4. Fastest Test3 :     31    99999999                     31.585 ms
  5. Faster  Test4 :     63    99999999                     60.238 ms

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 10074
  • FPC developer.
Re: Optimizing the counter code
« Reply #20 on: May 19, 2022, 05:24:31 pm »
I tried with FPC 3.2.2 and trunk for win32. In both cases test2 and test4 are about equal, while test3 is about 3.5 times faster on an old Ivy Bridge (core i7 3770), which is not a -lake.

Code: [Select]
3.3.1:
     Slow   Test2 :     15     9999999                     18,255 ms
    Fastest Test3 :      0     9999999                      5,339 ms
    Faster  Test4 :     16     9999999                     18,452 ms


Code: [Select]
3.2.2:
     Slow   Test2 :     15     9999999                     18,297 ms
    Fastest Test3 :      0     9999999                      5,390 ms
    Faster  Test4 :     16     9999999                     18,520 ms

Okoba

  • Sr. Member
  • ****
  • Posts: 304
Re: Optimizing the counter code
« Reply #21 on: May 19, 2022, 05:51:33 pm »
Results on Trunk FPC and Lazarus, on i9-9900K
Quote
-O0 -OoREGVAR
    Slow   Test2 :     16     9999999                     16.528 ms
    Fastest Test3 :      0     9999999                      8.424 ms
    Faster  Test4 :     15     9999999                     15.142 ms

-O1 -OoREGVAR
     Slow   Test2 :     15     9999999                     18.844 ms
    Fastest Test3 :      0     9999999                      4.232 ms
    Faster  Test4 :     16     9999999                     15.830 ms

-O2 -OoREGVAR
     Slow   Test2 :     15     9999999                     18.834 ms
    Fastest Test3 :      0     9999999                      4.430 ms
    Faster  Test4 :     16     9999999                     16.207 ms

-O3 -OoREGVAR
     Slow   Test2 :     16     9999999                     18.702 ms
    Fastest Test3 :      0     9999999                      4.311 ms
    Faster  Test4 :     15     9999999                     15.903 ms

-O4 -OoREGVAR
     Slow   Test2 :     16     9999999                     18.435 ms
    Fastest Test3 :      0     9999999                      4.333 ms
    Faster  Test4 :     15     9999999                     15.913 ms

Lazarus Default Release Mode -OoREGVAR
     Slow   Test2 :     16     9999999                     18.525 ms
    Fastest Test3 :      0     9999999                      2.695 ms
    Faster  Test4 :     15     9999999                     16.014 ms

Lazarus Default Release Mode
     Slow   Test2 :     16     9999999                     19.037 ms
    Fastest Test3 :      0     9999999                      2.696 ms
    Faster  Test4 :     15     9999999                     16.096 ms

BrunoK

  • Sr. Member
  • ****
  • Posts: 298
  • Retired programmer
Re: Optimizing the counter code
« Reply #22 on: May 19, 2022, 06:37:48 pm »
My system is
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz   2.42 GHz

Those benchmarking issues are (always, it was easier in the dual core epoch)  very frustrating.

I downloaded and retested the application I posted on this forum and Test4 is still much faster than Test2.

Was there a bug in the application I published ?
 

MathMan

  • Sr. Member
  • ****
  • Posts: 262
Re: Optimizing the counter code
« Reply #23 on: May 20, 2022, 11:55:02 am »
@marcov:
Your test made me unsure, whether I was tricked by false memory about the issue in general. So I re-read the official Intel assembler optimization guide. Indeed the issue exists (as I remembered) - it is mentioned as "dense RMW issue" for Sandy-Bridge architecture (and former). The point is that the Loop-Streming-Detector has issues with the amount of micro-ops generated by these instruction types in dense loops.

It was however lifted (if I understood the docs correct, they are a bit vague in that respect) at least with Haswell ongoing, maybe even with Ivy Bridge.

@Okoba:
Thanks for testing again. What is puzzling me is that Lazarus default release mode is generating faster code, than e.g. -O4 -OoREGVAR. Did you compile with e.g. Range Checks enabled when not in Lazarus default release mode?

@BrunoK:
Could it be that you are testing on a Laptop? If so benchmarks can vary a lot, if not done with extreme pre-caution like fixing the clock, binding to a specific core, executing initial excessive warm-up code etc.

Takeaway - I'll do some experiments on my systems (one Nehalem, one Skylake) and see if I can pinpoint this down to a reproducible case. If so I'll add this to Okobas case in the bug tracker. Otherwise I'll do some heavy backpaddling to save face  ;)

Cheers,
MathMan

Okoba

  • Sr. Member
  • ****
  • Posts: 304
Re: Optimizing the counter code
« Reply #24 on: May 20, 2022, 02:55:36 pm »
@MathMan, I did not change anything with the release mode except removing -OoREGVAR from the project. $RangeChecks seems off.

Quote
Lazarus Default Release Mode $RangeChecks ON
     Slow   Test2 :     16     9999999                     18.504 ms
    Fastest Test3 :      0     9999999                      2.749 ms
    Faster  Test4 :     15     9999999                     16.129 ms

Lazarus Default Release Mode $RangeChecks OFF
     Slow   Test2 :     15     9999999                     18.579 ms
    Fastest Test3 :      0     9999999                      2.703 ms
    Faster  Test4 :     16     9999999                     16.148 ms

bytebites

  • Sr. Member
  • ****
  • Posts: 481
Re: Optimizing the counter code
« Reply #25 on: May 21, 2022, 11:35:30 am »
Amd Rysen 9 3900X says (Fpc 3.3.1):
Quote
O1
     Slow   Test2 :     21     9999999                     21.020 ms
    Fastest Test3 :      4     9999999                      3.708 ms
    Faster  Test4 :     19     9999999                     18.765 ms
O3
     Slow   Test2 :     10     9999999                      9.786 ms
    Fastest Test3 :      4     9999999                      3.496 ms
    Faster  Test4 :      4     9999999                      4.301 ms
« Last Edit: May 21, 2022, 12:17:31 pm by bytebites »

BrunoK

  • Sr. Member
  • ****
  • Posts: 298
  • Retired programmer
Re: Optimizing the counter code
« Reply #26 on: May 21, 2022, 03:51:55 pm »
A synthesis of what the optimization levels are can be found in the wiki page :
https://wiki.freepascal.org/Optimization#Optimization_Switches
Note that anything compiled with O2 and above include, if not disabled, the REGVAR option by default.

AFAIK $RangeChecks is not relevant since the array was removed to concentrate on the loop and register optimization.

Given the different results for different processors there is probably NOT MUCH ROOM for a general optimization strategy for the complete palette of i386 / x86_64 machines. My laptop that is an Intel i5-1135G7 @ 2.40GHz does have different ratio between tests from a desktop that is an i3-6100 CPU @ 3.70GHz 3.70 GHz.

bytebites’s Amd Rysen 9 3900X in O3 shows the same ratio as my laptop but other testers have ratios in the same range as my desktop.

One must notice that the initial code involved a var parameter in the check(i) function that prevented I to be register cached. The function was inline'd and apparently the compiler managed to eliminate that code in mode O4.

My opinion is that with modern processors, speed improvement in program code becomes significant only when there is at least a 15 % difference across multiple runs and methods order call.


 

TinyPortal © 2005-2018