Optimizing the counter code

BrunoK

Sr. Member
Posts: 452
Retired programmer

Re: Optimizing the counter code

« Reply #15 on: May 18, 2022, 03:39:06 pm »

@MathMan
Replacing

        I := I + 1;

Code: Pascal [Select][+]

        TempI := I;
        Inc(TempI);
        I := TempI;

makes Test4 nearly as fast as Test3.

Its like using %rax (or %eax) would be 'overheating' of being used :-)

Logged

PascalDragon

Hero Member
Posts: 5486
Compiler Developer

Re: Optimizing the counter code

« Reply #16 on: May 19, 2022, 08:52:17 am »

Quote from: MathMan on May 18, 2022, 12:58:45 pm

@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

Logged

MathMan

Sr. Member
Posts: 325

Re: Optimizing the counter code

« Reply #17 on: May 19, 2022, 12:24:52 pm »

Quote from: PascalDragon on May 19, 2022, 08:52:17 am

Quote from: MathMan on May 18, 2022, 12:58:45 pm
@BrunoK, @PascalDragon

Looking carefully at the assembler excerpt comparing Test3 & Test4 routines I see

* in both cases var I is handled on the stack (always referenced as -0x4(%rbp))
* Test4 however does "incl -0x4(%rbp)" for Pascal "I := I+1" <= see lines 27-28 in the comparative assembler output

The latter one (Test4) is a problematic instruction for many Intel Core variants (iirc it is only gracefully handled since xxxLake) as it generates a stall in the instruction pipeline. Test3 instead does a write/read to 0x4(%rbp) which is handeld ok by the majority of Intel Core variants.

Maybe it is better to tune the optimizer to avoid such instructions and generally replace with a slightly longer "movl mem, reg / incl reg / movl reg, mem" sequence?

Someone report this with examples, please, and probably someone (like J. Gareth Moreton) will take a look at it.

I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

Logged

PascalDragon

Hero Member
Posts: 5486
Compiler Developer

Re: Optimizing the counter code

« Reply #18 on: May 19, 2022, 01:45:16 pm »

Quote from: MathMan on May 19, 2022, 12:24:52 pm

I'll try to address this. But please bear with me, as I haven't done this before and my home is an Internet-free zone - so I'll have some fiddling ahead of me ;-)

Please note that Okoba already reported the initial problem here. So you might want to add your observations there instead of adding a completely new bug report.

Logged

BrunoK

Sr. Member
Posts: 452
Retired programmer

Re: Optimizing the counter code

« Reply #19 on: May 19, 2022, 04:05:15 pm »

Simplified project that keeps what I think being relevant.
3 slightly different Test routines that show alternatives.

Timings Win64 :

Code: Pascal [Select][+]

                   ------------ -O1 -OoREGVAR------------------ 
                 Ticks  Iterations                       Timings
 Slow   Test2 :    219    99999999                    223.007 ms
Fastest Test3 :     31    99999999                     31.585 ms
Faster  Test4 :     63    99999999                     60.238 ms

pgmloopreg220519.zip (2.95 kB - downloaded 24 times.)

Logged

Okoba

Hero Member
Posts: 533

Re: Optimizing the counter code

« Reply #21 on: May 19, 2022, 05:51:33 pm »

Results on Trunk FPC and Lazarus, on i9-9900K

Quote

-O0 -OoREGVAR
Slow Test2 : 16 9999999 16.528 ms
Fastest Test3 : 0 9999999 8.424 ms
Faster Test4 : 15 9999999 15.142 ms

-O1 -OoREGVAR
Slow Test2 : 15 9999999 18.844 ms
Fastest Test3 : 0 9999999 4.232 ms
Faster Test4 : 16 9999999 15.830 ms

-O2 -OoREGVAR
Slow Test2 : 15 9999999 18.834 ms
Fastest Test3 : 0 9999999 4.430 ms
Faster Test4 : 16 9999999 16.207 ms

-O3 -OoREGVAR
Slow Test2 : 16 9999999 18.702 ms
Fastest Test3 : 0 9999999 4.311 ms
Faster Test4 : 15 9999999 15.903 ms

-O4 -OoREGVAR
Slow Test2 : 16 9999999 18.435 ms
Fastest Test3 : 0 9999999 4.333 ms
Faster Test4 : 15 9999999 15.913 ms

Lazarus Default Release Mode -OoREGVAR
Slow Test2 : 16 9999999 18.525 ms
Fastest Test3 : 0 9999999 2.695 ms
Faster Test4 : 15 9999999 16.014 ms

Lazarus Default Release Mode
Slow Test2 : 16 9999999 19.037 ms
Fastest Test3 : 0 9999999 2.696 ms
Faster Test4 : 15 9999999 16.096 ms

Logged

BrunoK

Sr. Member
Posts: 452
Retired programmer

Re: Optimizing the counter code

« Reply #22 on: May 19, 2022, 06:37:48 pm »

My system is
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 2.42 GHz

Those benchmarking issues are (always, it was easier in the dual core epoch) very frustrating.

I downloaded and retested the application I posted on this forum and Test4 is still much faster than Test2.

Was there a bug in the application I published ?

Logged

MathMan

Sr. Member
Posts: 325

Re: Optimizing the counter code

« Reply #23 on: May 20, 2022, 11:55:02 am »

@marcov:
Your test made me unsure, whether I was tricked by false memory about the issue in general. So I re-read the official Intel assembler optimization guide. Indeed the issue exists (as I remembered) - it is mentioned as "dense RMW issue" for Sandy-Bridge architecture (and former). The point is that the Loop-Streming-Detector has issues with the amount of micro-ops generated by these instruction types in dense loops.

It was however lifted (if I understood the docs correct, they are a bit vague in that respect) at least with Haswell ongoing, maybe even with Ivy Bridge.

@Okoba:
Thanks for testing again. What is puzzling me is that Lazarus default release mode is generating faster code, than e.g. -O4 -OoREGVAR. Did you compile with e.g. Range Checks enabled when not in Lazarus default release mode?

@BrunoK:
Could it be that you are testing on a Laptop? If so benchmarks can vary a lot, if not done with extreme pre-caution like fixing the clock, binding to a specific core, executing initial excessive warm-up code etc.

Takeaway - I'll do some experiments on my systems (one Nehalem, one Skylake) and see if I can pinpoint this down to a reproducible case. If so I'll add this to Okobas case in the bug tracker. Otherwise I'll do some heavy backpaddling to save face

Cheers,
MathMan

Logged

Okoba

Hero Member
Posts: 533

Re: Optimizing the counter code

« Reply #24 on: May 20, 2022, 02:55:36 pm »

@MathMan, I did not change anything with the release mode except removing -OoREGVAR from the project. $RangeChecks seems off.

Quote

Lazarus Default Release Mode $RangeChecks ON
Slow Test2 : 16 9999999 18.504 ms
Fastest Test3 : 0 9999999 2.749 ms
Faster Test4 : 15 9999999 16.129 ms

Lazarus Default Release Mode $RangeChecks OFF
Slow Test2 : 15 9999999 18.579 ms
Fastest Test3 : 0 9999999 2.703 ms
Faster Test4 : 16 9999999 16.148 ms

Logged

bytebites

Hero Member
Posts: 642

Re: Optimizing the counter code

« Reply #25 on: May 21, 2022, 11:35:30 am »

Amd Rysen 9 3900X says (Fpc 3.3.1):

Quote

O1
Slow Test2 : 21 9999999 21.020 ms
Fastest Test3 : 4 9999999 3.708 ms
Faster Test4 : 19 9999999 18.765 ms
O3
Slow Test2 : 10 9999999 9.786 ms
Fastest Test3 : 4 9999999 3.496 ms
Faster Test4 : 4 9999999 4.301 ms

« Last Edit: May 21, 2022, 12:17:31 pm by bytebites »

Logged

BrunoK

Sr. Member
Posts: 452
Retired programmer

Re: Optimizing the counter code

« Reply #26 on: May 21, 2022, 03:51:55 pm »

A synthesis of what the optimization levels are can be found in the wiki page :
https://wiki.freepascal.org/Optimization#Optimization_Switches
Note that anything compiled with O2 and above include, if not disabled, the REGVAR option by default.

AFAIK $RangeChecks is not relevant since the array was removed to concentrate on the loop and register optimization.

Given the different results for different processors there is probably NOT MUCH ROOM for a general optimization strategy for the complete palette of i386 / x86_64 machines. My laptop that is an Intel i5-1135G7 @ 2.40GHz does have different ratio between tests from a desktop that is an i3-6100 CPU @ 3.70GHz 3.70 GHz.

bytebites’s Amd Rysen 9 3900X in O3 shows the same ratio as my laptop but other testers have ratios in the same range as my desktop.

One must notice that the initial code involved a var parameter in the check(i) function that prevented I to be register cached. The function was inline'd and apparently the compiler managed to eliminate that code in mode O4.

My opinion is that with modern processors, speed improvement in program code becomes significant only when there is at least a 15 % difference across multiple runs and methods order call.

Logged

Lazarus

Bookstore

Search

Recent

Author Topic: Optimizing the counter code (Read 3846 times)

BrunoK

Re: Optimizing the counter code

PascalDragon

Re: Optimizing the counter code

MathMan

Re: Optimizing the counter code

PascalDragon

Re: Optimizing the counter code

BrunoK

Re: Optimizing the counter code

marcov

Re: Optimizing the counter code

Okoba

Re: Optimizing the counter code

BrunoK

Re: Optimizing the counter code

MathMan

Re: Optimizing the counter code

Okoba

Re: Optimizing the counter code

bytebites

Re: Optimizing the counter code

BrunoK

Re: Optimizing the counter code

	Computer Math and Games in Pascal (preview)
	Lazarus Handbook