Optimization suggestion for nested variable access

440bx

Hero Member
Posts: 4065

Re: Optimization suggestion for nested variable access

« Reply #15 on: May 25, 2020, 03:34:45 am »

Quote from: Martin_fr on May 25, 2020, 02:21:06 am

Access to global vars is different even to accessing none-nested locals (though probably not much of a time diff).
Therefore both tests would need locals. One would need nested, and the other un-nested.

Fair enough and, I even forgot to test with local variables (tested only against globals.) That's also why the "local variables" test is missing the local variables <chuckle>. Here is the corrected test program that includes the local variables:

Code: Pascal [Select][+]

{$APPTYPE CONSOLE}
 
program TestSimpleIf;
  { program to test the time consumed to test against some limits             }
 
  const
    MILLION1   =   1000000;
    MILLION10  =  10000000;
    MILLION100 = 100000000;
 
    BILLION    = 1000 * MILLION1;
 
    LOOP_LIMIT = BILLION;
 
  var
    VariableLimit                   : qword;
 
    Variable1, Variable2, Variable3 : qword;
 
  procedure TestIf();
  var
    i : qword;
 
  begin
    writeln('constant limit, global variables - Loop STARTING');
 
    for i := 1 to LOOP_LIMIT do
    begin
      Variable1 := i;
      Variable2 := i;
      Variable3 := i;
 
      if i >= LOOP_LIMIT then
      begin
 
        writeln('constant limit, global variables - Loop DONE ', i);
        writeln;
      end;
    end;
  end;
 
  procedure TestIfVariableLimit();
  var
    i : qword;
 
  begin
    writeln('Variable limit, global variables - Loop STARTING');
 
    for i := 1 to VariableLimit do
    begin
      Variable1 := i;
      Variable2 := i;
      Variable3 := i;
 
      if i >= VariableLimit then
      begin
        writeln('Variable limit, global variables - Loop DONE ', i);
        writeln;
      end;
    end;
  end;
 
  procedure TestIfVariableLimitLocal();
  var
    i                               : qword;
    Variable1, Variable2, Variable3 : qword;
 
  begin
    writeln('Variable limit, local variables - Loop STARTING');
 
    for i := 1 to VariableLimit do
    begin
      Variable1 := i;
      Variable2 := i;
      Variable3 := i;
 
      if i >= VariableLimit then
      begin
        writeln('Variable limit, local variables - Loop DONE ', i);
        writeln;
      end;
    end;
  end;
 
 
begin
  VariableLimit := LOOP_LIMIT;
 
  //TestIf();
  //TestIfVariableLimit();
  TestIfVariableLimitLocal();
end.

The TestNestedIf program is unchanged.

Quote from: Martin_fr on May 25, 2020, 02:21:06 am

Even doing the exact same test as you did, I do not get the same results: For me the factor was 2 or 2.5 instead of 17.
How to get your numbers?

I just run the two programs, nothing else.

Here are the results I get:

For the simple if :

Code: Text [Select][+]

Timer 1 on: 21:17:12
Variable limit, local variables - Loop STARTING
Variable limit, local variables - Loop DONE 1000000000
 
Timer 1 off: 21:17:13  Elapsed: 0:00:01.03

For the nested if:

Code: Text [Select][+]

Timer 1 on: 21:18:30
level 3
Level 4
Level 5
Level 6
Variable limit - Loop STARTING
Variable limit - Loop DONE 1000000000
 
Timer 1 off: 21:18:47  Elapsed: 0:00:17.50

That is a factor of 17 (thereabouts)

If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

« Last Edit: May 25, 2020, 03:38:04 am by 440bx »

Logged

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Martin_fr

Administrator
Hero Member
Posts: 9909
Debugger - SynEdit - and more

Re: Optimization suggestion for nested variable access

« Reply #16 on: May 25, 2020, 09:30:07 am »

Quote from: 440bx on May 25, 2020, 03:34:45 am

If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

I run the 2 exact versions that you have posted.

For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

MathMan

Sr. Member
Posts: 325

Re: Optimization suggestion for nested variable access

« Reply #17 on: May 25, 2020, 10:42:15 am »

Quote from: Martin_fr on May 25, 2020, 09:30:07 am

Quote from: 440bx on May 25, 2020, 03:34:45 am
If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

I run the 2 exact versions that you have posted.

For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The capabilities for register renaming, simultaneous operations in flight and the loop buffer size all have increased significantly between the processor versions mentioned. I would subscribe the measured differences mainly to this, and especially the first.

Quote from: Martin_fr on May 25, 2020, 09:30:07 am

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

While the branch-predictor of the i860 is weaker than than the one from i8700 it should still predict correct nearly all of the 1 billion iterations, so I don't think that this one is the culprit.

Quote from: Martin_fr on May 25, 2020, 09:30:07 am

How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?

Beware to carefully analyse the generated asm - depending on compiler capabilities (don't know if FPC 3.0.4 can, but) the address of the variable may be kept in a local register for the remaining 49 accesses, which would make the timing of the unrolled loop hard to compare to the rolled-one.

Logged

marcov

Administrator
Hero Member
Posts: 11459
FPC developer.

Re: Optimization suggestion for nested variable access

« Reply #18 on: May 25, 2020, 11:34:39 am »

Sandy bridge (i3/5/7-2xxx and later) have a uop cache that avoids redecoding loops. In general it makes loops that fit the uop-cache quite a lot faster. But maybe in this case it is just general load/store performance.

I do wonder why both FPC and Delphi don't mark a framepointer for a certain frame as a intermediate value that is register coloured. It should be invariant inside a procedure.

« Last Edit: May 25, 2020, 12:48:58 pm by marcov »

Logged

440bx

Hero Member
Posts: 4065

Re: Optimization suggestion for nested variable access

« Reply #19 on: May 25, 2020, 11:59:53 am »

Quote from: Martin_fr on May 25, 2020, 09:30:07 am

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

I have a hard time believing that those things could account for a difference that large between the results.

I'd like you to run this little test program on your machine and report the times:

Code: Pascal [Select][+]

{$APPTYPE CONSOLE}
 
program PointerDereference;
 
type
  PQWORD      = ^qword;
  PPQWORD     = ^PQWORD;
  PPPQWORD    = ^PPQWORD;
  PPPPQWORD   = ^PPPQWORD;
  PPPPPQWORD  = ^PPPPQWORD;
 
  const
    MILLION1      =   1000000;
    MILLION10     =  10000000;
    MILLION100    = 100000000;
 
    BILLION       = 1000 * MILLION1;
    BILLION5      =    5 * BILLION;
 
    LOOP_LIMIT    = BILLION;
 
  var
    VariableLimit : qword;
 
  procedure TestIfVariableLimit();
  var
    { FPC doesn't accept a nested variable as a            }
    { "for" counter                                        }
 
    i                                              : qword;
 
    OuterVariable1, OuterVariable2, OuterVariable3 : qword;
 
    a1, a2, a3 : PQWORD;
    b1, b2, b3 : PPQWORD;
    c1, c2, c3 : PPPQWORD;
    d1, d2, d3 : PPPPQWORD;
    e1, e2, e3 : PPPPPQWORD;
 
 
  begin
    a1 := @OuterVariable1;
    b1 := @a1;
    c1 := @b1;
    d1 := @c1;
    e1 := @d1;
 
    a2 := @OuterVariable2;
    b2 := @a2;
    c2 := @b2;
    d2 := @c2;
    e2 := @d2;
 
    a3 := @OuterVariable3;
    b3 := @a3;
    c3 := @b3;
    d3 := @c3;
    e3 := @d3;
 
 
    writeln('Variable limit - Loop STARTING');
 
    for i := 1 to VariableLimit do
    begin
      e1^^^^^ := 1;
      e2^^^^^ := 2;
      e3^^^^^ := 3;
 
      if i >= VariableLimit then
      begin
        writeln('Variable limit - Loop DONE ', i);
        writeln;
        writeln('OuterVariable1 : ', OuterVariable1);
        writeln('OuterVariable2 : ', OuterVariable2);
        writeln('OuterVariable3 : ', OuterVariable3);
      end;
    end;
  end;
 
begin
  VariableLimit := BILLION;
 
  TestIfVariableLimit();
end.            

The code above, on my machine, takes a tad over 17 seconds, which is what I expect since it is simulating the stack walk that's done in the other program.

Run it as is and, post your time. Mine time is:

Code: Text [Select][+]

Timer 1 on:  5:51:32
Variable limit - Loop STARTING
Variable limit - Loop DONE 1000000000
 
OuterVariable1 : 1
OuterVariable2 : 2
OuterVariable3 : 3
Timer 1 off:  5:51:50  Elapsed: 0:00:17.58

which is what I expected.

ETA:
It would be nice too if the compiler realized that e1, e2 and e3 are invariants that could be moved out of the loop.

« Last Edit: May 25, 2020, 12:11:46 pm by 440bx »

Logged

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Martin_fr

Administrator
Hero Member
Posts: 9909
Debugger - SynEdit - and more

Re: Optimization suggestion for nested variable access

« Reply #20 on: May 25, 2020, 12:35:47 pm »

Quote from: 440bx on May 25, 2020, 11:59:53 am

I'd like you to run this little test program on your machine and report the times:

3.0.4
-O1 2.87 seconds
-O4 1.95

O1:

Code: Pascal [Select][+]

# [67] e1^^^^^ := 1;
        movq    -136(%rbp),%rax
        movq    (%rax),%rax
        movq    (%rax),%rax
        movq    (%rax),%rax
        movq    (%rax),%rax
        movq    $1,(%rax)
 

O4:

Code: Pascal [Select][+]

# [67] e1^^^^^ := 1;
        movq    (%rbx),%rax
        movq    (%rax),%rax
        movq    (%rax),%rax
        movq    (%rax),%rax
        movq    $1,(%rax)
.Ll20:
 

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

440bx

Hero Member
Posts: 4065

Re: Optimization suggestion for nested variable access

« Reply #21 on: May 25, 2020, 12:46:13 pm »

Quote from: Martin_fr on May 25, 2020, 12:35:47 pm

O4:
Code: Pascal [Select][+][-]
# [67] e1^^^^^ := 1;
movq (%rbx),%rax
movq (%rax),%rax
movq (%rax),%rax
movq (%rax),%rax
movq $1,(%rax)
.Ll20:

That's the exact same code I get and it takes a tad over 17 seconds to run on my machine. I admit to being very surprised that current processors managed to improve on the i860 that much. It's an impressive improvement.

Thank you for testing it.

Logged

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

lucamar

Hero Member
Posts: 4219

Re: Optimization suggestion for nested variable access

« Reply #22 on: May 25, 2020, 01:29:09 pm »

Probably a stupid question but, how have any of you managed to compile that code with FPC 3.0.4? Whatever I try it always result in this error:

Code: [Select]

PointerDereference.pas(63,9) Error: (4007) Ordinal expression expected
which references the line:

Code: Pascal [Select][+]

for i := 1 to VariableLimit do

Variable i is a QWord which, according to the docs, is not a proper "ordinal", so the error might be normal but both of you have managed to compile it. Hence my somewhat plaintive question: how?

Logged

Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!)

Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

Martin_fr

Administrator
Hero Member
Posts: 9909
Debugger - SynEdit - and more

Re: Optimization suggestion for nested variable access

« Reply #23 on: May 25, 2020, 01:33:19 pm »

My 64 bit 3.0.4 did the job.
I have not tried any 32 bit fpc.
And 3.3.1 (1 or 2 month old) gave the error.

Logged

From the wiki: Ide Tools, Code completion and more / IDE cool features / Debugger Status

lucamar

Hero Member
Posts: 4219

Re: Optimization suggestion for nested variable access

« Reply #24 on: May 25, 2020, 02:32:42 pm »

Quote from: Martin_fr on May 25, 2020, 01:33:19 pm

My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... $:-\$

Logged

Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!)

Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

ASerge

Hero Member
Posts: 2249

Re: Optimization suggestion for nested variable access

« Reply #25 on: May 25, 2020, 03:37:41 pm »

I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.

Logged

Thaddy

Hero Member
Posts: 14387
Sensorship about opinions does not belong here.

Re: Optimization suggestion for nested variable access

« Reply #26 on: May 25, 2020, 03:40:47 pm »

Quote from: lucamar on May 25, 2020, 02:32:42 pm

Quote from: Martin_fr on May 25, 2020, 01:33:19 pm
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... $:-\$

The limit for ordinals....? You know that..

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

lucamar

Hero Member
Posts: 4219

Re: Optimization suggestion for nested variable access

« Reply #27 on: May 25, 2020, 04:42:33 pm »

Quote from: Thaddy on May 25, 2020, 03:40:47 pm

The limit for ordinals....? You know that..

Yeah, I should but I must confess that most of my development is in 32 bits, so I'm no very keen about the differences in 64 bits.

After re-reading the Reference guide all becomes clearer:

Quote from: 3.1.1 Ordinal types

Remark: Int64 and QWord are considered ordinal types on 64-bit CPUs. On 32-bit types they have some of the characteristics of ordinals, but they cannot be used e.g. in for loops.

So here ends this off-topic "excursion". Sorry about it.

Logged

Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!)

Lazarus/FPC 2.0.8/3.0.4 & 2.0.12/3.2.0 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

440bx

Hero Member
Posts: 4065

Re: Optimization suggestion for nested variable access

« Reply #28 on: May 25, 2020, 07:12:58 pm »

Quote from: lucamar on May 25, 2020, 02:32:42 pm

Quote from: Martin_fr on May 25, 2020, 01:33:19 pm
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... $:-\$

Yes, that's correct. The loop counter can be a qword only when compiling for 64 bit because it is not considered an ordinal type in 32 bit.

Quote from: ASerge on May 25, 2020, 03:37:41 pm

I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.

That's probably true in OOP. In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (but they don't show in the code structure.)

Logged

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

ASerge

Hero Member
Posts: 2249

Re: Optimization suggestion for nested variable access

« Reply #29 on: May 26, 2020, 09:52:34 pm »

Quote from: 440bx on May 25, 2020, 07:12:58 pm

In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (but they don't show in the code structure.)

Strongly doubt.
Level one - the usual procedures that are most often used.
Level two - nested procedure. The convenience of the pascal language, which some languages lack. Allows you to hide the implementation, reducing the code's visibility, without losing clarity.
If the size of the level two procedure is large, then in principle you can use level three. But in practice, it is more convenient to do several procedures at level 2 with rare exception is when there is an interaction between procedures that needs to be hidden. But this is usually very complex code and sometimes it is more useful to convert it into classes or make several first-level procedures.
Level four usually talks about a very crumpled code and honestly, I've never seen this before.

Logged

Lazarus

Bookstore

Search

Recent

Author Topic: Optimization suggestion for nested variable access (Read 9915 times)

440bx

Re: Optimization suggestion for nested variable access

Martin_fr

Re: Optimization suggestion for nested variable access

MathMan

Re: Optimization suggestion for nested variable access

marcov

Re: Optimization suggestion for nested variable access

440bx

Re: Optimization suggestion for nested variable access

Martin_fr

Re: Optimization suggestion for nested variable access

440bx

Re: Optimization suggestion for nested variable access

lucamar

Re: Optimization suggestion for nested variable access

Martin_fr

Re: Optimization suggestion for nested variable access

lucamar

Re: Optimization suggestion for nested variable access

ASerge

Re: Optimization suggestion for nested variable access

Thaddy

Re: Optimization suggestion for nested variable access

lucamar

Re: Optimization suggestion for nested variable access

440bx

Re: Optimization suggestion for nested variable access

ASerge

Re: Optimization suggestion for nested variable access

	Computer Math and Games in Pascal (preview)
	Lazarus Handbook