Recent

Author Topic: Optimization suggestion for nested variable access  (Read 3119 times)

440bx

  • Hero Member
  • *****
  • Posts: 1896
Re: Optimization suggestion for nested variable access
« Reply #15 on: May 25, 2020, 03:34:45 am »
Access to global vars is different even to accessing none-nested locals (though probably not much of a time diff).
Therefore both tests would need locals. One would need nested, and the other un-nested.
Fair enough and, I even forgot to test with local variables (tested only against globals.) That's also why the "local variables" test is missing the local variables <chuckle>.  Here is the corrected test program that includes the local variables:
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program TestSimpleIf;
  4.   { program to test the time consumed to test against some limits             }
  5.  
  6.   const
  7.     MILLION1   =   1000000;
  8.     MILLION10  =  10000000;
  9.     MILLION100 = 100000000;
  10.  
  11.     BILLION    = 1000 * MILLION1;
  12.  
  13.     LOOP_LIMIT = BILLION;
  14.  
  15.   var
  16.     VariableLimit                   : qword;
  17.  
  18.     Variable1, Variable2, Variable3 : qword;
  19.  
  20.   procedure TestIf();
  21.   var
  22.     i : qword;
  23.  
  24.   begin
  25.     writeln('constant limit, global variables - Loop STARTING');
  26.  
  27.     for i := 1 to LOOP_LIMIT do
  28.     begin
  29.       Variable1 := i;
  30.       Variable2 := i;
  31.       Variable3 := i;
  32.  
  33.       if i >= LOOP_LIMIT then
  34.       begin
  35.  
  36.         writeln('constant limit, global variables - Loop DONE ', i);
  37.         writeln;
  38.       end;
  39.     end;
  40.   end;
  41.  
  42.   procedure TestIfVariableLimit();
  43.   var
  44.     i : qword;
  45.  
  46.   begin
  47.     writeln('Variable limit, global variables - Loop STARTING');
  48.  
  49.     for i := 1 to VariableLimit do
  50.     begin
  51.       Variable1 := i;
  52.       Variable2 := i;
  53.       Variable3 := i;
  54.  
  55.       if i >= VariableLimit then
  56.       begin
  57.         writeln('Variable limit, global variables - Loop DONE ', i);
  58.         writeln;
  59.       end;
  60.     end;
  61.   end;
  62.  
  63.   procedure TestIfVariableLimitLocal();
  64.   var
  65.     i                               : qword;
  66.     Variable1, Variable2, Variable3 : qword;
  67.  
  68.   begin
  69.     writeln('Variable limit, local variables - Loop STARTING');
  70.  
  71.     for i := 1 to VariableLimit do
  72.     begin
  73.       Variable1 := i;
  74.       Variable2 := i;
  75.       Variable3 := i;
  76.  
  77.       if i >= VariableLimit then
  78.       begin
  79.         writeln('Variable limit, local variables - Loop DONE ', i);
  80.         writeln;
  81.       end;
  82.     end;
  83.   end;
  84.  
  85.  
  86. begin
  87.   VariableLimit := LOOP_LIMIT;
  88.  
  89.   //TestIf();
  90.   //TestIfVariableLimit();
  91.   TestIfVariableLimitLocal();
  92. end.
The TestNestedIf program is unchanged. 


Even doing the exact same test as you did, I do not get the same results: For me the factor was 2 or 2.5 instead of 17.
How to get your numbers?
I just run the two programs, nothing else. 

Here are the results I get:

For the simple if :
Code: Text  [Select][+][-]
  1. Timer 1 on: 21:17:12
  2. Variable limit, local variables - Loop STARTING
  3. Variable limit, local variables - Loop DONE 1000000000
  4.  
  5. Timer 1 off: 21:17:13  Elapsed: 0:00:01.03

For the nested if:
Code: Text  [Select][+][-]
  1. Timer 1 on: 21:18:30
  2. level 3
  3. Level 4
  4. Level 5
  5. Level 6
  6. Variable limit - Loop STARTING
  7. Variable limit - Loop DONE 1000000000
  8.  
  9. Timer 1 off: 21:18:47  Elapsed: 0:00:17.50
That is a factor of 17 (thereabouts)

If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4


« Last Edit: May 25, 2020, 03:38:04 am by 440bx »
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6384
    • wiki
Re: Optimization suggestion for nested variable access
« Reply #16 on: May 25, 2020, 09:30:07 am »
If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

I run the 2 exact versions that you have posted.

For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?

MathMan

  • Full Member
  • ***
  • Posts: 196
Re: Optimization suggestion for nested variable access
« Reply #17 on: May 25, 2020, 10:42:15 am »
If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

I run the 2 exact versions that you have posted.

For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The capabilities for register renaming, simultaneous operations in flight and the loop buffer size all have increased significantly between the processor versions mentioned. I would subscribe the measured differences mainly to this, and especially the first.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

While the branch-predictor of the i860 is weaker than than the one from i8700 it should still predict correct nearly all of the 1 billion iterations, so I don't think that this one is the culprit.

How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?

Beware to carefully analyse the generated asm - depending on compiler capabilities (don't know if FPC 3.0.4 can, but) the address of the variable may be kept in a local register for the remaining 49 accesses, which would make the timing of the unrolled loop hard to compare to the rolled-one.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8491
  • FPC developer.
Re: Optimization suggestion for nested variable access
« Reply #18 on: May 25, 2020, 11:34:39 am »
Sandy bridge (i3/5/7-2xxx and later) have a uop cache that avoids redecoding loops. In general it makes loops that fit the uop-cache quite a lot faster. But maybe in this case it is just general load/store performance.

I do wonder why both FPC and Delphi don't mark a framepointer for a certain frame as a intermediate value that is register coloured. It should be invariant inside a procedure.
« Last Edit: May 25, 2020, 12:48:58 pm by marcov »

440bx

  • Hero Member
  • *****
  • Posts: 1896
Re: Optimization suggestion for nested variable access
« Reply #19 on: May 25, 2020, 11:59:53 am »
It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.
I have a hard time believing that those things could account for a difference that large between the results. 

I'd like you to run this little test program on your machine and report the times:
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program PointerDereference;
  4.  
  5. type
  6.   PQWORD      = ^qword;
  7.   PPQWORD     = ^PQWORD;
  8.   PPPQWORD    = ^PPQWORD;
  9.   PPPPQWORD   = ^PPPQWORD;
  10.   PPPPPQWORD  = ^PPPPQWORD;
  11.  
  12.   const
  13.     MILLION1      =   1000000;
  14.     MILLION10     =  10000000;
  15.     MILLION100    = 100000000;
  16.  
  17.     BILLION       = 1000 * MILLION1;
  18.     BILLION5      =    5 * BILLION;
  19.  
  20.     LOOP_LIMIT    = BILLION;
  21.  
  22.   var
  23.     VariableLimit : qword;
  24.  
  25.   procedure TestIfVariableLimit();
  26.   var
  27.     { FPC doesn't accept a nested variable as a            }
  28.     { "for" counter                                        }
  29.  
  30.     i                                              : qword;
  31.  
  32.     OuterVariable1, OuterVariable2, OuterVariable3 : qword;
  33.  
  34.     a1, a2, a3 : PQWORD;
  35.     b1, b2, b3 : PPQWORD;
  36.     c1, c2, c3 : PPPQWORD;
  37.     d1, d2, d3 : PPPPQWORD;
  38.     e1, e2, e3 : PPPPPQWORD;
  39.  
  40.  
  41.   begin
  42.     a1 := @OuterVariable1;
  43.     b1 := @a1;
  44.     c1 := @b1;
  45.     d1 := @c1;
  46.     e1 := @d1;
  47.  
  48.     a2 := @OuterVariable2;
  49.     b2 := @a2;
  50.     c2 := @b2;
  51.     d2 := @c2;
  52.     e2 := @d2;
  53.  
  54.     a3 := @OuterVariable3;
  55.     b3 := @a3;
  56.     c3 := @b3;
  57.     d3 := @c3;
  58.     e3 := @d3;
  59.  
  60.  
  61.     writeln('Variable limit - Loop STARTING');
  62.  
  63.     for i := 1 to VariableLimit do
  64.     begin
  65.       e1^^^^^ := 1;
  66.       e2^^^^^ := 2;
  67.       e3^^^^^ := 3;
  68.  
  69.       if i >= VariableLimit then
  70.       begin
  71.         writeln('Variable limit - Loop DONE ', i);
  72.         writeln;
  73.         writeln('OuterVariable1 : ', OuterVariable1);
  74.         writeln('OuterVariable2 : ', OuterVariable2);
  75.         writeln('OuterVariable3 : ', OuterVariable3);
  76.       end;
  77.     end;
  78.   end;
  79.  
  80. begin
  81.   VariableLimit := BILLION;
  82.  
  83.   TestIfVariableLimit();
  84. end.            
The code above, on my machine, takes a tad over 17 seconds, which is what I expect since it is simulating the stack walk that's done in the other program.

Run it as is and, post your time.  Mine time is:
Code: Text  [Select][+][-]
  1. Timer 1 on:  5:51:32
  2. Variable limit - Loop STARTING
  3. Variable limit - Loop DONE 1000000000
  4.  
  5. OuterVariable1 : 1
  6. OuterVariable2 : 2
  7. OuterVariable3 : 3
  8. Timer 1 off:  5:51:50  Elapsed: 0:00:17.58
which is what I expected. 

ETA:
It would be nice too if the compiler realized that e1, e2 and e3 are invariants that could be moved out of the loop.
« Last Edit: May 25, 2020, 12:11:46 pm by 440bx »
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6384
    • wiki
Re: Optimization suggestion for nested variable access
« Reply #20 on: May 25, 2020, 12:35:47 pm »
I'd like you to run this little test program on your machine and report the times:

3.0.4
-O1  2.87 seconds
-O4  1.95

O1:
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    -136(%rbp),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    (%rax),%rax
  7.         movq    $1,(%rax)
  8.  
O4:
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  


440bx

  • Hero Member
  • *****
  • Posts: 1896
Re: Optimization suggestion for nested variable access
« Reply #21 on: May 25, 2020, 12:46:13 pm »
O4:
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
That's the exact same code I get and it takes a tad over 17 seconds to run on my machine.  I admit to being very surprised that current processors managed to improve on the i860 that much.  It's an impressive improvement.

Thank you for testing it.
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

lucamar

  • Hero Member
  • *****
  • Posts: 2926
Re: Optimization suggestion for nested variable access
« Reply #22 on: May 25, 2020, 01:29:09 pm »
Probably a stupid question but, how have any of you managed to compile that code with FPC 3.0.4? Whatever I try it always result in this error:
Code: [Select]
PointerDereference.pas(63,9) Error: (4007) Ordinal expression expected
which references the line:
Code: Pascal  [Select][+][-]
  1. for i := 1 to VariableLimit do

Variable i is a QWord which, according to the docs, is not  a proper "ordinal", so the error might be normal but both of you have managed to compile it. Hence my somewhat plaintive question: how?  :'(
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.8/FPC 3.0.4 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 6384
    • wiki
Re: Optimization suggestion for nested variable access
« Reply #23 on: May 25, 2020, 01:33:19 pm »
My 64 bit 3.0.4 did the job.
I have not tried any 32 bit fpc.
And 3.3.1 (1 or 2 month old) gave the error.

lucamar

  • Hero Member
  • *****
  • Posts: 2926
Re: Optimization suggestion for nested variable access
« Reply #24 on: May 25, 2020, 02:32:42 pm »
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.8/FPC 3.0.4 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

ASerge

  • Hero Member
  • *****
  • Posts: 1618
Re: Optimization suggestion for nested variable access
« Reply #25 on: May 25, 2020, 03:37:41 pm »
I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.

Thaddy

  • Hero Member
  • *****
  • Posts: 10199
Re: Optimization suggestion for nested variable access
« Reply #26 on: May 25, 2020, 03:40:47 pm »
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
The limit for ordinals....? You know that..
I am more like donkey than shrek

lucamar

  • Hero Member
  • *****
  • Posts: 2926
Re: Optimization suggestion for nested variable access
« Reply #27 on: May 25, 2020, 04:42:33 pm »
The limit for ordinals....? You know that..

Yeah, I should but I must confess that most of my development is in 32 bits, so I'm no very keen about the differences in 64 bits. :-[

After re-reading the Reference guide all becomes clearer:
Quote from: 3.1.1 Ordinal types
Remark: Int64 and QWord are considered ordinal types on 64-bit CPUs. On 32-bit types they have some of the characteristics of ordinals, but they cannot be used e.g. in for loops.

So here ends this off-topic "excursion". Sorry about it. O:-)
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.8/FPC 3.0.4 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

440bx

  • Hero Member
  • *****
  • Posts: 1896
Re: Optimization suggestion for nested variable access
« Reply #28 on: May 25, 2020, 07:12:58 pm »
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
Yes, that's correct.  The loop counter can be a qword only when compiling for 64 bit because it is not considered an ordinal type in 32 bit.



I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.
That's probably true in OOP.  In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (but they don't show in the code structure.)
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

ASerge

  • Hero Member
  • *****
  • Posts: 1618
Re: Optimization suggestion for nested variable access
« Reply #29 on: May 26, 2020, 09:52:34 pm »
In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (but they don't show in the code structure.)
Strongly doubt.
Level one - the usual procedures that are most often used.
Level two - nested procedure. The convenience of the pascal language, which some languages lack. Allows you to hide the implementation, reducing the code's visibility, without losing clarity.
If the size of the level two procedure is large, then in principle you can use level three. But in practice, it is more convenient to do several procedures at level 2 with rare exception is when there is an interaction between procedures that needs to be hidden. But this is usually very complex code and sometimes it is more useful to convert it into classes or make several first-level procedures.
Level four usually talks about a very crumpled code and honestly, I've never seen this before.

 

TinyPortal © 2005-2018