Lazarus

Free Pascal => General => Topic started by: 440bx on May 24, 2020, 02:41:13 am

Title: Optimization suggestion for nested variable access
Post by: 440bx on May 24, 2020, 02:41:13 am
Hello,

Following the discussion in this thread, https://forum.lazarus.freepascal.org/index.php/topic,49459.0.html
I looked into FPC's "stack flattening" abilities.  To make a short story really short: FPC does not "flatten" nested procedure/function stacks (that's no surprise, no one claimed it did.)

However, there is a simple optimization in that case that FPC could do.  Consider the following small program:
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program TestNestedVariableAccess;   { level 1 }
  4.  
  5.   procedure OuterProcedure();       { level 2 }
  6.   var
  7.     OuterVariable1, OuterVariable2, OuterVariable3 : integer;
  8.  
  9.     procedure Level3();             { level 3 }
  10.  
  11.       procedure Level4();           { level 4 }
  12.  
  13.         procedure Level5();         { level 5 }
  14.  
  15.           procedure Level6();       { level 6 }
  16.           begin
  17.             writeln('Level6');
  18.  
  19.             OuterVariable1 := 1;
  20.             OuterVariable2 := 2;
  21.             OuterVariable3 := 3;
  22.  
  23.             writeln('OuterVariable1 : ', OuterVariable1);
  24.             writeln('OuterVariable2 : ', OuterVariable2);
  25.             writeln('OuterVariable3 : ', OuterVariable3);
  26.           end;
  27.  
  28.         begin
  29.           writeln('Level5');
  30.           Level6();
  31.         end;
  32.  
  33.       begin
  34.         writeln('Level4');
  35.         Level5();
  36.       end;
  37.  
  38.     begin
  39.       writeln('level 3');
  40.       Level4();
  41.     end;
  42.  
  43.   begin
  44.     Level3();
  45.   end;
  46.  
  47.  
  48. begin
  49.   OuterProcedure();
  50. end.                    
When compiled with -O4 (64bit, Windows target), it produces the following code (abbreviated):
Code: ASM  [Select][+][-]
  1. TestNestedVariableAccess.lpr:19           OuterVariable1 := 1;
  2.                                           ; walk the frame pointers
  3. 00000001000015A9 488b442420               mov    0x20(%rsp),%rax
  4. 00000001000015AE 488b4020                 mov    0x20(%rax),%rax
  5. 00000001000015B2 488b4020                 mov    0x20(%rax),%rax
  6. 00000001000015B6 488b4020                 mov    0x20(%rax),%rax
  7. 00000001000015BA c7402001000000           movl   $0x1,0x20(%rax)
  8.  
  9. TestNestedVariableAccess.lpr:20           OuterVariable2 := 2;
  10.                                           ; walk the frame pointers (again)
  11. 00000001000015C1 488b442420               mov    0x20(%rsp),%rax
  12. 00000001000015C6 488b4020                 mov    0x20(%rax),%rax
  13. 00000001000015CA 488b4020                 mov    0x20(%rax),%rax
  14. 00000001000015CE 488b4020                 mov    0x20(%rax),%rax
  15. 00000001000015D2 c7402802000000           movl   $0x2,0x28(%rax)
  16.  
  17. TestNestedVariableAccess.lpr:21           OuterVariable3 := 3;
  18.                                           ; walk the frame pointers (and again - a road well traveled by now!)
  19. 00000001000015D9 488b442420               mov    0x20(%rsp),%rax
  20. 00000001000015DE 488b4020                 mov    0x20(%rax),%rax
  21. 00000001000015E2 488b4020                 mov    0x20(%rax),%rax
  22. 00000001000015E6 488b4020                 mov    0x20(%rax),%rax
  23. 00000001000015EA c7403003000000           movl   $0x3,0x30(%rax)
  24.  
Basically, the compiler is walking the stack chain for every variable reference in that scope.  It could walk it once upon entry to the Level6() procedure to have a pointer to that target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the stack for every reference.

I just thought I'd mention it.  It's a fairly simple optimization.

Title: Re: Optimization suggestion for nested variable access
Post by: fcu on May 24, 2020, 07:03:42 am
i think fpc couldn't inlined nested proc/func , so stack frame is always built in each call   
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 24, 2020, 07:45:07 am
i think fpc couldn't inlined nested proc/func , so stack frame is always built in each call
The fact that it didn't inline is fine.  The problem is that the compiler walks the stack frames more than is necessary.  One walk at procedure entry is all it should need, not one per reference.

Title: Re: Optimization suggestion for nested variable access
Post by: jamie on May 24, 2020, 02:10:57 pm
Hello,

Following the discussion in this thread, https://forum.lazarus.freepascal.org/index.php/topic,49459.0.html
I looked into FPC's "stack flattening" abilities.  To make a short story really short: FPC does not "flatten" nested procedure/function stacks (that's no surprise, no one claimed it did.)

However, there is a simple optimization in that case that FPC could do.  Consider the following small program:
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program TestNestedVariableAccess;   { level 1 }
  4.  
  5.   procedure OuterProcedure();       { level 2 }
  6.   var
  7.     OuterVariable1, OuterVariable2, OuterVariable3 : integer;
  8.  
  9.     procedure Level3();             { level 3 }
  10.  
  11.       procedure Level4();           { level 4 }
  12.  
  13.         procedure Level5();         { level 5 }
  14.  
  15.           procedure Level6();       { level 6 }
  16.           begin
  17.             writeln('Level6');
  18.  
  19.             OuterVariable1 := 1;
  20.             OuterVariable2 := 2;
  21.             OuterVariable3 := 3;
  22.  
  23.             writeln('OuterVariable1 : ', OuterVariable1);
  24.             writeln('OuterVariable2 : ', OuterVariable2);
  25.             writeln('OuterVariable3 : ', OuterVariable3);
  26.           end;
  27.  
  28.         begin
  29.           writeln('Level5');
  30.           Level6();
  31.         end;
  32.  
  33.       begin
  34.         writeln('Level4');
  35.         Level5();
  36.       end;
  37.  
  38.     begin
  39.       writeln('level 3');
  40.       Level4();
  41.     end;
  42.  
  43.   begin
  44.     Level3();
  45.   end;
  46.  
  47.  
  48. begin
  49.   OuterProcedure();
  50. end.                    
When compiled with -O4 (64bit, Windows target), it produces the following code (abbreviated):
Code: ASM  [Select][+][-]
  1. TestNestedVariableAccess.lpr:19           OuterVariable1 := 1;
  2.                                           ; walk the frame pointers
  3. 00000001000015A9 488b442420               mov    0x20(%rsp),%rax
  4. 00000001000015AE 488b4020                 mov    0x20(%rax),%rax
  5. 00000001000015B2 488b4020                 mov    0x20(%rax),%rax
  6. 00000001000015B6 488b4020                 mov    0x20(%rax),%rax
  7. 00000001000015BA c7402001000000           movl   $0x1,0x20(%rax)
  8.  
  9. TestNestedVariableAccess.lpr:20           OuterVariable2 := 2;
  10.                                           ; walk the frame pointers (again)
  11. 00000001000015C1 488b442420               mov    0x20(%rsp),%rax
  12. 00000001000015C6 488b4020                 mov    0x20(%rax),%rax
  13. 00000001000015CA 488b4020                 mov    0x20(%rax),%rax
  14. 00000001000015CE 488b4020                 mov    0x20(%rax),%rax
  15. 00000001000015D2 c7402802000000           movl   $0x2,0x28(%rax)
  16.  
  17. TestNestedVariableAccess.lpr:21           OuterVariable3 := 3;
  18.                                           ; walk the frame pointers (and again - a road well traveled by now!)
  19. 00000001000015D9 488b442420               mov    0x20(%rsp),%rax
  20. 00000001000015DE 488b4020                 mov    0x20(%rax),%rax
  21. 00000001000015E2 488b4020                 mov    0x20(%rax),%rax
  22. 00000001000015E6 488b4020                 mov    0x20(%rax),%rax
  23. 00000001000015EA c7403003000000           movl   $0x3,0x30(%rax)
  24.  
Basically, the compiler is walking the stack chain for every variable reference in that scope.  It could walk it once upon entry to the Level6() procedure to have a pointer to that target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the stack for every reference.

I just thought I'd mention it.  It's a fairly simple optimization.

Looks like the compiler is putting the CPU on an exercise program.

Why does it repeat the same code over and over... ?

This looks like the example I presented with multiple variables being initialized.

Most likely the compiler would generate different code if those variables were actually, but who knows..

Responses should be interesting  %)

Title: Re: Optimization suggestion for nested variable access
Post by: marcov on May 24, 2020, 03:13:46 pm
Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Delphi doesn't optimize this either.
Title: Re: Optimization suggestion for nested variable access
Post by: jamie on May 24, 2020, 04:36:54 pm
Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Delphi doesn't optimize this either.
Just ran that through D, the output code is totally different.
Title: Re: Optimization suggestion for nested variable access
Post by: MarkMLl on May 24, 2020, 07:38:33 pm
It could walk it once upon entry to the Level6() procedure to have a pointer to that target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the stack for every reference.

I just thought I'd mention it.  It's a fairly simple optimization.

If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.

MarkMLl
Title: Re: Optimization suggestion for nested variable access
Post by: BeniBela on May 24, 2020, 07:58:45 pm
Fpc always sucks at optimization

Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Delphi doesn't optimize this either.
Just ran that through D, the output code is totally different.

I just ran it through my Delphi 4

It gives the same construction

So at least fpc is not worse than a 20 year old compiler
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 24, 2020, 08:23:45 pm
If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.
You got that right.  Some machines had hardware support to keep track of the display (a rather nice feature to have in hardware.)

Without hardware support, walking the chain of stack frames is usually less overhead than maintaining a complete display in software (N. Wirth ran into that.)  When doing it in software, which is what FPC does, it's important for the compiler to implement a limited display structure (a simple cache of selected stack frame pointers) to prevent having to walk the stack chain for every variable reference.  That's a simple optimization too and, depending on the number of references and the nesting depth, it can make a noticeable difference.


I just ran it through my Delphi 4
It gives the same construction
So at least fpc is not worse than a 20 year old compiler
This is speculation but, I suspect that part of the reason that PC class Pascal compilers don't optimize nested stack frame accesses is because of OOP.    I believe (just a belief) that nested functions/procedures became less common when Pascal got "OOP-ed".
Title: Re: Optimization suggestion for nested variable access
Post by: MarkMLl on May 24, 2020, 10:13:34 pm
If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.
You got that right.  Some machines had hardware support to keep track of the display (a rather nice feature to have in hardware.)

If my memory is correct, I've seen one with a 6' square of LEDs showing the display state. I think the idea was that since it changed somewhat more slowly than the actual CPU registers etc. it was useful eyecandy.

http://www.cs.berkeley.edu/~culler/courses/cs252-s05/papers/burroughs.pdf figure 5 etc.

http://users.monash.edu.au/~ralphk/burroughs.html (note the size of the disc drive platter halfway down that page)

Quote
Without hardware support, walking the chain of stack frames is usually less overhead than maintaining a complete display in software (N. Wirth ran into that.)  When doing it in software, which is what FPC does, it's important for the compiler to implement a limited display structure (a simple cache of selected stack frame pointers) to prevent having to walk the stack chain for every variable reference.  That's a simple optimization too and, depending on the number of references and the nesting depth, it can make a noticeable difference.

Looking at it from a slightly different viewpoint, it would be interesting to know how well the CPU caches the stack, and whether it could be encouraged to handle a per-thread block earmarked as the  display in the most effective way possible.

MarkMLl
Title: Re: Optimization suggestion for nested variable access
Post by: marcov on May 24, 2020, 10:47:27 pm
This is speculation but, I suspect that part of the reason that PC class Pascal compilers don't optimize nested stack frame accesses is because of OOP.    I believe (just a belief) that nested functions/procedures became less common when Pascal got "OOP-ed".

Well. Delphi maybe (*), but FPC existed for years as non-OOP compiler. And D is also OOP afaik, so why doesn't it hold there?

While there is some truth in it, it is an overgeneralization

(*) one should wonder however what the 32-bit Delphi compiler (backend) was based on though. It might predate the Delphi dialect, technology wise.
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 25, 2020, 12:18:23 am
http://www.cs.berkeley.edu/~culler/courses/cs252-s05/papers/burroughs.pdf figure 5 etc.

http://users.monash.edu.au/~ralphk/burroughs.html (note the size of the disc drive platter halfway down that page)
Good stuff in that pdf.  The HTML page was a nice walk down the binary memory lane. :)

Looking at it from a slightly different viewpoint, it would be interesting to know how well the CPU caches the stack, and whether it could be encouraged to handle a per-thread block earmarked as the  display in the most effective way possible.
Yes, that is definitely something worth knowing.  To get an idea, I wrote the following two simple test programs:

Simple if (no nested accesses):
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program TestSimpleIf;
  4.   { program to test the time consumed to test against some limits             }
  5.  
  6.   const
  7.     MILLION1   =   1000000;
  8.     MILLION10  =  10000000;
  9.     MILLION100 = 100000000;
  10.  
  11.     BILLION    = 1000 * MILLION1;
  12.  
  13.     LOOP_LIMIT = BILLION;
  14.  
  15.   var
  16.     VariableLimit                   : qword;
  17.  
  18.     Variable1, Variable2, Variable3 : qword;
  19.  
  20.   procedure TestIf();
  21.   var
  22.     i : qword;
  23.  
  24.   begin
  25.     writeln('constant limit, global variables - Loop STARTING');
  26.  
  27.     for i := 1 to LOOP_LIMIT do
  28.     begin
  29.       Variable1 := i;
  30.       Variable2 := i;
  31.       Variable3 := i;
  32.  
  33.       if i >= LOOP_LIMIT then
  34.       begin
  35.  
  36.         writeln('constant limit, global variables - Loop DONE ', i);
  37.         writeln;
  38.       end;
  39.     end;
  40.   end;
  41.  
  42.   procedure TestIfVariableLimit();
  43.   var
  44.     i : qword;
  45.  
  46.   begin
  47.     writeln('Variable limit, global variables - Loop STARTING');
  48.  
  49.     for i := 1 to VariableLimit do
  50.     begin
  51.       Variable1 := i;
  52.       Variable2 := i;
  53.       Variable3 := i;
  54.  
  55.       if i >= VariableLimit then
  56.       begin
  57.         writeln('Variable limit, global variables - Loop DONE ', i);
  58.         writeln;
  59.       end;
  60.     end;
  61.   end;
  62.  
  63.   procedure TestIfVariableLimitLocal();
  64.   var
  65.     i : qword;
  66.  
  67.   begin
  68.     writeln('Variable limit, local variables - Loop STARTING');
  69.  
  70.     for i := 1 to VariableLimit do
  71.     begin
  72.       Variable1 := i;
  73.       Variable2 := i;
  74.       Variable3 := i;
  75.  
  76.       if i >= VariableLimit then
  77.       begin
  78.         writeln('Variable limit, local variables - Loop DONE ', i);
  79.         writeln;
  80.       end;
  81.     end;
  82.   end;
  83.  
  84.  
  85. begin
  86.   VariableLimit := LOOP_LIMIT;
  87.  
  88.   TestIfVariableLimit();
  89. end.
  90.  
For a billion iterations, that program runs in just a tad over a second (note: that includes the load time).

Nested if (access to nested variables):
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program TestNestedIf;
  4.   { program to test the time consumed to test against some limits             }
  5.  
  6.   const
  7.     MILLION1      =   1000000;
  8.     MILLION10     =  10000000;
  9.     MILLION100    = 100000000;
  10.  
  11.     BILLION       = 1000 * MILLION1;
  12.     BILLION5      =    5 * BILLION;
  13.  
  14.     LOOP_LIMIT    = BILLION;
  15.  
  16.   var
  17.     VariableLimit : qword;
  18.  
  19.   procedure OuterProcedure();                            { level 2 }
  20.   var
  21.     OuterVariable1, OuterVariable2, OuterVariable3 : qword;
  22.  
  23.  
  24.  
  25.     procedure Level3();                                  { level 3 }
  26.  
  27.       procedure Level4();                                { level 4 }
  28.  
  29.         procedure Level5();                              { level 5 }
  30.  
  31.           procedure TestIfVariableLimit();               { level 6 }
  32.           var
  33.             { FPC doesn't accept a nested variable as a            }
  34.             { "for" counter                                        }
  35.  
  36.             i : qword;
  37.  
  38.           begin
  39.             writeln('Variable limit - Loop STARTING');
  40.  
  41.             for i := 1 to VariableLimit do
  42.             begin
  43.               OuterVariable1 := i;
  44.               OuterVariable2 := i;
  45.               OuterVariable3 := i;
  46.  
  47.               if i >= VariableLimit then
  48.               begin
  49.                 writeln('Variable limit - Loop DONE ', i);
  50.                 writeln;
  51.               end;
  52.             end;
  53.           end;
  54.  
  55.           procedure TestIf();                            { level 6 }
  56.           var
  57.             i : qword;
  58.  
  59.           begin
  60.             writeln('Loop STARTING');
  61.  
  62.             for i := 1 to LOOP_LIMIT do
  63.             begin
  64.               OuterVariable1 := i;
  65.               OuterVariable2 := i;
  66.               OuterVariable3 := i;
  67.  
  68.               if i >= LOOP_LIMIT then
  69.               begin
  70.                 writeln('Loop DONE ', i);
  71.                 writeln;
  72.               end;
  73.             end;
  74.           end;
  75.  
  76.           procedure Level6();                            { level 6 }
  77.           begin
  78.             writeln('Level 6');
  79.             TestIfVariableLimit();
  80.           end;
  81.  
  82.         begin
  83.           writeln('Level 5');
  84.           Level6();
  85.         end;
  86.  
  87.       begin
  88.         writeln('Level 4');
  89.         Level5();
  90.       end;
  91.  
  92.     begin
  93.       writeln('level 3');
  94.       Level4();
  95.     end;
  96.  
  97.   begin
  98.     Level3();
  99.   end;
  100.  
  101.  
  102. begin
  103.   VariableLimit := LOOP_LIMIT;
  104.  
  105.   OuterProcedure();
  106. end.
  107.  
This one takes a little over 17 seconds.  IOW, it is about 17 times _slower_ than the program that has direct access to the variables it sets.  Walking the stack frame chain is expensive!.

NOTE: I use TCCLE which can time execution from the command line.  For this reason there is no timing mechanism in the program itself.



While there is some truth in it, it is an overgeneralization
It was speculation anyway.  I really don't know if it is an over-generalization or not but, I do believe that since classes/objects' data is accessed with just one de-reference, that is a disincentive for the compiler implementors to optimize nested variables accesses because the mantra has become "make it an object".




ETA:
The original test program from which those two are derived was meant to test how much time an "if" test took.  That's why the "if" is in the loop instead of outside.


Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 25, 2020, 01:32:52 am
You are comparing global vs local vars.

Rewrite the test.

Case 1: your nested
Case 2: all nested are nested level 1 directly in OuterProcedure

That means in deep nested: ("TEstIf")
Code: Pascal  [Select][+][-]
  1. # [67] OuterVariable1 := i;
  2.         movq    -8(%rbp),%rax
  3.         movq    -8(%rax),%rax
  4.         movq    -8(%rax),%rax
  5.         movq    -8(%rax),%rax
  6.         movq    -16(%rbp),%rdx
  7.         movq    %rdx,-8(%rax)
  8.  

And in flat nested
Code: Pascal  [Select][+][-]
  1. # [171] OuterVariable1 := i;
  2.         movq    -8(%rbp),%rax
  3.         movq    -16(%rbp),%rdx
  4.         movq    %rdx,-8(%rax)
  5.  

reducing the "movq   -8(%rbp),%rax" to one quarter.

FPC 3.3.1  / 64 bit windows
             deep nest / flat nest
-O1       2.6 seconds / 1.7 seconds
-O4       1.9               /  0.7

FPC 3.0.4  / 64 bit windows
-O1       2.6 seconds / 1.8 seconds

Still a difference.
It about  half as fast (factor 2.5 with O4). Rather that factor ~17.
Of course it keeps one level of stack walking. But even if that was eliminated,  I doubt that makes factor 10 ?

No idea why you get such different numbers

Your un-nested (global var), on my machine
FPC 3.0.4  / 64 bit windows
-O1      1.8
-O4      0.7

So that is about the same speed as all procs with 1 level nested.


- all tests done with BILLION
- intel I7 8700K
- timings by gettickcount64
- all runs done twice or more, to spot any exceptions (none found)


Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 25, 2020, 01:48:38 am
You are comparing global vs local vars.
That was the point of the test.  To measure the overhead of walking the stack chain to reach the variables.

As the nesting level increases, so does the performance loss.  Walking the stack chain only once upon entry to the function/procedure would eliminate that loss.  Note, it would be still be a bit slower because there would always be one (1) additional access through the cached frame pointer but, that loss would be small and, it would be almost fully independent of the nesting level.  Almost fully, because the time consumed by the initial walk upon entry depends on the nesting level.

Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 25, 2020, 02:21:06 am
You are comparing global vs local vars.
That was the point of the test.  To measure the overhead of walking the stack chain to reach the variables.

As the nesting level increases, so does the performance loss.  Walking the stack chain only once upon entry to the function/procedure would eliminate that loss.  Note, it would be still be a bit slower because there would always be one (1) additional access through the cached frame pointer but, that loss would be small and, it would be almost fully independent of the nesting level.  Almost fully, because the time consumed by the initial walk upon entry depends on the nesting level.

Well, yes I wasn't 100% correct either.

Access to global vars is different even to accessing none-nested locals (though probably not much of a time diff).
Therefore both tests would need locals. One would need nested, and the other un-nested.

But missing the other point.
Even doing the exact same test as you did, I do not get the same results: For me the factor was 2 or 2.5 instead of 17.
How to get your numbers?



Btw, I copied the test as it is => So I probably only run "TestIfVariableLimit". The others are not called.


Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 25, 2020, 03:34:45 am
Access to global vars is different even to accessing none-nested locals (though probably not much of a time diff).
Therefore both tests would need locals. One would need nested, and the other un-nested.
Fair enough and, I even forgot to test with local variables (tested only against globals.) That's also why the "local variables" test is missing the local variables <chuckle>.  Here is the corrected test program that includes the local variables:
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program TestSimpleIf;
  4.   { program to test the time consumed to test against some limits             }
  5.  
  6.   const
  7.     MILLION1   =   1000000;
  8.     MILLION10  =  10000000;
  9.     MILLION100 = 100000000;
  10.  
  11.     BILLION    = 1000 * MILLION1;
  12.  
  13.     LOOP_LIMIT = BILLION;
  14.  
  15.   var
  16.     VariableLimit                   : qword;
  17.  
  18.     Variable1, Variable2, Variable3 : qword;
  19.  
  20.   procedure TestIf();
  21.   var
  22.     i : qword;
  23.  
  24.   begin
  25.     writeln('constant limit, global variables - Loop STARTING');
  26.  
  27.     for i := 1 to LOOP_LIMIT do
  28.     begin
  29.       Variable1 := i;
  30.       Variable2 := i;
  31.       Variable3 := i;
  32.  
  33.       if i >= LOOP_LIMIT then
  34.       begin
  35.  
  36.         writeln('constant limit, global variables - Loop DONE ', i);
  37.         writeln;
  38.       end;
  39.     end;
  40.   end;
  41.  
  42.   procedure TestIfVariableLimit();
  43.   var
  44.     i : qword;
  45.  
  46.   begin
  47.     writeln('Variable limit, global variables - Loop STARTING');
  48.  
  49.     for i := 1 to VariableLimit do
  50.     begin
  51.       Variable1 := i;
  52.       Variable2 := i;
  53.       Variable3 := i;
  54.  
  55.       if i >= VariableLimit then
  56.       begin
  57.         writeln('Variable limit, global variables - Loop DONE ', i);
  58.         writeln;
  59.       end;
  60.     end;
  61.   end;
  62.  
  63.   procedure TestIfVariableLimitLocal();
  64.   var
  65.     i                               : qword;
  66.     Variable1, Variable2, Variable3 : qword;
  67.  
  68.   begin
  69.     writeln('Variable limit, local variables - Loop STARTING');
  70.  
  71.     for i := 1 to VariableLimit do
  72.     begin
  73.       Variable1 := i;
  74.       Variable2 := i;
  75.       Variable3 := i;
  76.  
  77.       if i >= VariableLimit then
  78.       begin
  79.         writeln('Variable limit, local variables - Loop DONE ', i);
  80.         writeln;
  81.       end;
  82.     end;
  83.   end;
  84.  
  85.  
  86. begin
  87.   VariableLimit := LOOP_LIMIT;
  88.  
  89.   //TestIf();
  90.   //TestIfVariableLimit();
  91.   TestIfVariableLimitLocal();
  92. end.
The TestNestedIf program is unchanged. 


Even doing the exact same test as you did, I do not get the same results: For me the factor was 2 or 2.5 instead of 17.
How to get your numbers?
I just run the two programs, nothing else. 

Here are the results I get:

For the simple if :
Code: Text  [Select][+][-]
  1. Timer 1 on: 21:17:12
  2. Variable limit, local variables - Loop STARTING
  3. Variable limit, local variables - Loop DONE 1000000000
  4.  
  5. Timer 1 off: 21:17:13  Elapsed: 0:00:01.03

For the nested if:
Code: Text  [Select][+][-]
  1. Timer 1 on: 21:18:30
  2. level 3
  3. Level 4
  4. Level 5
  5. Level 6
  6. Variable limit - Loop STARTING
  7. Variable limit - Loop DONE 1000000000
  8.  
  9. Timer 1 off: 21:18:47  Elapsed: 0:00:17.50
That is a factor of 17 (thereabouts)

If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4


Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 25, 2020, 09:30:07 am
If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

I run the 2 exact versions that you have posted.

For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?
Title: Re: Optimization suggestion for nested variable access
Post by: MathMan on May 25, 2020, 10:42:15 am
If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.

Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.

Hardware is an i860 @ 2.8Ghz running Win7

ETA:
FPC v3.0.4

I run the 2 exact versions that you have posted.

For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.

It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The capabilities for register renaming, simultaneous operations in flight and the loop buffer size all have increased significantly between the processor versions mentioned. I would subscribe the measured differences mainly to this, and especially the first.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.

While the branch-predictor of the i860 is weaker than than the one from i8700 it should still predict correct nearly all of the 1 billion iterations, so I don't think that this one is the culprit.

How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?

Beware to carefully analyse the generated asm - depending on compiler capabilities (don't know if FPC 3.0.4 can, but) the address of the variable may be kept in a local register for the remaining 49 accesses, which would make the timing of the unrolled loop hard to compare to the rolled-one.
Title: Re: Optimization suggestion for nested variable access
Post by: marcov on May 25, 2020, 11:34:39 am
Sandy bridge (i3/5/7-2xxx and later) have a uop cache that avoids redecoding loops. In general it makes loops that fit the uop-cache quite a lot faster. But maybe in this case it is just general load/store performance.

I do wonder why both FPC and Delphi don't mark a framepointer for a certain frame as a intermediate value that is register coloured. It should be invariant inside a procedure.
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 25, 2020, 11:59:53 am
It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.

The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.
I have a hard time believing that those things could account for a difference that large between the results. 

I'd like you to run this little test program on your machine and report the times:
Code: Pascal  [Select][+][-]
  1. {$APPTYPE CONSOLE}
  2.  
  3. program PointerDereference;
  4.  
  5. type
  6.   PQWORD      = ^qword;
  7.   PPQWORD     = ^PQWORD;
  8.   PPPQWORD    = ^PPQWORD;
  9.   PPPPQWORD   = ^PPPQWORD;
  10.   PPPPPQWORD  = ^PPPPQWORD;
  11.  
  12.   const
  13.     MILLION1      =   1000000;
  14.     MILLION10     =  10000000;
  15.     MILLION100    = 100000000;
  16.  
  17.     BILLION       = 1000 * MILLION1;
  18.     BILLION5      =    5 * BILLION;
  19.  
  20.     LOOP_LIMIT    = BILLION;
  21.  
  22.   var
  23.     VariableLimit : qword;
  24.  
  25.   procedure TestIfVariableLimit();
  26.   var
  27.     { FPC doesn't accept a nested variable as a            }
  28.     { "for" counter                                        }
  29.  
  30.     i                                              : qword;
  31.  
  32.     OuterVariable1, OuterVariable2, OuterVariable3 : qword;
  33.  
  34.     a1, a2, a3 : PQWORD;
  35.     b1, b2, b3 : PPQWORD;
  36.     c1, c2, c3 : PPPQWORD;
  37.     d1, d2, d3 : PPPPQWORD;
  38.     e1, e2, e3 : PPPPPQWORD;
  39.  
  40.  
  41.   begin
  42.     a1 := @OuterVariable1;
  43.     b1 := @a1;
  44.     c1 := @b1;
  45.     d1 := @c1;
  46.     e1 := @d1;
  47.  
  48.     a2 := @OuterVariable2;
  49.     b2 := @a2;
  50.     c2 := @b2;
  51.     d2 := @c2;
  52.     e2 := @d2;
  53.  
  54.     a3 := @OuterVariable3;
  55.     b3 := @a3;
  56.     c3 := @b3;
  57.     d3 := @c3;
  58.     e3 := @d3;
  59.  
  60.  
  61.     writeln('Variable limit - Loop STARTING');
  62.  
  63.     for i := 1 to VariableLimit do
  64.     begin
  65.       e1^^^^^ := 1;
  66.       e2^^^^^ := 2;
  67.       e3^^^^^ := 3;
  68.  
  69.       if i >= VariableLimit then
  70.       begin
  71.         writeln('Variable limit - Loop DONE ', i);
  72.         writeln;
  73.         writeln('OuterVariable1 : ', OuterVariable1);
  74.         writeln('OuterVariable2 : ', OuterVariable2);
  75.         writeln('OuterVariable3 : ', OuterVariable3);
  76.       end;
  77.     end;
  78.   end;
  79.  
  80. begin
  81.   VariableLimit := BILLION;
  82.  
  83.   TestIfVariableLimit();
  84. end.            
The code above, on my machine, takes a tad over 17 seconds, which is what I expect since it is simulating the stack walk that's done in the other program.

Run it as is and, post your time.  Mine time is:
Code: Text  [Select][+][-]
  1. Timer 1 on:  5:51:32
  2. Variable limit - Loop STARTING
  3. Variable limit - Loop DONE 1000000000
  4.  
  5. OuterVariable1 : 1
  6. OuterVariable2 : 2
  7. OuterVariable3 : 3
  8. Timer 1 off:  5:51:50  Elapsed: 0:00:17.58
which is what I expected. 

ETA:
It would be nice too if the compiler realized that e1, e2 and e3 are invariants that could be moved out of the loop.
Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 25, 2020, 12:35:47 pm
I'd like you to run this little test program on your machine and report the times:

3.0.4
-O1  2.87 seconds
-O4  1.95

O1:
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    -136(%rbp),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    (%rax),%rax
  7.         movq    $1,(%rax)
  8.  
O4:
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  

Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 25, 2020, 12:46:13 pm
O4:
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
That's the exact same code I get and it takes a tad over 17 seconds to run on my machine.  I admit to being very surprised that current processors managed to improve on the i860 that much.  It's an impressive improvement.

Thank you for testing it.
Title: Re: Optimization suggestion for nested variable access
Post by: lucamar on May 25, 2020, 01:29:09 pm
Probably a stupid question but, how have any of you managed to compile that code with FPC 3.0.4? Whatever I try it always result in this error:
Code: [Select]
PointerDereference.pas(63,9) Error: (4007) Ordinal expression expected
which references the line:
Code: Pascal  [Select][+][-]
  1. for i := 1 to VariableLimit do

Variable i is a QWord which, according to the docs, is not  a proper "ordinal", so the error might be normal but both of you have managed to compile it. Hence my somewhat plaintive question: how?  :'(
Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 25, 2020, 01:33:19 pm
My 64 bit 3.0.4 did the job.
I have not tried any 32 bit fpc.
And 3.3.1 (1 or 2 month old) gave the error.
Title: Re: Optimization suggestion for nested variable access
Post by: lucamar on May 25, 2020, 02:32:42 pm
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
Title: Re: Optimization suggestion for nested variable access
Post by: ASerge on May 25, 2020, 03:37:41 pm
I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.
Title: Re: Optimization suggestion for nested variable access
Post by: Thaddy on May 25, 2020, 03:40:47 pm
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
The limit for ordinals....? You know that..
Title: Re: Optimization suggestion for nested variable access
Post by: lucamar on May 25, 2020, 04:42:33 pm
The limit for ordinals....? You know that..

Yeah, I should but I must confess that most of my development is in 32 bits, so I'm no very keen about the differences in 64 bits. :-[

After re-reading the Reference guide all becomes clearer:
Quote from: 3.1.1 Ordinal types
Remark: Int64 and QWord are considered ordinal types on 64-bit CPUs. On 32-bit types they have some of the characteristics of ordinals, but they cannot be used e.g. in for loops.

So here ends this off-topic "excursion". Sorry about it. O:-)
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 25, 2020, 07:12:58 pm
My 64 bit 3.0.4 did the job.

Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
Yes, that's correct.  The loop counter can be a qword only when compiling for 64 bit because it is not considered an ordinal type in 32 bit.



I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.
That's probably true in OOP.  In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (but they don't show in the code structure.)
Title: Re: Optimization suggestion for nested variable access
Post by: ASerge on May 26, 2020, 09:52:34 pm
In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (but they don't show in the code structure.)
Strongly doubt.
Level one - the usual procedures that are most often used.
Level two - nested procedure. The convenience of the pascal language, which some languages lack. Allows you to hide the implementation, reducing the code's visibility, without losing clarity.
If the size of the level two procedure is large, then in principle you can use level three. But in practice, it is more convenient to do several procedures at level 2 with rare exception is when there is an interaction between procedures that needs to be hidden. But this is usually very complex code and sometimes it is more useful to convert it into classes or make several first-level procedures.
Level four usually talks about a very crumpled code and honestly, I've never seen this before.
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 27, 2020, 12:06:19 am
Strongly doubt.
I believe you.

Level one - the usual procedures that are most often used.
normal stuff.

Level two - nested procedure. The convenience of the pascal language, which some languages lack. Allows you to hide the implementation, reducing the code's visibility, without losing clarity.
and it's not just about hiding the implementation.  It is a way of letting the programmer know that those functions/procedures are not used anywhere else.  That is real purpose of nested functions/procedures: to declare and control their scope and visibility.  IOW, not create binary jigsaw puzzles (i.e, everything is a "first level" procedure.)

If the size of the level two procedure is large, then in principle you can use level three. But in practice, it is more convenient to do several procedures at level 2 with rare exception is when there is an interaction between procedures that needs to be hidden.
It's not just about size.  If a sequence of code is used _only_ in a particular function/procedure and it is used more than once then, it should be a nested procedure that is above the scope of its callers and, that rule is recursive.

But this is usually very complex code
Complex code has that tendency but, it can also be simple code that has an obvious logical hierarchy and code sequences in common among the hierarchy levels.

and sometimes it is more useful to convert it into classes or make several first-level procedures.
As I am sure you already know, before I convert a procedure or function to a class, I will issue the ultimate conversion command: "del <sourcefilename.pas>... result guaranteed to be clean and bug-free.  Making them first level procedures is correct only if those functions/procedures are general in nature (utilitarian, like formatting an integer and stuff like that), if they are specific and used only by one sequence of code in the program then nesting is the correct implementation.

Level four usually talks about a very crumpled code and honestly, I've never seen this before.
Our opinions differ about that.  When I see four levels, I see that the programmer made an effort to organize and structure the code.  I admit, I didn't use to see it very often and since OOP, it's become almost as rare as a unicorn with two tails.
Title: Re: Optimization suggestion for nested variable access
Post by: JernejL on May 27, 2020, 06:23:06 am
I have a really big game AI FSM machine - a lot of code in my game which i designed with nested subroutines so they could share variables.
Does this apply always, or can i just mark them as inline and that would solve it?
 
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 27, 2020, 07:14:54 am
Does this apply always, or can i just mark them as inline and that would solve it?
Marking them "inline" _might_ help.  _Might_ because what it really depends on is how many scopes above the current one the referenced variable is located. 

After seeing how much more efficient new processors are at walking the stack chain than older processors, I don't think it's justified to change the code in any way to attempt a "manual" optimization.  It would still be nice if the compiler did the optimization but, on newer processors that optimization isn't nearly as important as it used to be.

I use nested functions/procedures extensively and, if I inline a function/procedure, it's only because it is used only once, not because it might shorten the stack frames chain.


Title: Re: Optimization suggestion for nested variable access
Post by: PascalDragon on May 27, 2020, 09:31:03 am
I have a really big game AI FSM machine - a lot of code in my game which i designed with nested subroutines so they could share variables.
Does this apply always, or can i just mark them as inline and that would solve it?

If you have simple functions that for example simply do a calculation based on some variables of the parent frame then you might benefit from inlining. You'll need to measure it to see whether the performance gain outweighs the increase in binary size.
Title: Re: Optimization suggestion for nested variable access
Post by: ASBzone on May 27, 2020, 01:49:40 pm
That's the exact same code I get and it takes a tad over 17 seconds to run on my machine.  I admit to being very surprised that current processors managed to improve on the i860 that much.  It's an impressive improvement.

Thank you for testing it.

Interestingly enough, here's what I get:

Compiled on ... FPC 3.2.0-beta-r45317  64-bit  (-O4)
Hardware ...... Intel(R) Core(TM) i5-2410M CPU at 2.30GHz
OS ............ Windows 10 1909 x64

Runtimes of the executable on various systems:

21.183 seconds -- Above system
15.723 seconds -- Intel(R) Xeon(R) CPU E31235 at 3.20GHz
18.048 seconds -- Intel(R) Core(TM) i5-7300HQ CPU at 2.50GHz
14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]

Of course, you do know that Intel is paying for *some* of that branch prediction advantage by way of security issues... :)

I need to test this on some more processors and see.   I don't have any newer ones to try it on as yet.
Title: Re: Optimization suggestion for nested variable access
Post by: ASerge on May 27, 2020, 07:43:36 pm
When I see four levels, I see that the programmer made an effort to organize and structure the code.
You are a rare lover of complications. But I don't recommend it to others. And the phrase about the fourth level and well-organized code - I was looking for a long time, where is the smile. Not found - a difficult case.
Title: Re: Optimization suggestion for nested variable access
Post by: winni on May 27, 2020, 08:16:03 pm
Hi!

In all those  years of Pascal:  I never saw a fourth level.
And I never used it .

So the fourth level seems to be a unicorn.

Winni
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 27, 2020, 08:42:31 pm
21.183 seconds -- Above system
15.723 seconds -- Intel(R) Xeon(R) CPU E31235 at 3.20GHz
18.048 seconds -- Intel(R) Core(TM) i5-7300HQ CPU at 2.50GHz
14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]

I need to test this on some more processors and see.   I don't have any newer ones to try it on as yet.
The results you get are along the lines of what I expect. 

I can't put my finger on it but, executing this code sequence
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
3 times (for e1, e2 and e3) in under 3 seconds doesn't add up.  Each of those instructions take on the order of 3 clock cycles, times 5, that's 15, times 3 (e1, e2, e3), that's 45, times a billion, that's 45 billlon.  IOW, a 3.7Ghz processor is magically executing somewhere in the ballpark of 20 billion instructions per second.  It doesn't add up.



You are a rare lover of complications.
I don't think of organization as creating complication, on the contrary.

But I don't recommend it to others.
I don't recommend any particular level depth.  What I "recommend" is for the code structure to be parallel to the problem structure.  The number of levels is irrelevant, the structure of the problem determines the structure of the solution.



In all those  years of Pascal:  I never saw a fourth level.
There are plenty of things that should be commonly seen that are either, never seen or, extremely rarely seen.  Their absence is no indication they are undesirable or their presence an indication of desirability.  I'm sure you've seen a few goto(s) but, I doubt you'd use that as a basis to recommend their use.
Title: Re: Optimization suggestion for nested variable access
Post by: marcov on May 27, 2020, 10:26:48 pm
I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.

As said, some of the optimizations are not level dependent. Caching an used parent frame in a register is not dependent on nesting depth.

But yeah, somebody that cares has to dive into such things (  ;) 440bx)
Title: Re: Optimization suggestion for nested variable access
Post by: ASBzone on May 28, 2020, 04:30:46 am

I can't put my finger on it but, executing this code sequence
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
3 times (for e1, e2 and e3) in under 3 seconds doesn't add up.  Each of those instructions take on the order of 3 clock cycles, times 5, that's 15, times 3 (e1, e2, e3), that's 45, times a billion, that's 45 billlon.  IOW, a 3.7Ghz processor is magically executing somewhere in the ballpark of 20 billion instructions per second.  It doesn't add up.

Well, I have no 9th gen Intel here to test with, nor a Zen2 based Ryzen to test with -- yet.

We'll see when I get that. 
Title: Re: Optimization suggestion for nested variable access
Post by: MathMan on May 28, 2020, 02:08:33 pm
I can't put my finger on it but, executing this code sequence
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
3 times (for e1, e2 and e3) in under 3 seconds doesn't add up.  Each of those instructions take on the order of 3 clock cycles, times 5, that's 15, times 3 (e1, e2, e3), that's 45, times a billion, that's 45 billlon.  IOW, a 3.7Ghz processor is magically executing somewhere in the ballpark of 20 billion instructions per second.  It doesn't add up.

Caveat - the following is just a rough breakdown, to really understand the differences between the cores/timings presented so far one would have to analyse processor performance counters.

Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. The most CPU's mentioned so far have two read-ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means that the sequence can be executed in just 2 clock cycles in a continuous model. To really do this several conditions must be met, especially as the code reuses RAX all the way and meeting these conditions really makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

Cheers,
MathMan
Title: Re: Optimization suggestion for nested variable access
Post by: marcov on May 28, 2020, 05:16:44 pm
Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. The most CPU's mentioned so far have two read-ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means that the sequence can be executed in just 2 clock cycles in a continuous model. To really do this several conditions must be met, especially as the code reuses RAX all the way and meeting these conditions really makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

If the loads are not independent, e.g. some of those are in the same cache line, do they still occupy a read port each?

cachelines have 64byte nowadays, and in these rather parameterless procedures, everything on the stack would be quite close together.
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 28, 2020, 08:32:13 pm
So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.
I see what you're saying but, this is what I cannot reconcile in my mind:

There are 3 such sequences, therefore 3 * 5 (instructions per sequence) = 15, one "if" at, let's say 1 clock cycle, that makes it 16, one "for" index test, again at 1 clock cycle, that's 17 and, there is a jump to implement the loop, another clock cycle, for a total of 18.  That's done a billion times, for a total of 18 billion cycles.  At 3.7Ghz, that would be 18/3.7 = 4.86 seconds.

Getting execution time to be under 4.86 seconds is conceivable but, getting it to be less than _half_ that time, that's the part I am having a hard time with. 
 
Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 28, 2020, 09:27:03 pm
Ok, so below is the generated asm by fpc 3.3.1 (approx 1 or 2 month old), with -O4.

1 Billion loops.
This takes 1.937 secs. (tickcount64)

while it still does not help the result of your calc, an I7 8700K runs at up to 4.7 GHz (since the code is not multithreaded.).
I did not monitor, and as other software is running in the background it may have been at 4.6 or 4.5.

Your "18 div freq" would then be 18/4.7 = 3.8. Somehow that is exactly 2*1.9. So somehow the CPU must have optimized it further....
(though there are more than 18 asm statements)

According to https://easyperf.net/blog/2018/03/21/port-contention it seems to me, that the CPU kind of processes up to 2 instructions per time? (I did not spent to much time reading it, may have gotten it wrong)


Code: ASM  [Select][+][-]
  1. # [43] for i := 1 to VariableLimit do
  2.         movq    U_$P$TESTNESTEDIF_$$_VARIABLELIMIT(%rip),%rdi
  3.         cmpq    $1,%rdi
  4.         jnae    .Lj23
  5. # Peephole Optimization: xorq %rsi,%rsi -> xorl %esi,%esi (removes REX prefix)
  6.         xorl    %esi,%esi
  7.         .p2align 4,,10
  8.         .p2align 3
  9. .Lj24:
  10.         addq    $1,%rsi
  11. .Ll40:
  12. # [45] OuterVariable1 := i;
  13.         movq    32(%rsp),%rax
  14.         movq    32(%rax),%rax
  15.         movq    32(%rax),%rax
  16.         movq    32(%rax),%rax
  17.         movq    %rsi,32(%rax)
  18. .Ll41:
  19. # [46] OuterVariable2 := i;
  20.         movq    32(%rsp),%rax
  21.         movq    32(%rax),%rax
  22.         movq    32(%rax),%rax
  23.         movq    32(%rax),%rax
  24.         movq    %rsi,40(%rax)
  25. .Ll42:
  26. # [47] OuterVariable3 := i;
  27.         movq    32(%rsp),%rax
  28.         movq    32(%rax),%rax
  29.         movq    32(%rax),%rax
  30.         movq    32(%rax),%rax
  31.         movq    %rsi,48(%rax)
  32. .Ll43:
  33. # [49] if i >= VariableLimit then
  34.         cmpq    U_$P$TESTNESTEDIF_$$_VARIABLELIMIT(%rip),%rsi
  35.         jnae    .Lj28
  36. .Ll44:
  37. # [51] writeln('Variable limit - Loop DONE ', i);
  38.         call    fpc_get_output
  39.         movq    %rax,%rbx
  40.         leaq    _$TESTNESTEDIF$_Ld8(%rip),%r8
  41. # Peephole Optimization: %rbx = %rax; changed to minimise pipeline stall (MovMov2Mov 6a}
  42.         movq    %rax,%rdx
  43.         xorl    %ecx,%ecx
  44.         call    fpc_write_text_shortstr
  45.         call    fpc_iocheck
  46.         movq    %rsi,%r8
  47.         movq    %rbx,%rdx
  48.         xorl    %ecx,%ecx
  49.         call    fpc_write_text_uint
  50.         call    fpc_iocheck
  51.         movq    %rbx,%rcx
  52.         call    fpc_writeln_end
  53.         call    fpc_iocheck
  54. .Ll45:
  55. # [52] writeln;
  56.         call    fpc_get_output
  57.         movq    %rax,%rcx
  58.         call    fpc_writeln_end
  59.         call    fpc_iocheck
  60. .Lj28:
  61. .Ll46:
  62.         cmpq    %rdi,%rsi
  63.         jnae    .Lj24
  64. .Lj23:
  65. .Ll47:
  66. # [55] end;
  67.  
Title: Re: Optimization suggestion for nested variable access
Post by: Martin_fr on May 28, 2020, 09:49:35 pm
According to https://easyperf.net/blog/2018/03/21/port-contention it seems to me, that the CPU kind of processes up to 2 instructions per time? (I did not spent to much time reading it, may have gotten it wrong)

So according to https://en.wikipedia.org/wiki/Instructions_per_second

Look for the 8086 => afaik same chip, just silicon lottery selection for higher frequency:  7.39 IPC (for a single core).

IPC is per cycle, so frequency does not matter.
However that single core still has hyper threading.  So running only one thread, and my I7 8700k should do 3.69 IPC. 
(not exact, since hyper-threading would be slower, than 2 full real cores / so the value may be bigger for single thread)

Title: Re: Optimization suggestion for nested variable access
Post by: MarkMLl on May 28, 2020, 09:53:28 pm
Your "18 div freq" would then be 18/4.7 = 3.8. Somehow that is exactly 2*1.9. So somehow the CPU must have optimized it further....
(though there are more than 18 asm statements)

But based on my reading around when the Intel uCode flaws were forced into the open, those 18+ x86 or x86_64 opcodes would have been translated internally into triads of simpler instructions, which themselves would be optimised and swapped around at the hardware level. And AIUI there are various trace buffers etc. which mean that the lowest level optimisations can evolve with time.

MarkMLl
Title: Re: Optimization suggestion for nested variable access
Post by: MathMan on May 29, 2020, 08:07:37 am
So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.
I see what you're saying but, this is what I cannot reconcile in my mind:

There are 3 such sequences, therefore 3 * 5 (instructions per sequence) = 15, one "if" at, let's say 1 clock cycle, that makes it 16, one "for" index test, again at 1 clock cycle, that's 17 and, there is a jump to implement the loop, another clock cycle, for a total of 18.  That's done a billion times, for a total of 18 billion cycles.  At 3.7Ghz, that would be 18/3.7 = 4.86 seconds.

Getting execution time to be under 4.86 seconds is conceivable but, getting it to be less than _half_ that time, that's the part I am having a hard time with. 

Current CPU's are multi-issue - that is a single core can execute 2 (860 iirc), 3 (Core-I 2 Nehalem)) or even 4 (everything from Haswell onward) instructions per single clock. That might be the point resolving that mental barrier. Again - reaching this limit on actual code is tough, due to various internal limitations of a CPU. But it is feasable - I have coded some arithmetic functions that actually reach 3.5 instructions per clock (IPC) sustained. But I just see that Martin also explained the link to IPC.
Title: Re: Optimization suggestion for nested variable access
Post by: MathMan on May 29, 2020, 08:13:47 am
Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. The most CPU's mentioned so far have two read-ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means that the sequence can be executed in just 2 clock cycles in a continuous model. To really do this several conditions must be met, especially as the code reuses RAX all the way and meeting these conditions really makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

If the loads are not independent, e.g. some of those are in the same cache line, do they still occupy a read port each?

cachelines have 64byte nowadays, and in these rather parameterless procedures, everything on the stack would be quite close together.

Short answer - yes. Extended answer - the ports are not directly connected to the cache. Instead they work with read/write buffers to de-/multiplex arbitrary sized register info to/from arbitrary memory addresses. It would not be reasonable to try and provide such a DEMUX network into the L1 cache lines directly. However the buffers are aligned to the basic address scheme of the cache. That is they too have a 64 byte granularity to simplify data exchange with the L1 caches.
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on May 29, 2020, 08:50:46 am
Current CPU's are multi-issue - that is a single core can execute 2 (860 iirc), 3 (Core-I 2 Nehalem)) or even 4 (everything from Haswell onward) instructions per single clock.
I understand that but, that normally requires the sequence of instructions to be fairly (or totally) independent of each other.  In a sequence of instructions where one instruction depends on the result of the previous one, the CPU's ability to parallelize those instructions is severely compromised, particularly when the value to be used is not a calculated value but, a value retrieved from a yet to be known pointer.

Compilers are extremely "aware" of that limitation and it is because of it that they will very often rearrange instruction sequences whenever possible.

I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions requires but, I don't see how. I'm sure some - probably many of the things you've mentioned - come into play but, I do not see how they could be sufficient.  It's like they managed to make clairvoyant CPUs that know the result before it's even calculated.  That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.

My hat off to the Intel guys that somehow managed that feat.
Title: Re: Optimization suggestion for nested variable access
Post by: MathMan on May 29, 2020, 10:45:33 am
I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions requires but, I don't see how. I'm sure some - probably many of the things you've mentioned - come into play but, I do not see how they could be sufficient.  It's like they managed to make clairvoyant CPUs that know the result before it's even calculated.  That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.

I can fully understand your problems with accepting that there is parallelism in this. When looking at the assembler code it seems totally infeasible. I'll try to explain where the parallelism comes into play.

The first thing you have to understand here is that the assembler code is today only a very shallow representation of what a CPU is actually doing.

When looking at the 5 instruction sequence than in fact this is inherently sequential - they form whats called a "dependency chain". Lets assume that these five instructions represent the setting of the first variable. Then, and this is crucial for understanding, there is a completely independent chain of five instructions to set the second variable. These two independent chains could be executed in parallel - and the sequence for setting the third variable could again be executed in parallel to the first two. And so on, while you are round robbing through the loop. So - magically there is already parallelism!

Immediate reaction to that is "but wait, dont't all the assembler instructions refer to RAX? How can this be parallelized then?". That is, because the CPU internally isn't using RAX at all - it uses representations of RAX and there can be many (up to 168 on a Skylake Core-I 7xxx). So for the first 5 instructions it will use let's say RAX[1], on the second RAX[2], etc. And suddenly it can exploit the parallelism of the code.

Another reaction might be "but it has to execute one instruction after the other, otherwise there would be no chance to evaluate states like register content, condition codes etc.". When a CPU reads an instruction sequence it tags every instruction in a way that states the sequence. Internally it is then free to shuffle things around until the point where instructions have to be put off as "being executed" <= this is handled in the "retirement unit" of the CPU. And the "retirement unit" then sequences instructions back to the way they were read from the instruction stream. Now there can be several 100 instructions "in flight" inside the CPU core - and thus the parallelism of the code is actually exploited.

Of course there are instances when the CPU has to revert to some determined state <= that is everything is executed up to a specific instruction in the instruction stream. This happens implicitely e.g. when an interrupt occures, or explcitely with the CPUID instruction. But beside that, during "normal" execution you can not tell from the outside which instructions the CPU is actually working on! It's like "Schrödingers Katze" - only when you lift the lid you see in what state it is.

Hope that made this a bit clearer.
Title: Re: Optimization suggestion for nested variable access
Post by: marcov on May 29, 2020, 02:20:47 pm
Current CPU's are multi-issue - that is a single core can execute 2 (860 iirc), 3 (Core-I 2 Nehalem)) or even 4 (everything from Haswell onward) instructions per single clock.

6 uops even when executing from the uop cache.
Title: Re: Optimization suggestion for nested variable access
Post by: ASBzone on November 12, 2020, 05:38:41 am
I need to test this on some more processors and see.   I don't have any newer ones to try it on as yet.

I finally got my hands on a Ryzen 4000 series mobile device (Zen2) -- yes, just as Zen3 arrives.

Nonetheless, I revisited this to post the following update:

Compiled on ... FPC 3.2.1-r47152  64-bit  (-O4)
Hardware ...... AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
OS ............ Windows 10 2009 x64

Runtimes of the executable on various systems:

02.882 seconds -- Above system
02.728 seconds -- AMD Ryzen 7 4800H with Radeon Graphics at 2.9Ghz
06.031 seconds -- Intel® Xeon® CPU E5530 at 2.40GHz [2400 Mhz]
14.797 seconds -- Intel® Core™ i7-4790K CPU at 4.00GHz [3991 Mhz]   **** Spectre/Meltdown mitigations enabled
20.380 seconds -- Intel(R) Core(TM) i5-2410M CPU at 2.30GHz         **** Spectre/Meltdown mitigations enabled

I notice that the performance is very different from when I had compiled it previously...

And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems.   When I disabled them, the performance changed dramatically.

Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
NEW:  02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on November 12, 2020, 07:16:01 am
And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems.   When I disabled them, the performance changed dramatically.

Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
NEW:  02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
Thank you for reporting that. 

Maybe a sequence of instructions that is subject to multi-stage speculative execution could be used to determine if the mitigations are present or absent.  <evil grin>
Title: Re: Optimization suggestion for nested variable access
Post by: MathMan on November 12, 2020, 01:59:32 pm
And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems.   When I disabled them, the performance changed dramatically.

Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
NEW:  02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
Thank you for reporting that. 

Maybe a sequence of instructions that is subject to multi-stage speculative execution could be used to determine if the mitigations are present or absent.  <evil grin>

First of all I must say, that I am surprised that AMD cores seem to be equivalently affected by Spectre/Meltdown mitigation under Win10. Looks like MS is activating full mitigation regardless of CPU type.

A reasonable sequence is exactly the one that was presented in the beginning of this thread - a pointer chasing. But there might be easier ways to detect presence than by measuring execution times of pointer chases - e.g. directly querying the BIOS, or the OS. IIRC both can indicate if they have been enhanced to mitigate Spectre/Meltdown.
Title: Re: Optimization suggestion for nested variable access
Post by: ASBzone on November 12, 2020, 06:49:30 pm
First of all I must say, that I am surprised that AMD cores seem to be equivalently affected by Spectre/Meltdown mitigation under Win10. Looks like MS is activating full mitigation regardless of CPU type.

Possibly.  I suspect that the software side of the mitigation is onerous from the Windows side.   The only reason I even noticed here, is that I was doing some benchmarking a couple weeks back and noticed on the new laptop that it had a significant impact in just a few areas, and took it off to test.    Then I remembered that I wanted to revisit this, but the scores were all confusing until I remembered that I had disabled the mitigation on a couple of different systems.
Title: Re: Optimization suggestion for nested variable access
Post by: 440bx on November 12, 2020, 09:39:04 pm
Hope that made this a bit clearer.
In my efforts to see/understand how they managed to make that degree of parallelism occur in that sequence of instructions, I failed to thank you for your efforts to explain it. Therefore, here is the thank you I owed you. Thank you!.
TinyPortal © 2005-2018