### Bookstore

 Computer Math and Games in Pascal (preview) Lazarus Handbook

### Author Topic: Optimization suggestion for nested variable access  (Read 4452 times)

#### 440bx

• Hero Member
• Posts: 2047
##### Optimization suggestion for nested variable access
« on: May 24, 2020, 02:41:13 am »
Hello,

Following the discussion in this thread, https://forum.lazarus.freepascal.org/index.php/topic,49459.0.html
I looked into FPC's "stack flattening" abilities.  To make a short story really short: FPC does not "flatten" nested procedure/function stacks (that's no surprise, no one claimed it did.)

However, there is a simple optimization in that case that FPC could do.  Consider the following small program:
Code: Pascal  [Select][+][-]
1. {\$APPTYPE CONSOLE}
2.
3. program TestNestedVariableAccess;   { level 1 }
4.
5.   procedure OuterProcedure();       { level 2 }
6.   var
7.     OuterVariable1, OuterVariable2, OuterVariable3 : integer;
8.
9.     procedure Level3();             { level 3 }
10.
11.       procedure Level4();           { level 4 }
12.
13.         procedure Level5();         { level 5 }
14.
15.           procedure Level6();       { level 6 }
16.           begin
17.             writeln('Level6');
18.
19.             OuterVariable1 := 1;
20.             OuterVariable2 := 2;
21.             OuterVariable3 := 3;
22.
23.             writeln('OuterVariable1 : ', OuterVariable1);
24.             writeln('OuterVariable2 : ', OuterVariable2);
25.             writeln('OuterVariable3 : ', OuterVariable3);
26.           end;
27.
28.         begin
29.           writeln('Level5');
30.           Level6();
31.         end;
32.
33.       begin
34.         writeln('Level4');
35.         Level5();
36.       end;
37.
38.     begin
39.       writeln('level 3');
40.       Level4();
41.     end;
42.
43.   begin
44.     Level3();
45.   end;
46.
47.
48. begin
49.   OuterProcedure();
50. end.
When compiled with -O4 (64bit, Windows target), it produces the following code (abbreviated):
Code: ASM  [Select][+][-]
1. TestNestedVariableAccess.lpr:19           OuterVariable1 := 1;
2.                                           ; walk the frame pointers
3. 00000001000015A9 488b442420               mov    0x20(%rsp),%rax
4. 00000001000015AE 488b4020                 mov    0x20(%rax),%rax
5. 00000001000015B2 488b4020                 mov    0x20(%rax),%rax
6. 00000001000015B6 488b4020                 mov    0x20(%rax),%rax
7. 00000001000015BA c7402001000000           movl   \$0x1,0x20(%rax)
8.
9. TestNestedVariableAccess.lpr:20           OuterVariable2 := 2;
10.                                           ; walk the frame pointers (again)
11. 00000001000015C1 488b442420               mov    0x20(%rsp),%rax
12. 00000001000015C6 488b4020                 mov    0x20(%rax),%rax
13. 00000001000015CA 488b4020                 mov    0x20(%rax),%rax
14. 00000001000015CE 488b4020                 mov    0x20(%rax),%rax
15. 00000001000015D2 c7402802000000           movl   \$0x2,0x28(%rax)
16.
17. TestNestedVariableAccess.lpr:21           OuterVariable3 := 3;
18.                                           ; walk the frame pointers (and again - a road well traveled by now!)
19. 00000001000015D9 488b442420               mov    0x20(%rsp),%rax
20. 00000001000015DE 488b4020                 mov    0x20(%rax),%rax
21. 00000001000015E2 488b4020                 mov    0x20(%rax),%rax
22. 00000001000015E6 488b4020                 mov    0x20(%rax),%rax
23. 00000001000015EA c7403003000000           movl   \$0x3,0x30(%rax)
24.
Basically, the compiler is walking the stack chain for every variable reference in that scope.  It could walk it once upon entry to the Level6() procedure to have a pointer to that target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the stack for every reference.

I just thought I'd mention it.  It's a fairly simple optimization.

« Last Edit: May 24, 2020, 02:44:29 am by 440bx »
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### fcu

• Jr. Member
• Posts: 68
##### Re: Optimization suggestion for nested variable access
« Reply #1 on: May 24, 2020, 07:03:42 am »
i think fpc couldn't inlined nested proc/func , so stack frame is always built in each call

#### 440bx

• Hero Member
• Posts: 2047
##### Re: Optimization suggestion for nested variable access
« Reply #2 on: May 24, 2020, 07:45:07 am »
i think fpc couldn't inlined nested proc/func , so stack frame is always built in each call
The fact that it didn't inline is fine.  The problem is that the compiler walks the stack frames more than is necessary.  One walk at procedure entry is all it should need, not one per reference.

FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### jamie

• Hero Member
• Posts: 3784
##### Re: Optimization suggestion for nested variable access
« Reply #3 on: May 24, 2020, 02:10:57 pm »
Hello,

Following the discussion in this thread, https://forum.lazarus.freepascal.org/index.php/topic,49459.0.html
I looked into FPC's "stack flattening" abilities.  To make a short story really short: FPC does not "flatten" nested procedure/function stacks (that's no surprise, no one claimed it did.)

However, there is a simple optimization in that case that FPC could do.  Consider the following small program:
Code: Pascal  [Select][+][-]
1. {\$APPTYPE CONSOLE}
2.
3. program TestNestedVariableAccess;   { level 1 }
4.
5.   procedure OuterProcedure();       { level 2 }
6.   var
7.     OuterVariable1, OuterVariable2, OuterVariable3 : integer;
8.
9.     procedure Level3();             { level 3 }
10.
11.       procedure Level4();           { level 4 }
12.
13.         procedure Level5();         { level 5 }
14.
15.           procedure Level6();       { level 6 }
16.           begin
17.             writeln('Level6');
18.
19.             OuterVariable1 := 1;
20.             OuterVariable2 := 2;
21.             OuterVariable3 := 3;
22.
23.             writeln('OuterVariable1 : ', OuterVariable1);
24.             writeln('OuterVariable2 : ', OuterVariable2);
25.             writeln('OuterVariable3 : ', OuterVariable3);
26.           end;
27.
28.         begin
29.           writeln('Level5');
30.           Level6();
31.         end;
32.
33.       begin
34.         writeln('Level4');
35.         Level5();
36.       end;
37.
38.     begin
39.       writeln('level 3');
40.       Level4();
41.     end;
42.
43.   begin
44.     Level3();
45.   end;
46.
47.
48. begin
49.   OuterProcedure();
50. end.
When compiled with -O4 (64bit, Windows target), it produces the following code (abbreviated):
Code: ASM  [Select][+][-]
1. TestNestedVariableAccess.lpr:19           OuterVariable1 := 1;
2.                                           ; walk the frame pointers
3. 00000001000015A9 488b442420               mov    0x20(%rsp),%rax
4. 00000001000015AE 488b4020                 mov    0x20(%rax),%rax
5. 00000001000015B2 488b4020                 mov    0x20(%rax),%rax
6. 00000001000015B6 488b4020                 mov    0x20(%rax),%rax
7. 00000001000015BA c7402001000000           movl   \$0x1,0x20(%rax)
8.
9. TestNestedVariableAccess.lpr:20           OuterVariable2 := 2;
10.                                           ; walk the frame pointers (again)
11. 00000001000015C1 488b442420               mov    0x20(%rsp),%rax
12. 00000001000015C6 488b4020                 mov    0x20(%rax),%rax
13. 00000001000015CA 488b4020                 mov    0x20(%rax),%rax
14. 00000001000015CE 488b4020                 mov    0x20(%rax),%rax
15. 00000001000015D2 c7402802000000           movl   \$0x2,0x28(%rax)
16.
17. TestNestedVariableAccess.lpr:21           OuterVariable3 := 3;
18.                                           ; walk the frame pointers (and again - a road well traveled by now!)
19. 00000001000015D9 488b442420               mov    0x20(%rsp),%rax
20. 00000001000015DE 488b4020                 mov    0x20(%rax),%rax
21. 00000001000015E2 488b4020                 mov    0x20(%rax),%rax
22. 00000001000015E6 488b4020                 mov    0x20(%rax),%rax
23. 00000001000015EA c7403003000000           movl   \$0x3,0x30(%rax)
24.
Basically, the compiler is walking the stack chain for every variable reference in that scope.  It could walk it once upon entry to the Level6() procedure to have a pointer to that target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the stack for every reference.

I just thought I'd mention it.  It's a fairly simple optimization.

Looks like the compiler is putting the CPU on an exercise program.

Why does it repeat the same code over and over... ?

This looks like the example I presented with multiple variables being initialized.

Most likely the compiler would generate different code if those variables were actually, but who knows..

Responses should be interesting

The only true wisdom is knowing you know nothing

#### marcov

• Global Moderator
• Hero Member
• Posts: 8894
• FPC developer.
##### Re: Optimization suggestion for nested variable access
« Reply #4 on: May 24, 2020, 03:13:46 pm »
Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Delphi doesn't optimize this either.

#### jamie

• Hero Member
• Posts: 3784
##### Re: Optimization suggestion for nested variable access
« Reply #5 on: May 24, 2020, 04:36:54 pm »
Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Delphi doesn't optimize this either.
Just ran that through D, the output code is totally different.
The only true wisdom is knowing you know nothing

#### MarkMLl

• Hero Member
• Posts: 1447
##### Re: Optimization suggestion for nested variable access
« Reply #6 on: May 24, 2020, 07:38:33 pm »
It could walk it once upon entry to the Level6() procedure to have a pointer to that target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the stack for every reference.

I just thought I'd mention it.  It's a fairly simple optimization.

If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.

#### BeniBela

• Hero Member
• Posts: 763
##### Re: Optimization suggestion for nested variable access
« Reply #7 on: May 24, 2020, 07:58:45 pm »
Fpc always sucks at optimization

Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Delphi doesn't optimize this either.
Just ran that through D, the output code is totally different.

I just ran it through my Delphi 4

It gives the same construction

So at least fpc is not worse than a 20 year old compiler

#### 440bx

• Hero Member
• Posts: 2047
##### Re: Optimization suggestion for nested variable access
« Reply #8 on: May 24, 2020, 08:23:45 pm »
If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.
You got that right.  Some machines had hardware support to keep track of the display (a rather nice feature to have in hardware.)

Without hardware support, walking the chain of stack frames is usually less overhead than maintaining a complete display in software (N. Wirth ran into that.)  When doing it in software, which is what FPC does, it's important for the compiler to implement a limited display structure (a simple cache of selected stack frame pointers) to prevent having to walk the stack chain for every variable reference.  That's a simple optimization too and, depending on the number of references and the nesting depth, it can make a noticeable difference.

I just ran it through my Delphi 4
It gives the same construction
So at least fpc is not worse than a 20 year old compiler
This is speculation but, I suspect that part of the reason that PC class Pascal compilers don't optimize nested stack frame accesses is because of OOP.    I believe (just a belief) that nested functions/procedures became less common when Pascal got "OOP-ed".
« Last Edit: May 24, 2020, 08:25:18 pm by 440bx »
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### MarkMLl

• Hero Member
• Posts: 1447
##### Re: Optimization suggestion for nested variable access
« Reply #9 on: May 24, 2020, 10:13:34 pm »
If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.
You got that right.  Some machines had hardware support to keep track of the display (a rather nice feature to have in hardware.)

If my memory is correct, I've seen one with a 6' square of LEDs showing the display state. I think the idea was that since it changed somewhat more slowly than the actual CPU registers etc. it was useful eyecandy.

http://www.cs.berkeley.edu/~culler/courses/cs252-s05/papers/burroughs.pdf figure 5 etc.

http://users.monash.edu.au/~ralphk/burroughs.html (note the size of the disc drive platter halfway down that page)

Quote
Without hardware support, walking the chain of stack frames is usually less overhead than maintaining a complete display in software (N. Wirth ran into that.)  When doing it in software, which is what FPC does, it's important for the compiler to implement a limited display structure (a simple cache of selected stack frame pointers) to prevent having to walk the stack chain for every variable reference.  That's a simple optimization too and, depending on the number of references and the nesting depth, it can make a noticeable difference.

Looking at it from a slightly different viewpoint, it would be interesting to know how well the CPU caches the stack, and whether it could be encouraged to handle a per-thread block earmarked as the  display in the most effective way possible.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.

#### marcov

• Global Moderator
• Hero Member
• Posts: 8894
• FPC developer.
##### Re: Optimization suggestion for nested variable access
« Reply #10 on: May 24, 2020, 10:47:27 pm »
This is speculation but, I suspect that part of the reason that PC class Pascal compilers don't optimize nested stack frame accesses is because of OOP.    I believe (just a belief) that nested functions/procedures became less common when Pascal got "OOP-ed".

Well. Delphi maybe (*), but FPC existed for years as non-OOP compiler. And D is also OOP afaik, so why doesn't it hold there?

While there is some truth in it, it is an overgeneralization

(*) one should wonder however what the 32-bit Delphi compiler (backend) was based on though. It might predate the Delphi dialect, technology wise.

#### 440bx

• Hero Member
• Posts: 2047
##### Re: Optimization suggestion for nested variable access
« Reply #11 on: May 25, 2020, 12:18:23 am »
http://www.cs.berkeley.edu/~culler/courses/cs252-s05/papers/burroughs.pdf figure 5 etc.

http://users.monash.edu.au/~ralphk/burroughs.html (note the size of the disc drive platter halfway down that page)
Good stuff in that pdf.  The HTML page was a nice walk down the binary memory lane.

Looking at it from a slightly different viewpoint, it would be interesting to know how well the CPU caches the stack, and whether it could be encouraged to handle a per-thread block earmarked as the  display in the most effective way possible.
Yes, that is definitely something worth knowing.  To get an idea, I wrote the following two simple test programs:

Simple if (no nested accesses):
Code: Pascal  [Select][+][-]
1. {\$APPTYPE CONSOLE}
2.
3. program TestSimpleIf;
4.   { program to test the time consumed to test against some limits             }
5.
6.   const
7.     MILLION1   =   1000000;
8.     MILLION10  =  10000000;
9.     MILLION100 = 100000000;
10.
11.     BILLION    = 1000 * MILLION1;
12.
13.     LOOP_LIMIT = BILLION;
14.
15.   var
16.     VariableLimit                   : qword;
17.
18.     Variable1, Variable2, Variable3 : qword;
19.
20.   procedure TestIf();
21.   var
22.     i : qword;
23.
24.   begin
25.     writeln('constant limit, global variables - Loop STARTING');
26.
27.     for i := 1 to LOOP_LIMIT do
28.     begin
29.       Variable1 := i;
30.       Variable2 := i;
31.       Variable3 := i;
32.
33.       if i >= LOOP_LIMIT then
34.       begin
35.
36.         writeln('constant limit, global variables - Loop DONE ', i);
37.         writeln;
38.       end;
39.     end;
40.   end;
41.
42.   procedure TestIfVariableLimit();
43.   var
44.     i : qword;
45.
46.   begin
47.     writeln('Variable limit, global variables - Loop STARTING');
48.
49.     for i := 1 to VariableLimit do
50.     begin
51.       Variable1 := i;
52.       Variable2 := i;
53.       Variable3 := i;
54.
55.       if i >= VariableLimit then
56.       begin
57.         writeln('Variable limit, global variables - Loop DONE ', i);
58.         writeln;
59.       end;
60.     end;
61.   end;
62.
63.   procedure TestIfVariableLimitLocal();
64.   var
65.     i : qword;
66.
67.   begin
68.     writeln('Variable limit, local variables - Loop STARTING');
69.
70.     for i := 1 to VariableLimit do
71.     begin
72.       Variable1 := i;
73.       Variable2 := i;
74.       Variable3 := i;
75.
76.       if i >= VariableLimit then
77.       begin
78.         writeln('Variable limit, local variables - Loop DONE ', i);
79.         writeln;
80.       end;
81.     end;
82.   end;
83.
84.
85. begin
86.   VariableLimit := LOOP_LIMIT;
87.
88.   TestIfVariableLimit();
89. end.
90.
For a billion iterations, that program runs in just a tad over a second (note: that includes the load time).

Code: Pascal  [Select][+][-]
1. {\$APPTYPE CONSOLE}
2.
3. program TestNestedIf;
4.   { program to test the time consumed to test against some limits             }
5.
6.   const
7.     MILLION1      =   1000000;
8.     MILLION10     =  10000000;
9.     MILLION100    = 100000000;
10.
11.     BILLION       = 1000 * MILLION1;
12.     BILLION5      =    5 * BILLION;
13.
14.     LOOP_LIMIT    = BILLION;
15.
16.   var
17.     VariableLimit : qword;
18.
19.   procedure OuterProcedure();                            { level 2 }
20.   var
21.     OuterVariable1, OuterVariable2, OuterVariable3 : qword;
22.
23.
24.
25.     procedure Level3();                                  { level 3 }
26.
27.       procedure Level4();                                { level 4 }
28.
29.         procedure Level5();                              { level 5 }
30.
31.           procedure TestIfVariableLimit();               { level 6 }
32.           var
33.             { FPC doesn't accept a nested variable as a            }
34.             { "for" counter                                        }
35.
36.             i : qword;
37.
38.           begin
39.             writeln('Variable limit - Loop STARTING');
40.
41.             for i := 1 to VariableLimit do
42.             begin
43.               OuterVariable1 := i;
44.               OuterVariable2 := i;
45.               OuterVariable3 := i;
46.
47.               if i >= VariableLimit then
48.               begin
49.                 writeln('Variable limit - Loop DONE ', i);
50.                 writeln;
51.               end;
52.             end;
53.           end;
54.
55.           procedure TestIf();                            { level 6 }
56.           var
57.             i : qword;
58.
59.           begin
60.             writeln('Loop STARTING');
61.
62.             for i := 1 to LOOP_LIMIT do
63.             begin
64.               OuterVariable1 := i;
65.               OuterVariable2 := i;
66.               OuterVariable3 := i;
67.
68.               if i >= LOOP_LIMIT then
69.               begin
70.                 writeln('Loop DONE ', i);
71.                 writeln;
72.               end;
73.             end;
74.           end;
75.
76.           procedure Level6();                            { level 6 }
77.           begin
78.             writeln('Level 6');
79.             TestIfVariableLimit();
80.           end;
81.
82.         begin
83.           writeln('Level 5');
84.           Level6();
85.         end;
86.
87.       begin
88.         writeln('Level 4');
89.         Level5();
90.       end;
91.
92.     begin
93.       writeln('level 3');
94.       Level4();
95.     end;
96.
97.   begin
98.     Level3();
99.   end;
100.
101.
102. begin
103.   VariableLimit := LOOP_LIMIT;
104.
105.   OuterProcedure();
106. end.
107.
This one takes a little over 17 seconds.  IOW, it is about 17 times _slower_ than the program that has direct access to the variables it sets.  Walking the stack frame chain is expensive!.

NOTE: I use TCCLE which can time execution from the command line.  For this reason there is no timing mechanism in the program itself.

While there is some truth in it, it is an overgeneralization
It was speculation anyway.  I really don't know if it is an over-generalization or not but, I do believe that since classes/objects' data is accessed with just one de-reference, that is a disincentive for the compiler implementors to optimize nested variables accesses because the mantra has become "make it an object".

ETA:
The original test program from which those two are derived was meant to test how much time an "if" test took.  That's why the "if" is in the loop instead of outside.

« Last Edit: May 25, 2020, 12:22:31 am by 440bx »
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### Martin_fr

• Hero Member
• Posts: 6708
• Debugger - SynEdit - and more
##### Re: Optimization suggestion for nested variable access
« Reply #12 on: May 25, 2020, 01:32:52 am »
You are comparing global vs local vars.

Rewrite the test.

Case 2: all nested are nested level 1 directly in OuterProcedure

That means in deep nested: ("TEstIf")
Code: Pascal  [Select][+][-]
1. # [67] OuterVariable1 := i;
2.         movq    -8(%rbp),%rax
3.         movq    -8(%rax),%rax
4.         movq    -8(%rax),%rax
5.         movq    -8(%rax),%rax
6.         movq    -16(%rbp),%rdx
7.         movq    %rdx,-8(%rax)
8.

And in flat nested
Code: Pascal  [Select][+][-]
1. # [171] OuterVariable1 := i;
2.         movq    -8(%rbp),%rax
3.         movq    -16(%rbp),%rdx
4.         movq    %rdx,-8(%rax)
5.

reducing the "movq   -8(%rbp),%rax" to one quarter.

FPC 3.3.1  / 64 bit windows
deep nest / flat nest
-O1       2.6 seconds / 1.7 seconds
-O4       1.9               /  0.7

FPC 3.0.4  / 64 bit windows
-O1       2.6 seconds / 1.8 seconds

Still a difference.
It about  half as fast (factor 2.5 with O4). Rather that factor ~17.
Of course it keeps one level of stack walking. But even if that was eliminated,  I doubt that makes factor 10 ?

No idea why you get such different numbers

Your un-nested (global var), on my machine
FPC 3.0.4  / 64 bit windows
-O1      1.8
-O4      0.7

So that is about the same speed as all procs with 1 level nested.

- all tests done with BILLION
- intel I7 8700K
- timings by gettickcount64
- all runs done twice or more, to spot any exceptions (none found)

« Last Edit: May 25, 2020, 01:36:24 am by Martin_fr »

#### 440bx

• Hero Member
• Posts: 2047
##### Re: Optimization suggestion for nested variable access
« Reply #13 on: May 25, 2020, 01:48:38 am »
You are comparing global vs local vars.
That was the point of the test.  To measure the overhead of walking the stack chain to reach the variables.

As the nesting level increases, so does the performance loss.  Walking the stack chain only once upon entry to the function/procedure would eliminate that loss.  Note, it would be still be a bit slower because there would always be one (1) additional access through the cached frame pointer but, that loss would be small and, it would be almost fully independent of the nesting level.  Almost fully, because the time consumed by the initial walk upon entry depends on the nesting level.

FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### Martin_fr

• Hero Member
• Posts: 6708
• Debugger - SynEdit - and more
##### Re: Optimization suggestion for nested variable access
« Reply #14 on: May 25, 2020, 02:21:06 am »
You are comparing global vs local vars.
That was the point of the test.  To measure the overhead of walking the stack chain to reach the variables.

As the nesting level increases, so does the performance loss.  Walking the stack chain only once upon entry to the function/procedure would eliminate that loss.  Note, it would be still be a bit slower because there would always be one (1) additional access through the cached frame pointer but, that loss would be small and, it would be almost fully independent of the nesting level.  Almost fully, because the time consumed by the initial walk upon entry depends on the nesting level.

Well, yes I wasn't 100% correct either.

Access to global vars is different even to accessing none-nested locals (though probably not much of a time diff).
Therefore both tests would need locals. One would need nested, and the other un-nested.

But missing the other point.
Even doing the exact same test as you did, I do not get the same results: For me the factor was 2 or 2.5 instead of 17.