> I think FPC can't inline nested procedures/functions, so a stack frame is always built on each call.

The fact that it doesn't inline is fine. The problem is that the compiler walks the stack frames more often than necessary. One walk at procedure entry is all it should need, not one per reference.
Hello,
Following the discussion in this thread, https://forum.lazarus.freepascal.org/index.php/topic,49459.0.html
I looked into FPC's "stack flattening" abilities. To make a short story really short: FPC does not "flatten" nested procedure/function stacks (that's no surprise, no one claimed it did.)
However, there is a simple optimization FPC could apply in that case. Consider the following small program:

  {$APPTYPE CONSOLE}
  program TestNestedVariableAccess;                { level 1 }

  procedure OuterProcedure();                      { level 2 }
  var
    OuterVariable1, OuterVariable2, OuterVariable3 : integer;

    procedure Level3();                            { level 3 }

      procedure Level4();                          { level 4 }

        procedure Level5();                        { level 5 }

          procedure Level6();                      { level 6 }
          begin
            writeln('Level6');
            OuterVariable1 := 1;
            OuterVariable2 := 2;
            OuterVariable3 := 3;
            writeln('OuterVariable1 : ', OuterVariable1);
            writeln('OuterVariable2 : ', OuterVariable2);
            writeln('OuterVariable3 : ', OuterVariable3);
          end;

        begin
          writeln('Level5');
          Level6();
        end;

      begin
        writeln('Level4');
        Level5();
      end;

    begin
      writeln('level 3');
      Level4();
    end;

  begin
    Level3();
  end;

  begin
    OuterProcedure();
  end.

When compiled with -O4 (64-bit, Windows target), it produces the following code (abbreviated):

  TestNestedVariableAccess.lpr:19  OuterVariable1 := 1;
  ; walk the frame pointers
  00000001000015A9 488b442420      mov 0x20(%rsp),%rax
  00000001000015AE 488b4020        mov 0x20(%rax),%rax
  00000001000015B2 488b4020        mov 0x20(%rax),%rax
  00000001000015B6 488b4020        mov 0x20(%rax),%rax
  00000001000015BA c7402001000000  movl $0x1,0x20(%rax)

  TestNestedVariableAccess.lpr:20  OuterVariable2 := 2;
  ; walk the frame pointers (again)
  00000001000015C1 488b442420      mov 0x20(%rsp),%rax
  00000001000015C6 488b4020        mov 0x20(%rax),%rax
  00000001000015CA 488b4020        mov 0x20(%rax),%rax
  00000001000015CE 488b4020        mov 0x20(%rax),%rax
  00000001000015D2 c7402802000000  movl $0x2,0x28(%rax)

  TestNestedVariableAccess.lpr:21  OuterVariable3 := 3;
  ; walk the frame pointers (and again - a road well traveled by now!)
  00000001000015D9 488b442420      mov 0x20(%rsp),%rax
  00000001000015DE 488b4020        mov 0x20(%rax),%rax
  00000001000015E2 488b4020        mov 0x20(%rax),%rax
  00000001000015E6 488b4020        mov 0x20(%rax),%rax
  00000001000015EA c7403003000000  movl $0x3,0x30(%rax)

Basically, the compiler walks the stack chain for every variable reference in that scope. It could walk it once upon entry to Level6() to obtain a pointer to the target stack frame of interest (in this case, OuterProcedure()'s stack frame) and use that pointer instead of walking the chain for every reference.
I just thought I'd mention it. It's a fairly simple optimization.
> Seems a task for CSE. Either the CSE misses it, or this is not within the scope of the CSE.

Just ran that through D; the output code is totally different.
Delphi doesn't optimize this either.
> If my memory serves me correctly, that sort of frame cache is sometimes referred to as a "display". I think I've used a (Modula-2) compiler that did this since some targets weren't as good at frame chasing as x86, and I believe I've seen it in hardware as well.

You got that right. Some machines had hardware support to keep track of the display (a rather nice feature to have in hardware.)
> I just ran it through my Delphi 4. It gives the same construction. So at least FPC is not worse than a 20-year-old compiler.

This is speculation but, I suspect that part of the reason PC-class Pascal compilers don't optimize nested stack-frame accesses is OOP. I believe (just a belief) that nested functions/procedures became less common when Pascal got "OOP-ed".
Without hardware support, walking the chain of stack frames is usually less overhead than maintaining a complete display in software (N. Wirth ran into that.) When doing it in software, which is what FPC does, it's important for the compiler to implement a limited display structure (a simple cache of selected stack-frame pointers) to avoid walking the stack chain for every variable reference. That's a simple optimization too and, depending on the number of references and the nesting depth, it can make a noticeable difference.
> http://www.cs.berkeley.edu/~culler/courses/cs252-s05/papers/burroughs.pdf (figure 5, etc.)
> http://users.monash.edu.au/~ralphk/burroughs.html (note the size of the disc drive platter halfway down that page)

Good stuff in that PDF. The HTML page was a nice walk down the binary memory lane. :)
> Looking at it from a slightly different viewpoint, it would be interesting to know how well the CPU caches the stack, and whether it could be encouraged to handle a per-thread block earmarked as the display in the most effective way possible.

Yes, that is definitely something worth knowing. To get an idea, I wrote two simple test programs.
> While there is some truth in it, it is an over-generalization.

It was speculation anyway. I really don't know if it is an over-generalization or not but, I do believe that since classes/objects' data is accessed with just one dereference, that is a disincentive for compiler implementors to optimize nested variable accesses, because the mantra has become "make it an object".
> You are comparing global vs local vars.

That was the point of the test: to measure the overhead of walking the stack chain to reach the variables.
As the nesting level increases, so does the performance loss. Walking the stack chain only once upon entry to the function/procedure would eliminate that loss. Note that it would still be a bit slower, because there would always be one (1) additional access through the cached frame pointer but, that loss would be small and almost fully independent of the nesting level. Almost fully, because the time consumed by the initial walk upon entry depends on the nesting level.
> Access to global vars is different even from accessing non-nested locals (though probably not much of a time difference). Therefore both tests would need locals: one nested, the other un-nested.

Fair enough and, I even forgot to test with local variables (tested only against globals.) That's also why the "local variables" test is missing the local variables <chuckle>. Here is the corrected test program that includes the local variables:
> Even doing the exact same test as you did, I do not get the same results: for me the factor was 2 or 2.5 instead of 17. How do you get your numbers?

I just ran the two programs, nothing else.
If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.
Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.
Hardware is an i860 @ 2.8GHz running Win7
ETA:
FPC v3.0.4
I ran the exact two versions that you posted.
For the 3rd version that I ran: I took your code and simply (by copy and paste) moved each deeply nested proc so it ended up directly nested in OuterProc. I did not keep the source, but it is easy to regenerate.
It appears that the cost of walking the stack has gone down significantly with more modern hardware. At least, that is the only difference I can spot from the descriptions.
The other possible explanation is that the code difference influences jump prediction (i.e. your CPU was worse affected by how well it predicted jumps/branches).
The CPU's ability to correctly predict branches can have a severe effect on timing and, as the test is based on a loop, this could be the cause of the time difference.
How does the timing change if you, for example, "unroll" (by hand) 50 loop iterations and run the loop "billion div 50" times?
> It appears that the cost of walking the stack has gone down significantly with more modern hardware.

I have a hard time believing that those things could account for a difference that large between the results.
I'd like you to run this little test program on your machine and report the times:
> O4:
>
>   # [67] e1^^^^^ := 1;
>   movq (%rbx),%rax
>   movq (%rax),%rax
>   movq (%rax),%rax
>   movq (%rax),%rax
>   movq $1,(%rax)
>   .Ll20:

That's the exact same code I get and it takes a tad over 17 seconds to run on my machine. I admit to being very surprised that current processors managed to improve on the i860 that much. It's an impressive improvement.
PointerDereference.pas(63,9) Error: (4007) Ordinal expression expected
My 64 bit 3.0.4 did the job.
Oh. That might be it: it's OK in 64 but not in 32 bits. Can someone confirm? It would be good to know for sure ... :-\
The limit for ordinals...? You know that.
> Remark: Int64 and QWord are considered ordinal types on 64-bit CPUs. On 32-bit CPUs they have some of the characteristics of ordinals, but they cannot be used e.g. in for loops.
Yes, that's correct. The loop counter can be a qword only when compiling for 64-bit, because qword is not considered an ordinal type on 32-bit.
> I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.

That's probably true in OOP. In structured programming, it's not unusual at all to exceed 3 levels and, aside from programming style, when there is recursion the number of levels can be very large (though they don't show in the code structure.)
> In structured programming, it's not unusual at all to exceed 3 levels.

Strongly doubt.
> Strongly doubt.

I believe you.
> Level one - the usual procedures that are most often used.

Normal stuff.
> Level two - nested procedures. A convenience of the Pascal language, which some languages lack. Allows you to hide the implementation, reducing the code's visibility, without losing clarity.

And it's not just about hiding the implementation. It is a way of letting the programmer know that those functions/procedures are not used anywhere else. That is the real purpose of nested functions/procedures: to declare and control their scope and visibility. IOW, to not create binary jigsaw puzzles (i.e., everything is a "first level" procedure.)
> If the size of the level-two procedure is large, then in principle you can use level three. But in practice it is more convenient to make several procedures at level 2; a rare exception is when there is an interaction between procedures that needs to be hidden.

It's not just about size. If a sequence of code is used _only_ in a particular function/procedure and it is used more than once, then it should be a nested procedure that is above the scope of its callers, and that rule is recursive.
> But this is usually very complex code.

Complex code has that tendency but, it can also be simple code that has an obvious logical hierarchy and code sequences in common among the hierarchy levels.
> ... and sometimes it is more useful to convert it into classes or make several first-level procedures.

As I am sure you already know, before I convert a procedure or function to a class, I will issue the ultimate conversion command: "del <sourcefilename.pas>"... result guaranteed to be clean and bug-free. Making them first-level procedures is correct only if those functions/procedures are general in nature (utilitarian, like formatting an integer and such); if they are specific and used only by one sequence of code in the program, then nesting is the correct implementation.
> Level four usually indicates very crumpled code and honestly, I've never seen it before.

Our opinions differ about that. When I see four levels, I see that the programmer made an effort to organize and structure the code. I admit, I didn't use to see it very often and, since OOP, it's become almost as rare as a unicorn with two tails.
> I have a really big game-AI FSM - a lot of code in my game which I designed with nested subroutines so they could share variables. Does this apply always, or can I just mark them as inline and that would solve it?

Marking them "inline" _might_ help. _Might_, because what it really depends on is how many scopes above the current one the referenced variable is located.
> That's the exact same code I get and it takes a tad over 17 seconds to run on my machine.

Thank you for testing it.
> When I see four levels, I see that the programmer made an effort to organize and structure the code.

You are a rare lover of complications. But I don't recommend it to others. And as for the phrase about the fourth level and well-organized code: I looked for a long time for the smiley. Not found - a difficult case.
21.183 seconds -- Above system
15.723 seconds -- Intel(R) Xeon(R) CPU E31235 at 3.20GHz
18.048 seconds -- Intel(R) Core(TM) i5-7300HQ CPU at 2.50GHz
14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4GHz]

I need to test this on some more processors and see. I don't have any newer ones to try it on as yet.

The results you get are along the lines of what I expect.
> You are a rare lover of complications.

I don't think of organization as creating complication; on the contrary.
> But I don't recommend it to others.

I don't recommend any particular level depth. What I "recommend" is for the code structure to parallel the problem structure. The number of levels is irrelevant; the structure of the problem determines the structure of the solution.
> In all those years of Pascal, I never saw a fourth level.

There are plenty of things that should be commonly seen that are either never seen or extremely rarely seen. Their absence is no indication that they are undesirable, nor is their presence an indication that they are desirable. I'm sure you've seen a few goto(s) but, I doubt you'd use that as a basis to recommend their use.
I can't put my finger on it but, executing this code sequence 3 times (for e1, e2 and e3) in under 3 seconds doesn't add up. Each of those instructions takes on the order of 3 clock cycles; times 5, that's 15; times 3 (e1, e2, e3), that's 45; times a billion, that's 45 billion. IOW, a 3.7GHz processor would be magically executing somewhere in the ballpark of 20 billion instructions per second. It doesn't add up.
  # [67] e1^^^^^ := 1;
  movq (%rbx),%rax
  movq (%rax),%rax
  movq (%rax),%rax
  movq (%rax),%rax
  movq $1,(%rax)
  .Ll20:
Nevertheless, from a modern multi-issue CPU's perspective the calculation looks completely different. There are four memory reads and one memory write per variable. Most of the CPUs mentioned so far have two read ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means the sequence can be executed in just 2 clock cycles in a continuous model. To really achieve this, several conditions must be met, especially as the code reuses RAX all the way, and meeting those conditions is what really makes the difference between Intel Core i generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some (very minor) overhead for the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 seconds on a 3.7 GHz machine.
> So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. [...] So in theory it is possible to execute this in 2 seconds on a 3.7 GHz machine.

I see what you're saying but, this is what I cannot reconcile in my mind: there are 3 such sequences, therefore 3 * 5 (instructions per sequence) = 15 instructions. One "if" at, let's say, 1 clock cycle makes it 16; one "for" index test, again at 1 clock cycle, makes it 17; and there is a jump to implement the loop, another clock cycle, for a total of 18. That's done a billion times, for a total of 18 billion cycles. At 3.7GHz, that would be 18/3.7 = 4.86 seconds.
Getting execution time to be under 4.86 seconds is conceivable but, getting it to be less than _half_ that time, that's the part I am having a hard time with.

According to https://easyperf.net/blog/2018/03/21/port-contention it seems to me that the CPU processes up to 2 instructions at a time? (I did not spend too much time reading it; I may have gotten it wrong.) Your "18 div freq" would then be 18/4.7 = 3.8. Somehow that is exactly 2 * 1.9, so somehow the CPU must have optimized it further... (though there are more than 18 asm statements)
If the loads are not independent, e.g. some of those are in the same cache line, do they still occupy a read port each?
Cache lines are 64 bytes nowadays and, in these rather parameterless procedures, everything on the stack would be quite close together.
> Current CPUs are multi-issue - that is, a single core can execute 2 (860, iirc), 3 (Core i / Nehalem) or even 4 (everything from Haswell onward) instructions per single clock.

I understand that but, that normally requires the sequence of instructions to be fairly (or totally) independent of each other. In a sequence where one instruction depends on the result of the previous one, the CPU's ability to parallelize those instructions is severely compromised, particularly when the value to be used is not a calculated value but a value retrieved from a yet-to-be-known pointer.
I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions require but, I don't see how. I'm sure some - probably many - of the things you've mentioned come into play but, I do not see how they could be sufficient. It's as if they managed to make clairvoyant CPUs that know the result before it's even calculated. That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.
And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems. When I disabled them, the performance changed dramatically. Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4GHz]
NEW: 02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4GHz]

Thank you for reporting that.
Maybe a sequence of instructions that is subject to multi-stage speculative execution could be used to determine whether the mitigations are present or absent. <evil grin>
First of all, I must say that I am surprised that AMD cores seem to be equally affected by the Spectre/Meltdown mitigations under Win10. It looks like MS activates full mitigation regardless of CPU type.
> Hope that made this a bit clearer.

In my efforts to see/understand how they managed to make that degree of parallelism occur in that sequence of instructions, I failed to thank you for your efforts to explain it. So here is the thank-you I owed you: thank you!