If you post the code you're using, maybe I'll get results similar to yours and it would give a clue as to why your results are different (proportionally) than mine.
Both programs compiled with -O4 for Windows 64 bit, "smaller rather than faster" _un_checked.
Hardware is an i860 @ 2.8Ghz running Win7
ETA:
FPC v3.0.4
I run the 2 exact versions that you have posted.
For the 3rd version that I run: I took your code, and simply (by copy and paste) moved each deeply nested proc, so it ended up directly nested into OuterProc. I did not keep the source, but it is easy to generate.
It appears that the cost for walking the stack has significantly gone down with more modern hardware. At least that is the only diff I can spot from the descriptions.
The other possible explanation could be that the code difference influences jump prediction (i.e. your cpu was worst affected by how well it predicted jump/branchen).
The cpu's ability to correctly forecast branching, can have severe effect on the timing. And as the test is based on a loop, this could be the cause for the time difference.
How does the timing change, if you for example "unroll" (by hand) 50 loop iterations, and run the loop "billion div 50" times ?