Recent

Author Topic: Optimization suggestion for nested variable access  (Read 9700 times)

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Optimization suggestion for nested variable access
« Reply #30 on: May 27, 2020, 12:06:19 am »
Strongly doubt.
I believe you.

Level one - the usual procedures that are most often used.
normal stuff.

Level two - nested procedure. The convenience of the pascal language, which some languages lack. Allows you to hide the implementation, reducing the code's visibility, without losing clarity.
and it's not just about hiding the implementation.  It is a way of letting the programmer know that those functions/procedures are not used anywhere else.  That is real purpose of nested functions/procedures: to declare and control their scope and visibility.  IOW, not create binary jigsaw puzzles (i.e, everything is a "first level" procedure.)

If the size of the level two procedure is large, then in principle you can use level three. But in practice, it is more convenient to do several procedures at level 2 with rare exception is when there is an interaction between procedures that needs to be hidden.
It's not just about size.  If a sequence of code is used _only_ in a particular function/procedure and it is used more than once then, it should be a nested procedure that is above the scope of its callers and, that rule is recursive.

But this is usually very complex code
Complex code has that tendency but, it can also be simple code that has an obvious logical hierarchy and code sequences in common among the hierarchy levels.

and sometimes it is more useful to convert it into classes or make several first-level procedures.
As I am sure you already know, before I convert a procedure or function to a class, I will issue the ultimate conversion command: "del <sourcefilename.pas>... result guaranteed to be clean and bug-free.  Making them first level procedures is correct only if those functions/procedures are general in nature (utilitarian, like formatting an integer and stuff like that), if they are specific and used only by one sequence of code in the program then nesting is the correct implementation.

Level four usually talks about a very crumpled code and honestly, I've never seen this before.
Our opinions differ about that.  When I see four levels, I see that the programmer made an effort to organize and structure the code.  I admit, I didn't use to see it very often and since OOP, it's become almost as rare as a unicorn with two tails.
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

JernejL

  • Jr. Member
  • **
  • Posts: 92
Re: Optimization suggestion for nested variable access
« Reply #31 on: May 27, 2020, 06:23:06 am »
I have a really big game AI FSM machine - a lot of code in my game which i designed with nested subroutines so they could share variables.
Does this apply always, or can i just mark them as inline and that would solve it?
 

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Optimization suggestion for nested variable access
« Reply #32 on: May 27, 2020, 07:14:54 am »
Does this apply always, or can i just mark them as inline and that would solve it?
Marking them "inline" _might_ help.  _Might_ because what it really depends on is how many scopes above the current one the referenced variable is located. 

After seeing how much more efficient new processors are at walking the stack chain than older processors, I don't think it's justified to change the code in any way to attempt a "manual" optimization.  It would still be nice if the compiler did the optimization but, on newer processors that optimization isn't nearly as important as it used to be.

I use nested functions/procedures extensively and, if I inline a function/procedure, it's only because it is used only once, not because it might shorten the stack frames chain.


(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
Re: Optimization suggestion for nested variable access
« Reply #33 on: May 27, 2020, 09:31:03 am »
I have a really big game AI FSM machine - a lot of code in my game which i designed with nested subroutines so they could share variables.
Does this apply always, or can i just mark them as inline and that would solve it?

If you have simple functions that for example simply do a calculation based on some variables of the parent frame then you might benefit from inlining. You'll need to measure it to see whether the performance gain outweighs the increase in binary size.

ASBzone

  • Hero Member
  • *****
  • Posts: 678
  • Automation leads to relaxation...
    • Free Console Utilities for Windows (and a few for Linux) from BrainWaveCC
Re: Optimization suggestion for nested variable access
« Reply #34 on: May 27, 2020, 01:49:40 pm »
That's the exact same code I get and it takes a tad over 17 seconds to run on my machine.  I admit to being very surprised that current processors managed to improve on the i860 that much.  It's an impressive improvement.

Thank you for testing it.

Interestingly enough, here's what I get:

Compiled on ... FPC 3.2.0-beta-r45317  64-bit  (-O4)
Hardware ...... Intel(R) Core(TM) i5-2410M CPU at 2.30GHz
OS ............ Windows 10 1909 x64

Runtimes of the executable on various systems:

21.183 seconds -- Above system
15.723 seconds -- Intel(R) Xeon(R) CPU E31235 at 3.20GHz
18.048 seconds -- Intel(R) Core(TM) i5-7300HQ CPU at 2.50GHz
14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]

Of course, you do know that Intel is paying for *some* of that branch prediction advantage by way of security issues... :)

I need to test this on some more processors and see.   I don't have any newer ones to try it on as yet.
-ASB: https://www.BrainWaveCC.com/

Lazarus v2.2.7-ada7a90186 / FPC v3.2.3-706-gaadb53e72c
(Windows 64-bit install w/Win32 and Linux/Arm cross-compiles via FpcUpDeluxe on both instances)

My Systems: Windows 10/11 Pro x64 (Current)

ASerge

  • Hero Member
  • *****
  • Posts: 2222
Re: Optimization suggestion for nested variable access
« Reply #35 on: May 27, 2020, 07:43:36 pm »
When I see four levels, I see that the programmer made an effort to organize and structure the code.
You are a rare lover of complications. But I don't recommend it to others. And the phrase about the fourth level and well-organized code - I was looking for a long time, where is the smile. Not found - a difficult case.

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Optimization suggestion for nested variable access
« Reply #36 on: May 27, 2020, 08:16:03 pm »
Hi!

In all those  years of Pascal:  I never saw a fourth level.
And I never used it .

So the fourth level seems to be a unicorn.

Winni

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Optimization suggestion for nested variable access
« Reply #37 on: May 27, 2020, 08:42:31 pm »
21.183 seconds -- Above system
15.723 seconds -- Intel(R) Xeon(R) CPU E31235 at 3.20GHz
18.048 seconds -- Intel(R) Core(TM) i5-7300HQ CPU at 2.50GHz
14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]

I need to test this on some more processors and see.   I don't have any newer ones to try it on as yet.
The results you get are along the lines of what I expect. 

I can't put my finger on it but, executing this code sequence
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
3 times (for e1, e2 and e3) in under 3 seconds doesn't add up.  Each of those instructions take on the order of 3 clock cycles, times 5, that's 15, times 3 (e1, e2, e3), that's 45, times a billion, that's 45 billlon.  IOW, a 3.7Ghz processor is magically executing somewhere in the ballpark of 20 billion instructions per second.  It doesn't add up.



You are a rare lover of complications.
I don't think of organization as creating complication, on the contrary.

But I don't recommend it to others.
I don't recommend any particular level depth.  What I "recommend" is for the code structure to be parallel to the problem structure.  The number of levels is irrelevant, the structure of the problem determines the structure of the solution.



In all those  years of Pascal:  I never saw a fourth level.
There are plenty of things that should be commonly seen that are either, never seen or, extremely rarely seen.  Their absence is no indication they are undesirable or their presence an indication of desirability.  I'm sure you've seen a few goto(s) but, I doubt you'd use that as a basis to recommend their use.
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Optimization suggestion for nested variable access
« Reply #38 on: May 27, 2020, 10:26:48 pm »
I think that in real applications more than the 3rd level of nesting is not used. And for bottlenecks, level 1 is usually used. Therefore, no one was interested in this type of optimization.

As said, some of the optimizations are not level dependent. Caching an used parent frame in a register is not dependent on nesting depth.

But yeah, somebody that cares has to dive into such things (  ;) 440bx)

ASBzone

  • Hero Member
  • *****
  • Posts: 678
  • Automation leads to relaxation...
    • Free Console Utilities for Windows (and a few for Linux) from BrainWaveCC
Re: Optimization suggestion for nested variable access
« Reply #39 on: May 28, 2020, 04:30:46 am »

I can't put my finger on it but, executing this code sequence
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
3 times (for e1, e2 and e3) in under 3 seconds doesn't add up.  Each of those instructions take on the order of 3 clock cycles, times 5, that's 15, times 3 (e1, e2, e3), that's 45, times a billion, that's 45 billlon.  IOW, a 3.7Ghz processor is magically executing somewhere in the ballpark of 20 billion instructions per second.  It doesn't add up.

Well, I have no 9th gen Intel here to test with, nor a Zen2 based Ryzen to test with -- yet.

We'll see when I get that. 
-ASB: https://www.BrainWaveCC.com/

Lazarus v2.2.7-ada7a90186 / FPC v3.2.3-706-gaadb53e72c
(Windows 64-bit install w/Win32 and Linux/Arm cross-compiles via FpcUpDeluxe on both instances)

My Systems: Windows 10/11 Pro x64 (Current)

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Optimization suggestion for nested variable access
« Reply #40 on: May 28, 2020, 02:08:33 pm »
I can't put my finger on it but, executing this code sequence
Code: Pascal  [Select][+][-]
  1. # [67] e1^^^^^ := 1;
  2.         movq    (%rbx),%rax
  3.         movq    (%rax),%rax
  4.         movq    (%rax),%rax
  5.         movq    (%rax),%rax
  6.         movq    $1,(%rax)
  7. .Ll20:
  8.  
3 times (for e1, e2 and e3) in under 3 seconds doesn't add up.  Each of those instructions take on the order of 3 clock cycles, times 5, that's 15, times 3 (e1, e2, e3), that's 45, times a billion, that's 45 billlon.  IOW, a 3.7Ghz processor is magically executing somewhere in the ballpark of 20 billion instructions per second.  It doesn't add up.

Caveat - the following is just a rough breakdown, to really understand the differences between the cores/timings presented so far one would have to analyse processor performance counters.

Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. The most CPU's mentioned so far have two read-ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means that the sequence can be executed in just 2 clock cycles in a continuous model. To really do this several conditions must be met, especially as the code reuses RAX all the way and meeting these conditions really makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

Cheers,
MathMan

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Optimization suggestion for nested variable access
« Reply #41 on: May 28, 2020, 05:16:44 pm »
Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. The most CPU's mentioned so far have two read-ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means that the sequence can be executed in just 2 clock cycles in a continuous model. To really do this several conditions must be met, especially as the code reuses RAX all the way and meeting these conditions really makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

If the loads are not independent, e.g. some of those are in the same cache line, do they still occupy a read port each?

cachelines have 64byte nowadays, and in these rather parameterless procedures, everything on the stack would be quite close together.

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Optimization suggestion for nested variable access
« Reply #42 on: May 28, 2020, 08:32:13 pm »
So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.
I see what you're saying but, this is what I cannot reconcile in my mind:

There are 3 such sequences, therefore 3 * 5 (instructions per sequence) = 15, one "if" at, let's say 1 clock cycle, that makes it 16, one "for" index test, again at 1 clock cycle, that's 17 and, there is a jump to implement the loop, another clock cycle, for a total of 18.  That's done a billion times, for a total of 18 billion cycles.  At 3.7Ghz, that would be 18/3.7 = 4.86 seconds.

Getting execution time to be under 4.86 seconds is conceivable but, getting it to be less than _half_ that time, that's the part I am having a hard time with. 
 
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9791
  • Debugger - SynEdit - and more
    • wiki
Re: Optimization suggestion for nested variable access
« Reply #43 on: May 28, 2020, 09:27:03 pm »
Ok, so below is the generated asm by fpc 3.3.1 (approx 1 or 2 month old), with -O4.

1 Billion loops.
This takes 1.937 secs. (tickcount64)

while it still does not help the result of your calc, an I7 8700K runs at up to 4.7 GHz (since the code is not multithreaded.).
I did not monitor, and as other software is running in the background it may have been at 4.6 or 4.5.

Your "18 div freq" would then be 18/4.7 = 3.8. Somehow that is exactly 2*1.9. So somehow the CPU must have optimized it further....
(though there are more than 18 asm statements)

According to https://easyperf.net/blog/2018/03/21/port-contention it seems to me, that the CPU kind of processes up to 2 instructions per time? (I did not spent to much time reading it, may have gotten it wrong)


Code: ASM  [Select][+][-]
  1. # [43] for i := 1 to VariableLimit do
  2.         movq    U_$P$TESTNESTEDIF_$$_VARIABLELIMIT(%rip),%rdi
  3.         cmpq    $1,%rdi
  4.         jnae    .Lj23
  5. # Peephole Optimization: xorq %rsi,%rsi -> xorl %esi,%esi (removes REX prefix)
  6.         xorl    %esi,%esi
  7.         .p2align 4,,10
  8.         .p2align 3
  9. .Lj24:
  10.         addq    $1,%rsi
  11. .Ll40:
  12. # [45] OuterVariable1 := i;
  13.         movq    32(%rsp),%rax
  14.         movq    32(%rax),%rax
  15.         movq    32(%rax),%rax
  16.         movq    32(%rax),%rax
  17.         movq    %rsi,32(%rax)
  18. .Ll41:
  19. # [46] OuterVariable2 := i;
  20.         movq    32(%rsp),%rax
  21.         movq    32(%rax),%rax
  22.         movq    32(%rax),%rax
  23.         movq    32(%rax),%rax
  24.         movq    %rsi,40(%rax)
  25. .Ll42:
  26. # [47] OuterVariable3 := i;
  27.         movq    32(%rsp),%rax
  28.         movq    32(%rax),%rax
  29.         movq    32(%rax),%rax
  30.         movq    32(%rax),%rax
  31.         movq    %rsi,48(%rax)
  32. .Ll43:
  33. # [49] if i >= VariableLimit then
  34.         cmpq    U_$P$TESTNESTEDIF_$$_VARIABLELIMIT(%rip),%rsi
  35.         jnae    .Lj28
  36. .Ll44:
  37. # [51] writeln('Variable limit - Loop DONE ', i);
  38.         call    fpc_get_output
  39.         movq    %rax,%rbx
  40.         leaq    _$TESTNESTEDIF$_Ld8(%rip),%r8
  41. # Peephole Optimization: %rbx = %rax; changed to minimise pipeline stall (MovMov2Mov 6a}
  42.         movq    %rax,%rdx
  43.         xorl    %ecx,%ecx
  44.         call    fpc_write_text_shortstr
  45.         call    fpc_iocheck
  46.         movq    %rsi,%r8
  47.         movq    %rbx,%rdx
  48.         xorl    %ecx,%ecx
  49.         call    fpc_write_text_uint
  50.         call    fpc_iocheck
  51.         movq    %rbx,%rcx
  52.         call    fpc_writeln_end
  53.         call    fpc_iocheck
  54. .Ll45:
  55. # [52] writeln;
  56.         call    fpc_get_output
  57.         movq    %rax,%rcx
  58.         call    fpc_writeln_end
  59.         call    fpc_iocheck
  60. .Lj28:
  61. .Ll46:
  62.         cmpq    %rdi,%rsi
  63.         jnae    .Lj24
  64. .Lj23:
  65. .Ll47:
  66. # [55] end;
  67.  
« Last Edit: May 28, 2020, 09:29:00 pm by Martin_fr »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9791
  • Debugger - SynEdit - and more
    • wiki
Re: Optimization suggestion for nested variable access
« Reply #44 on: May 28, 2020, 09:49:35 pm »
According to https://easyperf.net/blog/2018/03/21/port-contention it seems to me, that the CPU kind of processes up to 2 instructions per time? (I did not spent to much time reading it, may have gotten it wrong)

So according to https://en.wikipedia.org/wiki/Instructions_per_second

Look for the 8086 => afaik same chip, just silicon lottery selection for higher frequency:  7.39 IPC (for a single core).

IPC is per cycle, so frequency does not matter.
However that single core still has hyper threading.  So running only one thread, and my I7 8700k should do 3.69 IPC. 
(not exact, since hyper-threading would be slower, than 2 full real cores / so the value may be bigger for single thread)


 

TinyPortal © 2005-2018