Author Topic: Optimization suggestion for nested variable access  (Read 3607 times)

MarkMLl

  • Hero Member
  • *****
  • Posts: 1221
Re: Optimization suggestion for nested variable access
« Reply #45 on: May 28, 2020, 09:53:28 pm »
Your "18 div freq" would then be 18/4.7 = 3.8. Somehow that is exactly 2*1.9. So somehow the CPU must have optimized it further....
(though there are more than 18 asm statements)

But based on my reading around when the Intel uCode flaws were forced into the open, those 18+ x86 or x86_64 opcodes would have been translated internally into triads of simpler instructions, which themselves would be optimised and swapped around at the hardware level. And AIUI there are various trace buffers etc. which mean that the lowest level optimisations can evolve with time.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.

MathMan

  • Full Member
  • ***
  • Posts: 205
Re: Optimization suggestion for nested variable access
« Reply #46 on: May 29, 2020, 08:07:37 am »
So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.
I see what you're saying but, this is what I cannot reconcile in my mind:

There are 3 such sequences, therefore 3 * 5 (instructions per sequence) = 15, one "if" at, let's say 1 clock cycle, that makes it 16, one "for" index test, again at 1 clock cycle, that's 17 and, there is a jump to implement the loop, another clock cycle, for a total of 18.  That's done a billion times, for a total of 18 billion cycles.  At 3.7Ghz, that would be 18/3.7 = 4.86 seconds.

Getting execution time to be under 4.86 seconds is conceivable but, getting it to be less than _half_ that time, that's the part I am having a hard time with. 

Current CPUs are multi-issue - that is, a single core can execute 2 (860, iirc), 3 (Core-I 2 Nehalem) or even 4 (everything from Haswell onward) instructions per single clock. That might be the point that resolves that mental barrier. Again - reaching this limit on actual code is tough, due to various internal limitations of a CPU. But it is feasible - I have coded some arithmetic functions that actually reach 3.5 instructions per clock (IPC) sustained. But I just see that Martin also explained the link to IPC.

MathMan

  • Full Member
  • ***
  • Posts: 205
Re: Optimization suggestion for nested variable access
« Reply #47 on: May 29, 2020, 08:13:47 am »
Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. Most of the CPUs mentioned so far have two read ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means the sequence can be executed in just 2 clock cycles in a steady-state model. To really do this several conditions must be met, especially as the code reuses RAX all the way, and meeting these conditions is really what makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

If the loads are not independent, e.g. some of those are in the same cache line, do they still occupy a read port each?

Cache lines are 64 bytes nowadays, and in these rather parameterless procedures, everything on the stack would be quite close together.

Short answer - yes. Extended answer - the ports are not directly connected to the cache. Instead they work with read/write buffers to de-/multiplex arbitrarily sized register data to/from arbitrary memory addresses. It would not be reasonable to try to provide such a DEMUX network into the L1 cache lines directly. However, the buffers are aligned to the basic address scheme of the cache. That is, they too have a 64-byte granularity to simplify data exchange with the L1 caches.

440bx

  • Hero Member
  • *****
  • Posts: 1994
Re: Optimization suggestion for nested variable access
« Reply #48 on: May 29, 2020, 08:50:46 am »
Current CPUs are multi-issue - that is, a single core can execute 2 (860, iirc), 3 (Core-I 2 Nehalem) or even 4 (everything from Haswell onward) instructions per single clock.
I understand that but, that normally requires the sequence of instructions to be fairly (or totally) independent of each other.  In a sequence of instructions where one instruction depends on the result of the previous one, the CPU's ability to parallelize those instructions is severely compromised, particularly when the value to be used is not a calculated value but, a value retrieved from a yet to be known pointer.

Compilers are extremely "aware" of that limitation and it is because of it that they will very often rearrange instruction sequences whenever possible.

I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions require but, I don't see how. I'm sure some - probably many of the things you've mentioned - come into play but, I do not see how they could be sufficient.  It's like they managed to make clairvoyant CPUs that know the result before it's even calculated.  That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.

My hat off to the Intel guys that somehow managed that feat.
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

MathMan

  • Full Member
  • ***
  • Posts: 205
Re: Optimization suggestion for nested variable access
« Reply #49 on: May 29, 2020, 10:45:33 am »
I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions require but, I don't see how. I'm sure some - probably many of the things you've mentioned - come into play but, I do not see how they could be sufficient.  It's like they managed to make clairvoyant CPUs that know the result before it's even calculated.  That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.

I can fully understand your difficulty in accepting that there is parallelism in this. When looking at the assembler code it seems totally infeasible. I'll try to explain where the parallelism comes into play.

The first thing you have to understand here is that the assembler code is today only a very shallow representation of what a CPU is actually doing.

When looking at the 5-instruction sequence, it is indeed inherently sequential - the instructions form what's called a "dependency chain". Let's assume that these five instructions represent the setting of the first variable. Then, and this is crucial for understanding, there is a completely independent chain of five instructions to set the second variable. These two independent chains could be executed in parallel - and the sequence for setting the third variable could again be executed in parallel to the first two. And so on, while you are round-robining through the loop. So - magically there is already parallelism!

The immediate reaction to that is "but wait, don't all the assembler instructions refer to RAX? How can this be parallelized then?". That is because the CPU internally isn't using RAX at all - it uses representations of RAX, and there can be many (up to 168 on a Skylake Core-I 7xxx). So for the first 5 instructions it will use, let's say, RAX[1], for the second RAX[2], etc. And suddenly it can exploit the parallelism of the code.

Another reaction might be "but it has to execute one instruction after the other, otherwise there would be no chance to evaluate states like register content, condition codes etc.". When a CPU reads an instruction sequence it tags every instruction in a way that records its position in the sequence. Internally it is then free to shuffle things around until the point where instructions have to be signed off as "executed" <= this is handled in the "retirement unit" of the CPU. And the "retirement unit" then sequences instructions back into the order in which they were read from the instruction stream. Now there can be several hundred instructions "in flight" inside the CPU core - and thus the parallelism of the code is actually exploited.

Of course there are instances when the CPU has to revert to some determined state <= that is, everything is executed up to a specific instruction in the instruction stream. This happens implicitly, e.g. when an interrupt occurs, or explicitly with the CPUID instruction. But beside that, during "normal" execution you cannot tell from the outside which instructions the CPU is actually working on! It's like Schrödinger's cat - only when you lift the lid do you see what state it is in.

Hope that made this a bit clearer.

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 8726
  • FPC developer.
Re: Optimization suggestion for nested variable access
« Reply #50 on: May 29, 2020, 02:20:47 pm »
Current CPUs are multi-issue - that is, a single core can execute 2 (860, iirc), 3 (Core-I 2 Nehalem) or even 4 (everything from Haswell onward) instructions per single clock.

6 uops even when executing from the uop cache.
