### Bookstore

 Computer Math and Games in Pascal (preview) Lazarus Handbook

### Author Topic: Optimization suggestion for nested variable access  (Read 4915 times)

#### MarkMLl

• Hero Member
• Posts: 1727
##### Re: Optimization suggestion for nested variable access
« Reply #45 on: May 28, 2020, 09:53:28 pm »
Your "18 div freq" would then be 18/4.7 = 3.8. Somehow that is exactly 2*1.9. So somehow the CPU must have optimized it further....
(though there are more than 18 asm statements)

But based on my reading around when the Intel uCode flaws were forced into the open, those 18+ x86 or x86_64 opcodes would have been translated internally into triads of simpler instructions, which themselves would be optimised and swapped around at the hardware level. And AIUI there are various trace buffers etc. which mean that the lowest level optimisations can evolve with time.

MarkMLl
Turbo Pascal v1 on CCP/M-86, multitasking with LAN and graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.

#### MathMan

• Full Member
• Posts: 209
##### Re: Optimization suggestion for nested variable access
« Reply #46 on: May 29, 2020, 08:07:37 am »
So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.
I see what you're saying but, this is what I cannot reconcile in my mind:

There are 3 such sequences, therefore 3 * 5 (instructions per sequence) = 15, one "if" at, let's say 1 clock cycle, that makes it 16, one "for" index test, again at 1 clock cycle, that's 17 and, there is a jump to implement the loop, another clock cycle, for a total of 18.  That's done a billion times, for a total of 18 billion cycles.  At 3.7Ghz, that would be 18/3.7 = 4.86 seconds.

Getting execution time to be under 4.86 seconds is conceivable but, getting it to be less than _half_ that time, that's the part I am having a hard time with.

Current CPU's are multi-issue - that is a single core can execute 2 (860 iirc), 3 (Core-I 2 Nehalem)) or even 4 (everything from Haswell onward) instructions per single clock. That might be the point resolving that mental barrier. Again - reaching this limit on actual code is tough, due to various internal limitations of a CPU. But it is feasable - I have coded some arithmetic functions that actually reach 3.5 instructions per clock (IPC) sustained. But I just see that Martin also explained the link to IPC.

#### MathMan

• Full Member
• Posts: 209
##### Re: Optimization suggestion for nested variable access
« Reply #47 on: May 29, 2020, 08:13:47 am »
Nevertheless, from a modern multi-issue CPU's perspective the calculation would look completely different. There are four memory reads and one memory write per variable. The most CPU's mentioned so far have two read-ports and one write port, which can be used in parallel (if enough capacity for address calculation is available). That means that the sequence can be executed in just 2 clock cycles in a continuous model. To really do this several conditions must be met, especially as the code reuses RAX all the way and meeting these conditions really makes the difference between Intel Core-I generations. So from a CPU perspective one iteration of the loop could execute in 3 (variables) * 2 (cycles) = 6 cycles. Then add some overhead (very minor) due to the "if" statement in the loop and you end up at 7-8 cycles per iteration = 7-8 billion cycles for the complete loop. So in theory it is possible to execute this in 2 sec on a 3.7 GHz machine.

If the loads are not independent, e.g. some of those are in the same cache line, do they still occupy a read port each?

cachelines have 64byte nowadays, and in these rather parameterless procedures, everything on the stack would be quite close together.

Short answer - yes. Extended answer - the ports are not directly connected to the cache. Instead they work with read/write buffers to de-/multiplex arbitrary sized register info to/from arbitrary memory addresses. It would not be reasonable to try and provide such a DEMUX network into the L1 cache lines directly. However the buffers are aligned to the basic address scheme of the cache. That is they too have a 64 byte granularity to simplify data exchange with the L1 caches.

#### 440bx

• Hero Member
• Posts: 2088
##### Re: Optimization suggestion for nested variable access
« Reply #48 on: May 29, 2020, 08:50:46 am »
Current CPU's are multi-issue - that is a single core can execute 2 (860 iirc), 3 (Core-I 2 Nehalem)) or even 4 (everything from Haswell onward) instructions per single clock.
I understand that but, that normally requires the sequence of instructions to be fairly (or totally) independent of each other.  In a sequence of instructions where one instruction depends on the result of the previous one, the CPU's ability to parallelize those instructions is severely compromised, particularly when the value to be used is not a calculated value but, a value retrieved from a yet to be known pointer.

Compilers are extremely "aware" of that limitation and it is because of it that they will very often rearrange instruction sequences whenever possible.

I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions requires but, I don't see how. I'm sure some - probably many of the things you've mentioned - come into play but, I do not see how they could be sufficient.  It's like they managed to make clairvoyant CPUs that know the result before it's even calculated.  That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.

My hat off to the Intel guys that somehow managed that feat.
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### MathMan

• Full Member
• Posts: 209
##### Re: Optimization suggestion for nested variable access
« Reply #49 on: May 29, 2020, 10:45:33 am »
I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions requires but, I don't see how. I'm sure some - probably many of the things you've mentioned - come into play but, I do not see how they could be sufficient.  It's like they managed to make clairvoyant CPUs that know the result before it's even calculated.  That sequence of instructions is hard (and "hard" is an understatement in this particular case) to parallelize even with highly speculative execution.

I can fully understand your problems with accepting that there is parallelism in this. When looking at the assembler code it seems totally infeasible. I'll try to explain where the parallelism comes into play.

The first thing you have to understand here is that the assembler code is today only a very shallow representation of what a CPU is actually doing.

When looking at the 5 instruction sequence than in fact this is inherently sequential - they form whats called a "dependency chain". Lets assume that these five instructions represent the setting of the first variable. Then, and this is crucial for understanding, there is a completely independent chain of five instructions to set the second variable. These two independent chains could be executed in parallel - and the sequence for setting the third variable could again be executed in parallel to the first two. And so on, while you are round robbing through the loop. So - magically there is already parallelism!

Immediate reaction to that is "but wait, dont't all the assembler instructions refer to RAX? How can this be parallelized then?". That is, because the CPU internally isn't using RAX at all - it uses representations of RAX and there can be many (up to 168 on a Skylake Core-I 7xxx). So for the first 5 instructions it will use let's say RAX[1], on the second RAX[2], etc. And suddenly it can exploit the parallelism of the code.

Another reaction might be "but it has to execute one instruction after the other, otherwise there would be no chance to evaluate states like register content, condition codes etc.". When a CPU reads an instruction sequence it tags every instruction in a way that states the sequence. Internally it is then free to shuffle things around until the point where instructions have to be put off as "being executed" <= this is handled in the "retirement unit" of the CPU. And the "retirement unit" then sequences instructions back to the way they were read from the instruction stream. Now there can be several 100 instructions "in flight" inside the CPU core - and thus the parallelism of the code is actually exploited.

Of course there are instances when the CPU has to revert to some determined state <= that is everything is executed up to a specific instruction in the instruction stream. This happens implicitely e.g. when an interrupt occures, or explcitely with the CPUID instruction. But beside that, during "normal" execution you can not tell from the outside which instructions the CPU is actually working on! It's like "Schrödingers Katze" - only when you lift the lid you see in what state it is.

Hope that made this a bit clearer.

#### marcov

• Global Moderator
• Hero Member
• Posts: 9009
• FPC developer.
##### Re: Optimization suggestion for nested variable access
« Reply #50 on: May 29, 2020, 02:20:47 pm »
Current CPU's are multi-issue - that is a single core can execute 2 (860 iirc), 3 (Core-I 2 Nehalem)) or even 4 (everything from Haswell onward) instructions per single clock.

6 uops even when executing from the uop cache.

#### ASBzone

• Sr. Member
• Posts: 492
##### Re: Optimization suggestion for nested variable access
« Reply #51 on: November 12, 2020, 05:38:41 am »
I need to test this on some more processors and see.   I don't have any newer ones to try it on as yet.

I finally got my hands on a Ryzen 4000 series mobile device (Zen2) -- yes, just as Zen3 arrives.

Nonetheless, I revisited this to post the following update:

Compiled on ... FPC 3.2.1-r47152  64-bit  (-O4)
Hardware ...... AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
OS ............ Windows 10 2009 x64

Runtimes of the executable on various systems:

02.882 seconds -- Above system
02.728 seconds -- AMD Ryzen 7 4800H with Radeon Graphics at 2.9Ghz
06.031 seconds -- Intel® Xeon® CPU E5530 at 2.40GHz [2400 Mhz]
14.797 seconds -- Intel® Core™ i7-4790K CPU at 4.00GHz [3991 Mhz]   **** Spectre/Meltdown mitigations enabled
20.380 seconds -- Intel(R) Core(TM) i5-2410M CPU at 2.30GHz         **** Spectre/Meltdown mitigations enabled

I notice that the performance is very different from when I had compiled it previously...

And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems.   When I disabled them, the performance changed dramatically.

Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
NEW:  02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
-ASB: https://www.BrainWaveCC.com

Lazarus v2.0.11 r64269 / FPC v3.2.1-r47846(via FpcUpDeluxe) -- Windows 64-bit install w/32-bit cross-compile
Primary System: Windows 10 Pro x64, Version 2009 (Build 19042)
Other Systems: Windows 10 Pro x64, Version 2004 (Build 19041) or greater

#### 440bx

• Hero Member
• Posts: 2088
##### Re: Optimization suggestion for nested variable access
« Reply #52 on: November 12, 2020, 07:16:01 am »
And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems.   When I disabled them, the performance changed dramatically.

Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
NEW:  02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
Thank you for reporting that.

Maybe a sequence of instructions that is subject to multi-stage speculative execution could be used to determine if the mitigations are present or absent.  <evil grin>
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

#### MathMan

• Full Member
• Posts: 209
##### Re: Optimization suggestion for nested variable access
« Reply #53 on: November 12, 2020, 01:59:32 pm »
And then I tested some more and realized that the performance gap has everything to do with the various Spectre/Meltdown mitigations installed on the systems.   When I disabled them, the performance changed dramatically.

Even for the AMD systems.

OLD: 14.207 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
NEW:  02.882 seconds -- AMD Ryzen 7 2700X Eight-Core Processor [4Ghz]
Thank you for reporting that.

Maybe a sequence of instructions that is subject to multi-stage speculative execution could be used to determine if the mitigations are present or absent.  <evil grin>

First of all I must say, that I am surprised that AMD cores seem to be equivalently affected by Spectre/Meltdown mitigation under Win10. Looks like MS is activating full mitigation regardless of CPU type.

A reasonable sequence is exactly the one that was presented in the beginning of this thread - a pointer chasing. But there might be easier ways to detect presence than by measuring execution times of pointer chases - e.g. directly querying the BIOS, or the OS. IIRC both can indicate if they have been enhanced to mitigate Spectre/Meltdown.

#### ASBzone

• Sr. Member
• Posts: 492
##### Re: Optimization suggestion for nested variable access
« Reply #54 on: November 12, 2020, 06:49:30 pm »
First of all I must say, that I am surprised that AMD cores seem to be equivalently affected by Spectre/Meltdown mitigation under Win10. Looks like MS is activating full mitigation regardless of CPU type.

Possibly.  I suspect that the software side of the mitigation is onerous from the Windows side.   The only reason I even noticed here, is that I was doing some benchmarking a couple weeks back and noticed on the new laptop that it had a significant impact in just a few areas, and took it off to test.    Then I remembered that I wanted to revisit this, but the scores were all confusing until I remembered that I had disabled the mitigation on a couple of different systems.
-ASB: https://www.BrainWaveCC.com

Lazarus v2.0.11 r64269 / FPC v3.2.1-r47846(via FpcUpDeluxe) -- Windows 64-bit install w/32-bit cross-compile
Primary System: Windows 10 Pro x64, Version 2009 (Build 19042)
Other Systems: Windows 10 Pro x64, Version 2004 (Build 19041) or greater

#### 440bx

• Hero Member
• Posts: 2088
##### Re: Optimization suggestion for nested variable access
« Reply #55 on: November 12, 2020, 09:39:04 pm »
Hope that made this a bit clearer.
In my efforts to see/understand how they managed to make that degree of parallelism occur in that sequence of instructions, I failed to thank you for your efforts to explain it. Therefore, here is the thank you I owed you. Thank you!.
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.