I accept that _somehow_ an i8700k manages to execute that sequence in less than half the time the logical dependencies among the instructions require, but I don't see how. I'm sure some - probably many - of the things you've mentioned come into play, but I do not see how they could be sufficient. It's as if they managed to make clairvoyant CPUs that know the result before it's even calculated. That sequence of instructions is hard ("hard" being an understatement in this particular case) to parallelize even with highly speculative execution.
I can fully understand your difficulty in accepting that there is parallelism in this. Looking at the assembler code, it seems totally infeasible. I'll try to explain where the parallelism comes into play.
The first thing you have to understand is that the assembler code is nowadays only a very shallow representation of what a CPU is actually doing.
Looking at the 5-instruction sequence in isolation, it is indeed inherently sequential - the instructions form what's called a "dependency chain". Let's assume that these five instructions represent the setting of the first variable. Then - and this is crucial for understanding - there is a completely independent chain of five instructions that sets the second variable. These two independent chains could be executed in parallel, and the sequence setting the third variable could again be executed in parallel to the first two. And so on, as you go round and round the loop. So - magically - there is already parallelism!
The immediate reaction to that is: "But wait, don't all the assembler instructions refer to RAX? How can this be parallelized then?" It can, because internally the CPU isn't using RAX at all - it uses physical registers standing in for RAX, and there can be many of them (up to 168 on a Skylake Core i7). So for the first 5 instructions it will use, let's say, RAX[1], for the second chain RAX[2], etc. This is called "register renaming", and suddenly the parallelism of the code can be exploited.
Another reaction might be: "But it has to execute one instruction after the other, otherwise there would be no chance to evaluate state like register contents, condition codes, etc." When a CPU reads an instruction sequence it tags every instruction with its position in program order. Internally it is then free to shuffle things around, right up to the point where instructions have to be committed as "executed" - this is handled by the "retirement unit" of the CPU, which puts the instructions back into the order in which they were read from the instruction stream. There can be several hundred instructions "in flight" inside a CPU core at once - and that is how the parallelism of the code is actually exploited.
Of course there are moments when the CPU has to produce a fully determined state - that is, everything up to a specific instruction in the instruction stream has been executed. This happens implicitly, e.g. when an interrupt occurs, or explicitly, e.g. with the CPUID instruction. But apart from that, during "normal" execution you cannot tell from the outside which instructions the CPU is actually working on! It's like Schrödinger's cat - only when you lift the lid do you see what state it is in.
Hope that made this a bit clearer.