Just a few words on testing/benchmarking.
Comparing performance can (and sometimes will) be influenced by unexpected side effects.
I was once personally hit by this (in a test I did on UTF-8 processing; I don't have the code any more):
- I had 3 or 4 implementations => one of them was roughly 30 to 40% faster than the others => great.
- Then I changed something, but something that did NOT change the generated asm for any of the methods, nor for the benchmark.
  I think I changed the order in which the functions were listed in the source, or removed one of the slow variants.
- The super-fast one was suddenly slower (it was still the fastest, but by maybe only 5 to 10%).
After some research I found that my change had altered alignments. That is: the exact same asm sequence had simply moved to start at a different address in memory => and that alone changed the speed.
The exact effect may depend a lot on the exact CPU used... But on many Intel CPUs, loops (especially small loops) are optimized by a cache for the CPU-internal translation of instructions (each asm instruction is decoded into micro-ops inside the CPU). Those decoded micro-ops are cached, and that cache is sensitive to code alignment.
There may be other such "CPU internals" that can affect a benchmark.
But that means that any benchmark can be off by a mid two-digit percentage due to a change that is not the difference actually being tested.
To counter this it may be necessary to run variations of the benchmark: change compiler-set alignments, change the ordering of various elements / code blocks => changes that should not make a difference (and if they do, you know the measurement is noise-sensitive).
Also, benchmarks can use artificial data whose memory layout does not reflect real scenarios. This affects caching, and can boost one of the tested variants in a way it would never experience in real life.
The next thing (I don't know for sure, but I would guess it is worth checking) is branch prediction.
- Afaik the predictor uses a history of previous executions. Running a tight loop trains it => meaning: newly introduced extra conditional branches would be measured as taking less time than they may take in reality.
- Benchmarks may call code in ways that favor specific branch decisions much more than real-life execution does (or bundle them / make them predictable).
...