@marcov:
Your test made me unsure whether I had been tricked by a false memory about the issue in general, so I re-read the official Intel assembler optimization guide. The issue does exist, as I remembered: it is mentioned as the "dense RMW issue" for the Sandy Bridge architecture (and earlier). The point is that the Loop Stream Detector has problems with the number of micro-ops generated by these instruction types in dense loops.
It was, however, lifted (if I understood the docs correctly; they are a bit vague in that respect) at least from Haswell onwards, maybe even with Ivy Bridge.
@Okoba:
Thanks for testing again. What puzzles me is that the Lazarus default release mode generates faster code than e.g. -O4 -OoREGVAR. Did you compile with e.g. range checks enabled when not in the Lazarus default release mode?
@BrunoK:
Could it be that you are testing on a laptop? If so, benchmarks can vary a lot unless they are done with extreme precaution, like fixing the clock, binding to a specific core, running extensive initial warm-up code, etc.
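To illustrate what I mean by core binding and warm-up, here is a rough sketch in Python (not FPC, it only shows the procedure). The names pin_to_core, bench and workload are made up for this example, and workload is just a stand-in for the code under test. Core binding uses os.sched_setaffinity, which is only available on some platforms such as Linux, so the sketch falls back gracefully elsewhere. Fixing the clock is not covered here; that has to be done in the OS or BIOS settings.

```python
import os
import time


def pin_to_core(core=0):
    """Try to bind this process to a single core; return True on success."""
    # os.sched_setaffinity only exists on some platforms (e.g. Linux).
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {core})  # pid 0 means the calling process
        return True
    except OSError:
        # The requested core may not be available to this process.
        return False


def bench(func, warmup=5, repeats=20):
    """Run func `warmup` times untimed, then return the best of
    `repeats` timed runs, in seconds."""
    for _ in range(warmup):
        func()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def workload():
    # Hypothetical stand-in for the code being benchmarked.
    total = 0
    for i in range(100_000):
        total += i * i
    return total


if __name__ == "__main__":
    pinned = pin_to_core(0)
    print("pinned:", pinned, "best time (s):", bench(workload))
```

Taking the minimum over several runs rather than the average is a design choice on my part: it tends to filter out disturbances from other processes and clock ramp-up, which is exactly what makes laptop results jump around.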
Takeaway - I'll do some experiments on my systems (one Nehalem, one Skylake) and see if I can pin this down to a reproducible case. If so, I'll add this to Okoba's case in the bug tracker. Otherwise I'll do some heavy backpedaling to save face.
Cheers,
MathMan