Yes, that was a good article for understanding how the cache works.
But you don't know how the compiler optimizes, or simply "translates", your code. Looking at the generated assembly would probably be a better way to analyze what is really happening. And are you sure about your interpretation of the cache algorithm? I don't think anyone knows it well enough to determine exactly how it behaves in this way (multi-level, N-way set-associative caches raise a lot of questions).
Another important point is that you are running a simple "routine" at user level, with lower priority than other "routines" on your system, such as drivers or other software that runs in kernel mode or at a different priority level.
So you are "measuring" the performance of your whole system, not of the processor cache, because your routine will be interrupted many times, and you don't know whether your code is running in more than one thread (even within the same core).
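One common way to reduce (not remove) this noise is to pin the process to a single CPU and repeat the measurement many times, then compare the minimum with the median and maximum: a large spread suggests the routine was interrupted. The sketch below only illustrates that pattern. The names `pin_to_current_cpu`, `time_runs`, `summarize` and `walk_buffer` are hypothetical, pinning uses `os.sched_setaffinity`, which is only available on some platforms such as Linux, and in Python the interpreter overhead dominates, so the numbers say little about the cache itself.

```python
import os
import time


def pin_to_current_cpu():
    """Try to restrict this process to one CPU; return True on success."""
    # os.sched_setaffinity is not available on every platform (assumption: Linux).
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        cpu = min(os.sched_getaffinity(0))  # pick one CPU we are allowed to use
        os.sched_setaffinity(0, {cpu})
        return True
    except OSError:
        return False


def time_runs(routine, repeats=30):
    """Run `routine` several times; return each wall-clock duration in ns."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        routine()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    """Return min, median and max; a wide spread hints at interruptions."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }


def walk_buffer(size=1 << 16, stride=64):
    """Hypothetical stand-in for the routine under test: strided reads."""
    buf = bytearray(size)
    total = 0
    for i in range(0, size, stride):
        total += buf[i]
    return total


if __name__ == "__main__":
    pinned = pin_to_current_cpu()
    stats = summarize(time_runs(walk_buffer))
    print("pinned:", pinned, stats)
```

Even with pinning, the kernel can still interrupt the routine, so the minimum is only the least disturbed run, not a pure cache measurement.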
There are also all the new "innovations" in recent architectures, such as hybrid cores, ITD (Intel Thread Director) and others ...