@ASerge:
Thanks. So I have to do the extra step in non-ASM methods? Alright.
@Akira1364:
Sometimes the programmer has more information available than the compiler, which can make all the difference. Especially when the compiler's usual approach isn't really designed for the problem.
I'm writing an emulator (several million opcodes per second) with an interpreter core. The naive approach reads a byte (the opcode) from virtual memory in an endless loop and runs the appropriate opcode handler via a case-of construct, which defeats the host CPU's branch predictor: the single dispatch branch has 256 possible targets. A more refined approach uses an array of pointers to labels and jumps between them via computed goto at the end of every opcode handler, so each handler ends in its own indirect branch that the predictor can learn separately. The "problem" is that in my case this burns 2*256*8 bytes = 4 KB of the host CPU's L1 data cache (x64 host CPU, 2 guest CPU modes, 8-byte pointers), which is usually 32 KB per core. That's 12.5% of high-speed cache occupied that could otherwise hold other data.
So my idea was to take the opcode byte and transform it (without further memory accesses) into the opcode handler's memory address. That part is already working, using a fixed virtual memory layout (yes, highly platform specific, but that's OK) where I copy each handler's program code to its own strategic position. (The handlers can't touch global variables, since x64 RIP-relative addressing breaks when code is moved, but that's no problem.) The open problem is how to safely return from there when it's time to run the rest of the program.
(This is a "for fun" project, and "just write a JITter" / "write the whole program in ASM" wouldn't be fun.)