Thank you very much for this. That code is fast because you're making a single call to a random position, which is cool. But if you read the entire file serially, you'll find it's quite slow (my 36GB file takes over 2 minutes).
You can combine @ASerge's solution with ReadFile.
Basically, using ReadFile you can read as much as 4GB in one shot into the file mapping (at least in theory; the byte count parameter is a 32-bit DWORD). IOW, you don't need to read one page at a time, which tends to be slow because every time a page is accessed for the first time it causes a fault handled in ring 0 to associate physical memory with the page.
Doing it that way, the only thing you have to bother with is calculating the correct offset into the mapping for each read. Quite simple actually, and extremely quick.
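To make the offset math concrete, here's a minimal sketch. The chunk size, helper names and the loop in the comment are my own illustration (not from any posted code); only the arithmetic is portable, and the ReadFile call itself is Windows-only, so it is shown as a comment.

```c
#include <stdint.h>

/* Hypothetical chunk size: read up to 1 GiB per ReadFile call.
   (ReadFile takes a 32-bit byte count, so each call is capped below 4 GiB.) */
#define CHUNK_SIZE (1ULL << 30)

/* Offset into the mapping for the i-th chunk. */
static uint64_t chunk_offset(uint64_t chunk_index) {
    return chunk_index * CHUNK_SIZE;
}

/* Number of bytes to request for the i-th chunk; 0 means we're done. */
static uint32_t chunk_bytes(uint64_t file_size, uint64_t chunk_index) {
    uint64_t off = chunk_offset(chunk_index);
    if (off >= file_size) return 0;               /* past end of file */
    uint64_t left = file_size - off;
    return (uint32_t)(left < CHUNK_SIZE ? left : CHUNK_SIZE);
}

/* The read loop itself (Windows-only, comments only):
     for (i = 0; (n = chunk_bytes(size, i)) != 0; ++i)
         ReadFile(hFile, mapping + chunk_offset(i), n, &bytesRead, NULL);
*/
```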
If you want to maximize speed, you could use ReadFileEx to asynchronously read blocks into the mapping and process already-read blocks while new ones are being read. Disclaimer: I know it can be done, but I've never indulged in it because I never ran into a case where the performance gain would be worth the additional code complexity.
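The fiddly part of the asynchronous variant is that ReadFileEx takes the file position as two 32-bit halves inside an OVERLAPPED structure. A hedged sketch (the helper name and BLOCK_SIZE are mine; the Windows calls are shown only in the comment, not something I've benchmarked):

```c
#include <stdint.h>

/* Split a 64-bit file offset into the two 32-bit fields
   (Offset / OffsetHigh) that OVERLAPPED expects. */
static void split_offset(uint64_t off, uint32_t *low, uint32_t *high) {
    *low  = (uint32_t)(off & 0xFFFFFFFFu);
    *high = (uint32_t)(off >> 32);
}

/* Windows-only outline (comments only):
     OVERLAPPED ov = {0};
     split_offset(block * BLOCK_SIZE, (uint32_t *)&ov.Offset,
                                      (uint32_t *)&ov.OffsetHigh);
     ReadFileEx(hFile, buf + block * BLOCK_SIZE, BLOCK_SIZE, &ov, OnReadDone);
     // ...process the previously completed block while this read is in
     // flight, then SleepEx(INFINITE, TRUE) so the completion routine runs.
*/
```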
ETA: I forgot to mention one thing.
The first time you allocate a large block of memory, Windows normally allocates NO memory at all; it only gives the program address space. If the space is committed, Windows will automatically associate a page of real memory with an address the first time that address is touched (read or written).
The net effect is that the first time a page is accessed, a transition to ring 0 takes place for the O/S to map physical memory to that virtual address. This has one very noticeable side effect: the first read of the file will take noticeably longer than subsequent ones, because the first pass incurs all those transitions for the O/S to associate real memory with the virtual addresses.
You want to make it as fast as ring-3 code can make it? Use VirtualAllocEx to map a fairly large buffer (multi-gigabyte) while specifying MEM_LARGE_PAGES. That causes the O/S to associate physical memory with the entire range of address space requested, up front. It also means the allocation will take significantly longer than a normal 4K page allocation, because the O/S is actually committing memory to satisfy the request.

If that sounds good to you, be aware there are significant downsides to that method. The first is that enough physical memory in contiguous blocks of 4MB needs to be available to satisfy the request; IOW, it is a very real possibility that the request fails because there aren't enough 4MB contiguous blocks to cover the requested range. The other significant problem is that the memory used is NOT paged: it is locked and used exclusively by your process. IOW, you can very easily make one of today's fastest machines slower than an original 64KB IBM PC by allocating memory that way. Succinctly, it is the easiest way to starve the system of memory. Fortunately, on top of all that, you need SeLockMemoryPrivilege to do it at all.
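For reference, a hedged sketch of what that allocation could look like. The 2MiB large-page size here is an assumption for illustration (the real value comes from GetLargePageMinimum, and can differ from the 4MB figure above depending on the CPU); the Windows calls are shown as a comment since they require the privilege to actually run.

```c
#include <stdint.h>

/* Assumed large-page size for illustration only; a real program must
   call GetLargePageMinimum() instead of hard-coding this. */
#define LARGE_PAGE (2ULL * 1024 * 1024)

/* Large-page allocations must be a multiple of the large-page size,
   so round the requested size up to the next multiple. */
static uint64_t round_to_large_page(uint64_t bytes) {
    return (bytes + LARGE_PAGE - 1) / LARGE_PAGE * LARGE_PAGE;
}

/* Windows-only outline (comments only; requires SeLockMemoryPrivilege
   to be held and enabled by the calling process):
     SIZE_T sz = round_to_large_page(wanted);
     void *p = VirtualAlloc(NULL, sz,
                            MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                            PAGE_READWRITE);
     // p == NULL if contiguous physical memory (or the privilege) is missing
*/
```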