Ok, this might be a case of "turtles all the way down", but shouldn't this be an Object Pascal program rather than mostly assembly? As Thaddy said, that's why you would want to compare the output of the different backends. I'm pretty sure the winning entry would be a lot slower on any platform other than "LinuxX64", with the main work of course being generating the hash and looking it up.
Well, I would probably just use a 32 GB buffer indexed by the CRC32 of the name (2^32 slots of 8 bytes each), with each slot holding an index into the array of names, but then again, I do have 64 GB of RAM.
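A rough sketch of that lookup, in Python only to keep it short. Assumptions that are mine: the table is shrunk to 2^20 slots so it runs on any machine (the real thing would use all 32 bits of the CRC), the names `TABLE_BITS`, `slot_of` and `lookup_or_add` are made up, and collisions just raise an error instead of being handled.

```python
import zlib
from array import array

TABLE_BITS = 20                    # assumption: demo size, the full idea uses 32 bits
TABLE_SIZE = 1 << TABLE_BITS
EMPTY = -1

# Full-size version: 2**32 slots * 8 bytes per slot = 32 GiB.
FULL_TABLE_BYTES = (1 << 32) * 8

names = []                                  # the array with names
slots = array("q", [EMPTY]) * TABLE_SIZE    # 8-byte signed slots, all empty


def slot_of(name: bytes) -> int:
    """Use the (truncated) CRC32 of the name directly as the slot number."""
    return zlib.crc32(name) & (TABLE_SIZE - 1)


def lookup_or_add(name: bytes) -> int:
    """Return the index of `name` in `names`, adding it on first sight."""
    s = slot_of(name)
    idx = slots[s]
    if idx == EMPTY:
        names.append(name)
        idx = len(names) - 1
        slots[s] = idx
    elif names[idx] != name:
        # Two different names landed in one slot; a real program needs a plan here.
        raise ValueError("CRC32 collision")
    return idx
```

The point is that there is no probing and no rehashing: one CRC, one array read. Whether that beats a small cache-friendly hash table is something that would have to be measured.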
To sort the list of names, I would start by loading the list into memory and creating 27 threads, each of which keeps a sorted list of names. Then read an entry and hand it to the correct thread.
Ok, probably not all that fast, but it's the first thing I would try.
After thinking it over:
1. create 27 * 27 threads, each with a sorted list (name, min, max, total, number):
one thread per pair of leading characters, where each character is one of 26 letters or a 27th class for numbers and spaces
2. read the file into memory in blocks
3. have the main thread hand each line to the right thread (a name starting with Aa (or Aá, aa, or AA) goes to the first thread, Ab to the second, etc.)
4. create the output in blocks and write it
I think I'll experiment with that in a few days.
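The steps above could be sketched roughly like this, again in Python just for brevity. Assumptions that are mine: lines look like `name;value`, accents are folded with Unicode NFD so that Aá lands with Aa, a worker is only started when its bucket first gets a line (the plan above creates all 27 * 27 up front), and the block-wise reading and writing of steps 2 and 4 is left out, so `process` simply takes lines and returns the output lines. All names (`char_class`, `bucket_of`, `Worker`, `process`) are made up.

```python
import bisect
import queue
import threading
import unicodedata

CLASSES = 27  # 26 letters + one class for numbers and spaces


def char_class(ch: str) -> int:
    """Map a character to 0..25 for a..z (case and accents folded), else 26."""
    base = unicodedata.normalize("NFD", ch)[0].lower()
    if "a" <= base <= "z":
        return ord(base) - ord("a")
    return CLASSES - 1


def bucket_of(name: str) -> int:
    """Bucket number from the first two characters: Aa -> 0, Ab -> 1, ..."""
    padded = (name + "  ")[:2]  # short names are padded with spaces
    return char_class(padded[0]) * CLASSES + char_class(padded[1])


class Worker(threading.Thread):
    def __init__(self):
        super().__init__()
        self.inbox = queue.Queue()
        self.names = []   # kept sorted
        self.stats = {}   # name -> [min, max, total, number]

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:          # sentinel: no more lines
                return
            name, value = item
            entry = self.stats.get(name)
            if entry is None:
                bisect.insort(self.names, name)
                self.stats[name] = [value, value, value, 1]
            else:
                entry[0] = min(entry[0], value)
                entry[1] = max(entry[1], value)
                entry[2] += value
                entry[3] += 1


def process(lines):
    workers = {}
    for line in lines:                # main hands each line to its thread
        name, _, raw = line.partition(";")
        b = bucket_of(name)
        w = workers.get(b)
        if w is None:
            w = workers[b] = Worker()
            w.start()
        w.inbox.put((name, float(raw)))
    for w in workers.values():
        w.inbox.put(None)
    for w in workers.values():
        w.join()
    out = []                          # buckets in order, names sorted per bucket
    for b in sorted(workers):
        w = workers[b]
        for name in w.names:
            lo, hi, total, n = w.stats[name]
            out.append(f"{name}={lo:.1f}/{total / n:.1f}/{hi:.1f}")
    return out
```

Note that the output is ordered by bucket first and by plain string comparison inside a bucket, so "Abha" and "abha" end up next to each other but it is not a strict overall sort. Handing over single lines through a queue is also likely to be the slow part; passing whole blocks per thread would be the next thing to try.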