Have you compared the hand-crafted assembly with what the Pascal compiler generates?
It seems FPC is generating far from ideal code for this. Other compilers handled it quite differently.
I don't think we have enough info to blame the compiler. We *might* if you showed us the code that was used for the other languages, as well as any compile options.
Without that, and without a comparison of the assembler that was generated in any of the other cases, we have no way of drilling down to where the issue might be.
I did change the WRITELN statements, and the hash matches:
filehash -p C:\Temp\FasterPrime.TXT
SHA2_256:
f13156e206e68386cb86b13093520acc5da04c875926411bd4df4e76590e81cf {File} C:\Temp\FasterPrime.TXT
Still ~12.4 seconds.
My fast prime generator uses a segmented sieve routine. Blisteringly fast. My normal prime generator produces the same 1M prime numbers in ~13 seconds without printing anything, or ~22 when printing. It could be faster, but it checks for CTRL-C and CTRL-BREAK, among other things, so that it can report a status if you abort the run prematurely.
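
For anyone not familiar with the term, here is a rough sketch of what a segmented sieve does. It is written in Python purely for illustration and is not the Pascal routine mentioned above; the function name `segmented_sieve` and the default `segment_size` are assumptions made for this example. The idea is to sieve the small "base" primes up to sqrt(limit) once, then cross off their multiples one fixed-size window at a time, so the working set stays small.

```python
import math

def segmented_sieve(limit, segment_size=32768):
    """Return all primes <= limit using a segmented sieve of Eratosthenes."""
    if limit < 2:
        return []
    root = math.isqrt(limit)

    # Step 1: ordinary sieve to find the base primes up to sqrt(limit).
    is_prime = bytearray([1]) * (root + 1)
    is_prime[0:2] = b"\x00\x00"          # 0 and 1 are not prime
    for i in range(2, math.isqrt(root) + 1):
        if is_prime[i]:
            for j in range(i * i, root + 1, i):
                is_prime[j] = 0
    base_primes = [i for i in range(2, root + 1) if is_prime[i]]
    primes = list(base_primes)

    # Step 2: sieve the range (root, limit] one window at a time.
    low = root + 1
    while low <= limit:
        high = min(low + segment_size - 1, limit)
        segment = bytearray([1]) * (high - low + 1)
        for p in base_primes:
            # First multiple of p inside the window, but never below p*p.
            start = max(p * p, ((low + p - 1) // p) * p)
            for m in range(start, high + 1, p):
                segment[m - low] = 0
        primes.extend(low + k for k, flag in enumerate(segment) if flag)
        low = high + 1
    return primes

if __name__ == "__main__":
    print(segmented_sieve(50))
```

A compiled implementation would normally size the window to fit the CPU cache and skip even numbers; this sketch leaves both out to keep the logic readable.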