Thank you for your help -- I have just what I want and realize I could have done better with the problem statement.
I forgot to mention that I had modified my fpc.cfg file with "-Mtp" for Turbo Pascal compatibility, so you had to discover that yourself. I'm also sorry I didn't make it clear that I wanted the "lodsd, div ebx, stosd" fast loop at the heart of the procedure. That loop was shown in TEST1.TXT in my first post of the thread (modified with "db $66" operand-size prefixes, as required in Turbo Pascal 6.0), and it is in marcov's test_div2.txt. Its speed gain is the main motivation for going to assembly. In TEST1.TXT I added comments justifying coding it 16 times inline to reduce the impact of the branch instruction, which gains another 20% in speed. This procedure is part of an "approximate e (2.718...) to 1,000,000 decimal places" program that spends most of its time in those 3 assembly language instructions. I should have said all that, and thanks again.
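
For readers following along: the "lodsd, div ebx, stosd" loop amounts to long division of a multiword number by a 32-bit divisor, with EDX carrying the remainder from one word into the next. Below is a minimal C sketch of that idea. It is not the code from TEST1.TXT or test_div2.txt; the name div_words, the most-significant-word-first layout, and the in-place update are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: divide a multiword number by a 32-bit divisor.
 * Assumes words[0] is the most significant word and that the quotient
 * overwrites the input. Returns the final remainder. */
uint32_t div_words(uint32_t *words, size_t count, uint32_t divisor)
{
    assert(divisor != 0);

    uint32_t rem = 0;  /* plays the role of EDX between iterations */
    for (size_t i = 0; i < count; i++) {
        /* lodsd: fetch the next word; together with rem it forms EDX:EAX */
        uint64_t cur = ((uint64_t)rem << 32) | words[i];
        /* div ebx: quotient fits in 32 bits because rem < divisor */
        uint32_t quot = (uint32_t)(cur / divisor);
        rem = (uint32_t)(cur % divisor);
        /* stosd: store the quotient word */
        words[i] = quot;
    }
    return rem;
}
```

Calling div_words on the same array with successive divisors would be one way to build up the terms of a series for e, though how the original program arranges that is not shown in this post. Writing the loop body out 16 times, as described above, would correspond to manually unrolling the for loop.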