Turbo Pascal 6.0, 80386 assembly language, Free Pascal, 64-bit CPU

Rick314:
Thank you for your help -- I have just what I want and realize I could have done better with the problem statement.

I forgot to mention I modified my fpc.cfg file with "-Mtp" for Turbo Pascal compatibility and you had to discover that.  Also sorry I didn't make it clear I wanted the "lodsd, div ebx, stosd" fast loop at the heart of the procedure.  This was shown in TEST1.TXT in my first post of the thread (modified with "db $66" opcodes as required in Turbo Pascal 6.0), and is in marcov's test_div2.txt.  That speed gain is the main motivation for going to assembly.  In TEST1.TXT I added comments justifying coding it 16 times inline to reduce the impact of the branch instruction, gaining another 20% speed improvement.  This procedure is part of an "approximate e (2.718...) to 1,000,000 decimal places" program with most time spent doing those 3 assembly language instructions.  I should have said all that, and thanks again.
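For readers following along, here is a minimal sketch of what the "lodsd, div ebx, stosd" loop computes, written in portable C rather than Pascal or assembly so the arithmetic is easy to check. The function name `div_words` and the storage order (most significant 32-bit word first, which matches lodsd/stosd walking upward through memory) are assumptions for illustration, not taken from TEST1.TXT.

```c
#include <stddef.h>
#include <stdint.h>

/* Divide a multi-word number in place by a 32-bit divisor.
 * Assumption: words[0] is the most significant 32-bit word.
 * The divisor must be nonzero. Returns the final remainder. */
uint32_t div_words(uint32_t *words, size_t n, uint32_t divisor)
{
    uint32_t rem = 0;                      /* plays the role of EDX */

    for (size_t i = 0; i < n; i++) {
        /* lodsd: fetch the next word into EAX, forming EDX:EAX */
        uint64_t cur = ((uint64_t)rem << 32) | words[i];

        /* div ebx + stosd: quotient word is written back in place.
         * rem < divisor, so the quotient always fits in 32 bits. */
        words[i] = (uint32_t)(cur / divisor);

        /* the remainder stays in EDX and carries into the next word */
        rem = (uint32_t)(cur % divisor);
    }
    return rem;
}
```

Calling `div_words` on the two-word value {1, 0} (that is, 2^32) with divisor 3 leaves {0, 1431655765} in the array and returns a remainder of 1.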

marcov:

--- Quote from: Rick314 on February 28, 2015, 07:35:05 pm ---I forgot to mention I modified my fpc.cfg file with "-Mtp" for Turbo Pascal compatibility and you had to discover that. 

--- End quote ---

Don't worry. You learn about such things after almost 17 years of FPC :-)


--- Quote ---Also sorry I didn't make it clear I wanted the "lodsd, div ebx, stosd" fast loop at the heart of the procedure. 

--- End quote ---

Be careful. As I said, lodsd/stosd USED to be the fast way. However, starting with Pentium-I, optimization rules changed. Be sure to benchmark my original vs my -2 to make sure that it is actually faster.
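One way to set up such a comparison is sketched below in portable C. It does not reproduce the two assembly variants from the attachments; it only shows the harness pattern (same input, same repeat count, results checked for equality) using a plain loop and a 4x unrolled loop as stand-ins. All names here (`div_plain`, `div_unrolled4`, `time_variant`) are hypothetical, and most-significant-word-first storage is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef uint32_t (*div_fn)(uint32_t *, size_t, uint32_t);

/* One long-division step: divide (rem:*word) by d, store the quotient,
 * return the new remainder. d must be nonzero. */
static inline uint32_t div_step(uint32_t *word, uint32_t rem, uint32_t d)
{
    uint64_t cur = ((uint64_t)rem << 32) | *word;
    *word = (uint32_t)(cur / d);
    return (uint32_t)(cur % d);
}

/* Variant A: one step per loop iteration. */
uint32_t div_plain(uint32_t *w, size_t n, uint32_t d)
{
    uint32_t rem = 0;
    for (size_t i = 0; i < n; i++)
        rem = div_step(&w[i], rem, d);
    return rem;
}

/* Variant B: four steps per iteration, then a tail loop for leftovers. */
uint32_t div_unrolled4(uint32_t *w, size_t n, uint32_t d)
{
    uint32_t rem = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        rem = div_step(&w[i],     rem, d);
        rem = div_step(&w[i + 1], rem, d);
        rem = div_step(&w[i + 2], rem, d);
        rem = div_step(&w[i + 3], rem, d);
    }
    for (; i < n; i++)
        rem = div_step(&w[i], rem, d);
    return rem;
}

/* Time `repeats` runs of fn on a fresh copy of src; returns CPU seconds.
 * The memcpy cost is included, but it is identical for every variant. */
double time_variant(div_fn fn, const uint32_t *src, uint32_t *work,
                    size_t n, uint32_t d, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        memcpy(work, src, n * sizeof *work);
        fn(work, n, d);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Before trusting any timing, it is worth confirming that both variants leave identical words and remainders for the same input, and running each several times on the actual target CPU, since the outcome can differ between processor generations.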

Rick314:

--- Quote from: marcov on February 28, 2015, 03:33:59 pm ---I attached a variant that uses stosd/lodsd for comparison with the old version.
IIRC those instructions generally aren't faster without rep anymore though.

--- End quote ---


--- Quote from: marcov on February 28, 2015, 08:09:01 pm ---Be careful. As I said, lodsd/stosd USED to be the fast way. However, starting with Pentium-I, optimization rules changed. Be sure to benchmark my original vs my -2 to make sure that it is actually faster.

--- End quote ---
Please clarify what you meant by "...generally aren't faster without rep anymore though."  (rep?)

I will benchmark both versions.  Thanks again for your help.

marcov:

--- Quote from: Rick314 on February 28, 2015, 09:44:58 pm --- Please clarify what you meant by "...generally aren't faster without rep anymore though."  (rep?)

--- End quote ---

I mean that, afaik, using lods<x> and stos<x> without the "rep" prefix is generally discouraged for performance reasons on processors after the 486.

There are (rare) exceptions though, especially on some AMD processors, where some combinations with short opcodes avoid a decoding bottleneck.
