@BrunoK
bk_1 : the 32 bit mode uses the in-processor x87 ln intrinsic operation, while the 64 bit mode uses a more exacting compiler procedure in 3.2.2\rtl\inc\genmath.inc from line 1370. It seems that the intrinsic x87 floating point is twice as fast as the windows x86_64 implementation.
Nearly correct - it uses 'FYL2X / FYL2XP1' plus a multiplication by a constant, taking ~80-90 cycles per Agner Fog's measurements. The 64 bit mode variant for Double is, however, not more exact: the Intel microcode implementations of the x87 'FYL2X / FYL2XP1' instructions for Extended are correct to ~1.5 units in the last place, which is much better than anything achievable for Double. See the Intel x87 documentation - e.g. here
http://www.infophysics.net/x87.pdf.
Edit: Oops, sorry - maybe I misread 'intrinsic' as 'mnemonic' here?
In general terms, there has been a lot of progress recently (the last 10-15 years) on floating point libraries, in terms of both speed and accuracy, for the Single & Double algebraic and transcendental functions. Search for the 'Core Math' project of Paul Zimmermann if you want to take a peek. Modern libraries are now faster than the x87 microcode for Single & Double. If the library in the FPC RTL is slower, as you suggest, then it probably hasn't been updated for quite some time.
The Extended type is nowadays in a major retreat, as far as I can see.
@Thaddy
That is a misconception. SSE3+, AVX<x> and X86-64-V<x> are much faster and even with better precision in some cases. But all of these do not belong to the default optimization up to level O2. You can get upwards of 128 bit - here up to 512 - precision from those, and that is a lot more than 80 bit. It just takes a bit of understanding of what exactly you want... and what the fpu settings do...
I think the misconception is on your side here, sorry. All vector extensions from Intel (and also ARM / RISC-V and others, as far as I'm aware) top out at Double (the IEEE FP64 floating point type). What they do offer is the ability to operate on multiple instances of FP32 or FP64 at the same time, but they provide no added precision by doing so. You can do e.g. 2 / 4 / 8 multiplications of FP64 in parallel with SSE / AVX2 / AVX512 - but this is not the same as doing 1 FP512 multiplication!
They are faster than the x87 on the basic operations like +, -, *, but at the same time have lost all the algebraic / transcendental functions (with the exception of SQRT and INVSQRT), which now must be implemented by libraries.
@tetrastes:
That is not to say that there are no floating point types of high precision. Even if Extended is removed, maybe it can be used to some extent (?) for all those cool things, at least to have 80-bit precision.
Yes they can - there are some libraries implementing what is called double-double or quad-double: a non-IEEE floating point format with a 106 / 212 bit significand. However, this is stitched together from 2 / 4 IEEE FP64 values, and though fast, it seems to have been neglected recently (<= I may be in error here, haven't followed thoroughly).