@schuler,
here is my understanding of the use of Fused Multiply Accumulate (FMA). It is a bit lengthy, so please bear with me.
TL;DR: the functions pcr_fmaf and pcr_fma have to be replaced. There seems to be no way to avoid a bit-wizardry-based fallback option.
9. Pascal CORE-MATH currently uses the following implementation to make FMA3 instructions accessible under x86-64:
function pcr_fmaf(x, y, z: Single): Single;
{$IFDEF CPUX86_64}
// Pure-asm: System V AMD64 ABI passes x→xmm0, y→xmm1, z→xmm2; result in xmm0.
// VFMADD213SS: xmm0 = xmm0 * xmm1 + xmm2 (correctly rounded IEEE 754 FMA).
assembler;
asm
vfmadd213ss xmm0, xmm1, xmm2
end;
{$ELSE}
begin
// 80-bit fallback: correctly rounded for singles (Extended has enough mantissa bits).
Result := Single(Extended(x) * Extended(y) + Extended(z));
end;
{$ENDIF}
function pcr_fma(x, y, z: Double): Double;
{$IFDEF CPUX86_64}
// Pure-asm: System V AMD64 ABI passes x→xmm0, y→xmm1, z→xmm2; result in xmm0.
// VFMADD213SD: xmm0 = xmm0 * xmm1 + xmm2 (correctly rounded IEEE 754 FMA).
assembler;
asm
vfmadd213sd xmm0, xmm1, xmm2
end;
{$ELSE}
begin
// 80-bit fallback (double-rounding — not true FMA; may lose 1 ULP in rare cases).
Result := Double(Extended(x) * Extended(y) + Extended(z));
end;
{$ENDIF}
9.1. The asm variant depends only on {$IFDEF CPUX86_64}, which is not a sufficient condition. There are x86-64 CPUs that do not support FMA3 (e.g. AMD Bulldozer), and on those this will lead to illegal-instruction exceptions.
9.1.1. On x86-64, at least all CPUs that support AVX2 also support FMA3. There is a small set of CPUs that support the latter without supporting the former, e.g. AMD Steamroller.
9.1.2. Afaik FPC 3.2.2 has no reliable internal mechanism for detecting support of FMA on different CPU architectures (x86-64, ARM, etc.) or of AVX2 on x86-64, e.g. via a publicized run-time define from the compiler.
9.1.3. Suggestion: at least make the asm variant depend on {$IFDEF AVX2}, where the define has to be provided via the command line with '-dAVX2'.
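A minimal sketch of what 9.1.3 could look like for pcr_fmaf, not a definitive implementation. Assumptions: AVX2 is a user-supplied define (the compiler does not set it), the check is combined with CPUX86_64 so that a stray '-dAVX2' on another target does not select the asm path, Intel assembler syntax is enabled explicitly, and the non-asm branch is only a placeholder that is not a fused operation.

```pascal
program FmaGateSketch;
{$mode objfpc}
{$IFDEF CPUX86_64}{$ASMMODE INTEL}{$ENDIF}

// AVX2 must be passed by the user (fpc -dAVX2 ...); it is an assumption of
// this sketch and not a define provided by the compiler.
{$IF DEFINED(CPUX86_64) AND DEFINED(AVX2)}
function pcr_fmaf(x, y, z: Single): Single; assembler; nostackframe;
asm
  vfmadd213ss xmm0, xmm1, xmm2   // xmm0 := xmm0 * xmm1 + xmm2, one rounding
end;
{$ELSE}
function pcr_fmaf(x, y, z: Single): Single;
begin
  // Placeholder only: two roundings, NOT a fused operation.
  Result := x * y + z;
end;
{$ENDIF}

begin
  WriteLn(pcr_fmaf(2.0, 3.0, 4.0):0:1);   // prints 10.0 on either path
end.
```

Building with 'fpc -dAVX2' selects the asm path; without the define the placeholder is compiled, so the binary never contains an FMA3 instruction unless the user asked for it.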
9.2. The fallback variant uses a double-rounding approach via the Extended type. This is not going to work cross-platform, not even for the target architecture x86-64.
9.2.1. The Programmer’s Guide for Free Pascal, Version 3.2.2, states:
Extended
For Intel 80x86 processors, the extended type takes up 10 bytes of memory space. For more
information on the extended type, consult the Intel Programmer’s reference.
For all other processors which support floating point operations, the extended type is a nickname
for the type which supports the most precision, this is usually the double type. On processors
which do not support co-processor operations (and which have the {$E+} switch), the extended
type usually maps to the single type.
On Win64 in particular, the type Extended is an alias for Double, leaving zero bits of redundancy in the mantissa for pcr_fma. On some other platforms, where Extended maps to Single, this even holds for pcr_fmaf.
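Whether a given target actually has a wider Extended can be checked with a few lines. This is only a sketch for inspecting one's own platform; the program name and messages are made up for illustration.

```pascal
program ExtendedWidthCheck;
{$mode objfpc}
begin
  WriteLn('SizeOf(Single)   = ', SizeOf(Single));
  WriteLn('SizeOf(Double)   = ', SizeOf(Double));
  WriteLn('SizeOf(Extended) = ', SizeOf(Extended));
  // If Extended is not larger than Double, it cannot carry extra mantissa
  // bits, and the Extended-based fallback for pcr_fma degenerates to plain
  // Double arithmetic.
  if SizeOf(Extended) > SizeOf(Double) then
    WriteLn('Extended is wider than Double on this target')
  else
    WriteLn('Extended is not wider than Double on this target');
end.
```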
9.2.2. Due to the above the fallback alternative must be implemented via bit-wizardry.
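To illustrate the kind of bit-wizardry meant here, the following is a rough sketch of a software fallback for the single-precision case only. It is not the CORE-MATH method: it swaps in a known technique (exact product in Double, error-free TwoSum, then rounding to odd before the final conversion to Single). SoftFmaf and TDoubleBits are hypothetical names. The sketch assumes that Double arithmetic is evaluated in strict IEEE binary64 with round-to-nearest (e.g. SSE2 on x86-64, not x87 with excess precision). A Double fallback for pcr_fma needs considerably more machinery and is not covered.

```pascal
program SoftFmafSketch;
{$mode objfpc}

type
  // Overlay to reach the raw bits of a Double.
  TDoubleBits = packed record
    case Boolean of
      False: (F: Double);
      True:  (U: QWord);
  end;

// Hypothetical helper, not part of Pascal CORE-MATH.
function SoftFmaf(x, y, z: Single): Single;
var
  dx, dy, dz, xy, t, err: Double;
  s: TDoubleBits;
begin
  dx := x; dy := y; dz := z;
  xy := dx * dy;          // exact: a 24 x 24 bit product fits into 53 bits
  s.F := xy + dz;         // rounded to nearest Double
  if ((s.U shr 52) and $7FF) <> $7FF then   // leave Inf/NaN untouched
  begin
    // Knuth TwoSum: s.F + err equals xy + dz exactly.
    t := s.F - xy;
    err := (xy - (s.F - t)) + (dz - t);
    // Round to odd: an inexact sum with an even last bit moves one step
    // towards the exact value, which always gives an odd last bit.
    if (err <> 0.0) and ((s.U and 1) = 0) then
      if (err > 0.0) = (s.F > 0.0) then Inc(s.U) else Dec(s.U);
  end;
  Result := s.F;          // final rounding to Single
end;

var
  x, z, r: Single;
begin
  x := 1.0 + 1.0 / 8388608.0;        // 1 + 2^-23
  z := -(1.0 + 1.0 / 4194304.0);     // -(1 + 2^-22)
  r := SoftFmaf(x, x, z);            // exact value of x*x + z is 2^-46
  WriteLn(r = 1.0 / 70368744177664.0);   // prints TRUE
end.
```

The design point is that rounding to odd in a 53-bit format followed by rounding to nearest in a 24-bit format gives the same result as a single rounding of the exact value, so no Extended type is needed for pcr_fmaf.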
9.3. Potential future option
9.3.1. The FPC 3.2.2 RTL exports the intrinsics FMASingle and FMADouble together with the run-time defines FPC_HAS_FAST_FMA_SINGLE and FPC_HAS_FAST_FMA_DOUBLE. The intrinsics are marked 'Do not use' in the run-time manual. Though not reliable yet, this might become available in future FPC versions. If so, the asm variant could be replaced, and the pcr_... functions could then be fully inlined by the compiler.
9.3.2. I could convince FPC 3.2.2 to compile the intrinsics with the following settings: '-CfAVX2 -CpCOREAVX2 -OpCOREAVX2'. I replaced the pcr_... functions with the equivalent intrinsics and checked the generated object code. A quick run of the benchmark utility yielded an unchanged 'GlobalSink' value. However, I could not convince FPC 3.3.1 to do the same, even though I tried several target architecture settings.
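The envisioned co-operation of intrinsic and define from 9.3.1 might look roughly like the sketch below. It assumes that FMASingle(a, b, c) computes a * b + c with a single rounding and that FPC_HAS_FAST_FMA_SINGLE is only defined when the intrinsic is usable for the selected target; both should be checked against the RTL sources. The non-intrinsic branch is a placeholder and not a fused operation.

```pascal
program FmaIntrinsicSketch;
{$mode objfpc}

function pcr_fmaf(x, y, z: Single): Single; inline;
begin
{$IFDEF FPC_HAS_FAST_FMA_SINGLE}
  // Assumed semantics: x * y + z with one rounding.
  Result := FMASingle(x, y, z);
{$ELSE}
  // Placeholder only: two roundings, NOT a fused operation.
  Result := x * y + z;
{$ENDIF}
end;

begin
  WriteLn(pcr_fmaf(2.0, 3.0, 4.0):0:1);   // prints 10.0 on either path
end.
```

With this shape no asm block is needed, so the compiler is free to inline pcr_fmaf at the call site.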
I would really appreciate feedback from the FPC core devs on whether my summary above is correct, and especially on whether my envisioned co-operation of intrinsics and defines is the way to go.