
Author Topic: AI assisted translation of CORE-MATH to Free Pascal  (Read 5046 times)

MathMan

  • Hero Member
  • *****
  • Posts: 504
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #15 on: April 14, 2026, 09:09:53 am »
@nanobit

Well spotted, thank you. I only tested on a 64-bit machine and this escaped me.

Question - you are using the 'and $ffffffff' to avoid possible range or overflow checks, correct? My intention was to enclose the function in

Code: Pascal
{$push}{$R-}{$Q-}
{$pop}

So the anding could be skipped.
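
A minimal sketch of how such a wrapper might look. The helper names Mul32x32 and CombineHalves are hypothetical and not taken from the project; only the {$push}{$R-}{$Q-} ... {$pop} pattern comes from the post above.

```pascal
program CheckOffSketch;
{$mode objfpc}

{$push}{$R-}{$Q-}
// Hypothetical helper: full 64-bit product of two 32-bit values.
// Both operands are widened by typecast before the multiplication,
// so the product always fits and no 'and $FFFFFFFF' mask is needed.
function Mul32x32(a, b: DWord): QWord;
begin
  Result := QWord(a) * QWord(b);
end;

// Hypothetical helper: glue a high and a low 32-bit half into one QWord.
function CombineHalves(HiPart, LoPart: DWord): QWord;
begin
  Result := (QWord(HiPart) shl 32) or LoPart;
end;
{$pop}

begin
  WriteLn(Mul32x32($FFFFFFFF, $FFFFFFFF));
  WriteLn(CombineHalves(1, 2));
end.
```

Because the directives are restored with {$pop}, the rest of the unit keeps whatever range and overflow settings the project was built with.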

nanobit

  • Full Member
  • ***
  • Posts: 189
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #16 on: April 14, 2026, 10:36:11 am »
Quote
So the anding could be skipped.

Yes, ((Temp2 shl 32) or (MulLo and $FFFFFFFF)) can be used safely once the checks are switched off.
Remark: I had already avoided (a and $FFFFFFFF) in the arithmetic operations, because typecast operands are faster there. That optimization issue occurs only in arithmetic expressions, though, not in others (bitwise or).
« Last Edit: April 14, 2026, 10:38:42 am by nanobit »

Thaddy

  • Hero Member
  • *****
  • Posts: 19155
  • Glad to be alive.
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #17 on: April 14, 2026, 10:45:16 am »
Besides, turning R and Q off is cross-platform safe.
objects are fine constructs. You can even initialize them with constructors.

schuler

  • Sr. Member
  • ****
  • Posts: 337
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #18 on: April 14, 2026, 11:53:28 am »
I will commit nanobit's solution.

MathMan

  • Hero Member
  • *****
  • Posts: 504
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #19 on: April 14, 2026, 12:13:50 pm »
@schuler,

here is my understanding of the use of Fused Multiply Accumulate (FMA). It is a bit lengthy, please bear with me.

TL;DR - the functions pcr_fmaf and pcr_fma have to be replaced. There seems to be no way to avoid a fallback option based on bit-wizardry.

9. Pascal CORE-MATH currently uses the following implementation to make FMA3 instructions accessible under x86-64:

Code: Pascal
function pcr_fmaf(x, y, z: Single): Single;
{$IFDEF CPUX86_64}
// Pure-asm: System V AMD64 ABI passes x→xmm0, y→xmm1, z→xmm2; result in xmm0.
// VFMADD213SS: xmm0 = xmm0 * xmm1 + xmm2  (correctly rounded IEEE 754 FMA).
assembler;
asm
  vfmadd213ss xmm0, xmm1, xmm2
end;
{$ELSE}
begin
  // 80-bit fallback: correctly rounded for singles (Extended has enough mantissa bits).
  Result := Single(Extended(x) * Extended(y) + Extended(z));
end;
{$ENDIF}

function pcr_fma(x, y, z: Double): Double;
{$IFDEF CPUX86_64}
// Pure-asm: System V AMD64 ABI passes x→xmm0, y→xmm1, z→xmm2; result in xmm0.
// VFMADD213SD: xmm0 = xmm0 * xmm1 + xmm2  (correctly rounded IEEE 754 FMA).
assembler;
asm
  vfmadd213sd xmm0, xmm1, xmm2
end;
{$ELSE}
begin
  // 80-bit fallback (double-rounding — not true FMA; may lose 1 ULP in rare cases).
  Result := Double(Extended(x) * Extended(y) + Extended(z));
end;
{$ENDIF}

9.1. The asm variant depends only on {$IFDEF CPUX86_64}, which is not a sufficient condition. There are x86-64 CPUs that do not support FMA3 - e.g. AMD Bulldozer - where this will lead to illegal instruction exceptions.
9.1.1. On x86-64 at least all CPUs that support AVX2 also support FMA3. There is a small set of CPUs that support FMA3 without supporting AVX2 - e.g. AMD Steamroller.
9.1.2. Afaik FPC 3.2.2 has no reliable internal detection mechanism for support of FMA on different CPU architectures (x86-64, ARM, etc.) or AVX2 on x86-64 - e.g. via a publicized run-time define from the compiler.
9.1.3. Suggestion: at least make the asm variant depend on {$IFDEF AVX2} where the define has to be provided via command line '-dAVX2'.
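
A minimal sketch of the gating suggested in 9.1.3, assuming the user passes -dAVX2 on the command line. The helper symbol PCR_USE_FMA3 is hypothetical, and the non-asm branch is only a placeholder so the sketch compiles everywhere; it is not a correctly rounded FMA.

```pascal
program FmaGateSketch;
{$mode objfpc}

// PCR_USE_FMA3 is a hypothetical helper symbol. AVX2 is not set by the
// compiler here; it has to be supplied by the user via -dAVX2.
{$IF DEFINED(CPUX86_64) AND DEFINED(AVX2)}
  {$DEFINE PCR_USE_FMA3}
{$ENDIF}

function pcr_fmaf(x, y, z: Single): Single;
{$IFDEF PCR_USE_FMA3}
assembler;
asm
  vfmadd213ss xmm0, xmm1, xmm2
end;
{$ELSE}
begin
  // Placeholder only: multiply and add are rounded separately here.
  Result := x * y + z;
end;
{$ENDIF}

begin
  {$IFDEF PCR_USE_FMA3}
  WriteLn('FMA3 asm path selected at compile time');
  {$ELSE}
  WriteLn('fallback path selected at compile time');
  {$ENDIF}
  WriteLn(pcr_fmaf(2.0, 3.0, 1.0):0:1);
end.
```

This only moves the decision to the person building the code. A binary built with -dAVX2 would still fault on a CPU without FMA3, so the define has to match the machines the binary is meant to run on.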

9.2. The fallback variant uses a double rounding approach via the Extended type - this is not going to work cross-platform, not even for the target architecture x86-64.
9.2.1. Programmer’s Guide for Free Pascal, Version 3.2.2 states:

Quote
Extended
For Intel 80x86 processors, the extended type takes up 10 bytes of memory space. For more
information on the extended type, consult the Intel Programmer’s reference.
For all other processors which support floating point operations, the extended type is a nickname
for the type which supports the most precision, this is usually the double type. On processors
which do not support co-processor operations (and which have the {$E+} switch), the extended
type usually maps to the single type.

On Win64 in particular, the type Extended is an alias for Double, leaving zero spare mantissa bits for pcr_fma. On some other platforms this even holds for pcr_fmaf.
9.2.2. Due to the above the fallback alternative must be implemented via bit-wizardry.
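
A small sketch for checking what Extended maps to on a given target. It assumes the compiler-provided symbol FPC_HAS_TYPE_EXTENDED behaves as it does in the RTL sources; the function name ExtendedIsWiderThanDouble is hypothetical.

```pascal
program ExtendedCheckSketch;
{$mode objfpc}

// Hypothetical helper: True when Extended occupies more storage than Double,
// i.e. when it is not simply an alias of Double on this target.
function ExtendedIsWiderThanDouble: Boolean;
begin
  Result := SizeOf(Extended) > SizeOf(Double);
end;

begin
  {$IFDEF FPC_HAS_TYPE_EXTENDED}
  WriteLn('Compiler reports a distinct Extended type');
  {$ELSE}
  WriteLn('Extended is mapped to a narrower type on this target');
  {$ENDIF}
  WriteLn('SizeOf(Extended) = ', SizeOf(Extended),
          ', SizeOf(Double) = ', SizeOf(Double));
  if not ExtendedIsWiderThanDouble then
    WriteLn('The Extended-based fallback gains no mantissa bits for pcr_fma here');
end.
```

On a Win64 build the last message would be expected to appear, which is the situation described in 9.2.1.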

9.3. Potential future option
9.3.1. The FPC 3.2.2 RTL exports the intrinsics FMASingle and FMADouble together with the run-time defines FPC_HAS_FAST_FMA_SINGLE and FPC_HAS_FAST_FMA_DOUBLE. The intrinsics are marked as 'Do not use' in the runtime manual. Though not reliable yet, this might become available in future FPC versions. If so, the asm variant could be replaced, and the pcr_... functions could then be fully inlined by the compiler.
9.3.2. I could convince FPC 3.2.2 to compile the intrinsics with the settings '-CfAVX2 -CpCOREAVX2 -OpCOREAVX2'. I replaced the pcr_... functions with the equivalent intrinsics and checked the generated object code. A quick run of the benchmark utility yielded an unchanged 'GlobalSink' value. However, I could not convince FPC 3.3.1 to do the same, even though I tried several target architecture settings.
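
A sketch of how the intrinsic and its define might be combined, using only the names mentioned in 9.3.1. The operand order x * y + z is an assumption, the intrinsic is documented as 'Do not use', and the non-intrinsic branch is a placeholder rather than a correctly rounded FMA.

```pascal
program FmaIntrinsicSketch;
{$mode objfpc}

function pcr_fmaf(x, y, z: Single): Single; inline;
begin
  {$IFDEF FPC_HAS_FAST_FMA_SINGLE}
  // Assumption: FMASingle(a, b, c) computes a * b + c with one rounding.
  Result := FMASingle(x, y, z);
  {$ELSE}
  // Placeholder only: multiply and add are rounded separately here.
  Result := x * y + z;
  {$ENDIF}
end;

begin
  {$IFDEF FPC_HAS_FAST_FMA_SINGLE}
  WriteLn('intrinsic path');
  {$ELSE}
  WriteLn('placeholder path');
  {$ENDIF}
  WriteLn(pcr_fmaf(2.0, 3.0, 1.0):0:1);
end.
```

Without the target settings from 9.3.2 the define is presumably not set, so the placeholder branch is the one that gets compiled.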

I would really like feedback from the FPC core devs on whether my summary above is correct, and especially on whether my envisioned co-operation of intrinsics and defines is the way to go.

gidesa

  • Full Member
  • ***
  • Posts: 248
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #20 on: April 14, 2026, 12:46:31 pm »
I obtained these results, hope this helps.

Code:
AMD Ryzen 7
Free Pascal Compiler version 3.3.1-17136-g3b7d9956ca-dirty [2024/12/27] for x86_64
Optimization options: -O4   -CfFMA -CpZEN3 -OpZEN3

Windows 11 64 bit:

Code:
=== FPC vs Pascal CORE-MATH (PCM) Benchmark: 50000000 calls per function ===

sinf              FPC:   11,1 Mops/s  PCM:   58,3 Mops/s  FASTER! YAY!
cosf              FPC:   10,9 Mops/s  PCM:   58,3 Mops/s  FASTER! YAY!
tanf              FPC:   27,5 Mops/s  PCM:   56,3 Mops/s  FASTER! YAY!
asinf             FPC:  128,2 Mops/s  PCM:  253,8 Mops/s  FASTER! YAY!
acosf             FPC:  104,8 Mops/s  PCM:  219,3 Mops/s  FASTER! YAY!
atanf             FPC:  163,9 Mops/s  PCM:  185,9 Mops/s  FASTER! YAY!
sinhf             FPC:  153,8 Mops/s  PCM:  160,8 Mops/s  TIE
coshf             FPC:  180,5 Mops/s  PCM:  188,0 Mops/s  TIE
tanhf             FPC:   89,4 Mops/s  PCM:  206,6 Mops/s  FASTER! YAY!
asinhf            FPC:   31,0 Mops/s  PCM:  123,8 Mops/s  FASTER! YAY!
acoshf            FPC:   82,4 Mops/s  PCM:  196,1 Mops/s  FASTER! YAY!
atanhf            FPC:   31,9 Mops/s  PCM:  237,0 Mops/s  FASTER! YAY!
expf              FPC:  165,0 Mops/s  PCM:  257,7 Mops/s  FASTER! YAY!
logf              FPC:   55,7 Mops/s  PCM:  200,8 Mops/s  FASTER! YAY!
log2f             FPC:   52,8 Mops/s  PCM:  227,3 Mops/s  FASTER! YAY!
log10f            FPC:   52,1 Mops/s  PCM:  223,2 Mops/s  FASTER! YAY!
atan2f            FPC:  151,1 Mops/s  PCM:  104,6 Mops/s
hypotf            FPC:  253,8 Mops/s  PCM:  139,7 Mops/s
powf              FPC:   31,5 Mops/s  PCM:   52,9 Mops/s  FASTER! YAY!
sincosf           FPC:   22,0 Mops/s  PCM:   54,5 Mops/s  FASTER! YAY!

PCM won: 16  |  FPC won: 2  |  Ties (<5%): 2
On average, PCM is 2,8x faster than FPC (arithmetic mean over 19 functions)
GlobalSink = 2297463453 (prevents dead-code elimination)

Ubuntu 24.04 64 bit (WSL embedded into Windows):

Code:
=== FPC vs Pascal CORE-MATH (PCM) Benchmark: 50000000 calls per function ===

sinf              FPC:   29.8 Mops/s  PCM:   48.2 Mops/s  FASTER! YAY!
cosf              FPC:   47.5 Mops/s  PCM:   46.9 Mops/s  TIE
tanf              FPC:   29.2 Mops/s  PCM:   46.9 Mops/s  FASTER! YAY!
asinf             FPC:   19.3 Mops/s  PCM:  266.0 Mops/s  FASTER! YAY!
acosf             FPC:   21.5 Mops/s  PCM:  196.9 Mops/s  FASTER! YAY!
atanf             FPC:   20.0 Mops/s  PCM:  177.3 Mops/s  FASTER! YAY!
sinhf             FPC:   53.5 Mops/s  PCM:  152.0 Mops/s  FASTER! YAY!
coshf             FPC:   49.0 Mops/s  PCM:  176.1 Mops/s  FASTER! YAY!
tanhf             FPC:   31.3 Mops/s  PCM:  180.5 Mops/s  FASTER! YAY!
asinhf            FPC:   16.2 Mops/s  PCM:  104.4 Mops/s  FASTER! YAY!
acoshf            FPC:   27.4 Mops/s  PCM:  211.9 Mops/s  FASTER! YAY!
atanhf            FPC:   13.1 Mops/s  PCM:  241.5 Mops/s  FASTER! YAY!
expf              FPC:   42.1 Mops/s  PCM:  151.1 Mops/s  FASTER! YAY!
logf              FPC:   50.0 Mops/s  PCM:  154.3 Mops/s  FASTER! YAY!
log2f             FPC:   35.0 Mops/s  PCM:  204.1 Mops/s  FASTER! YAY!
log10f            FPC:   33.2 Mops/s  PCM:  207.5 Mops/s  FASTER! YAY!
atan2f            FPC:   17.0 Mops/s  PCM:   97.5 Mops/s  FASTER! YAY!
hypotf            FPC:   33.1 Mops/s  PCM:  114.2 Mops/s  FASTER! YAY!
powf              FPC:   11.9 Mops/s  PCM:   37.4 Mops/s  FASTER! YAY!
sincosf           FPC:   21.6 Mops/s  PCM:   48.6 Mops/s  FASTER! YAY!

PCM won: 19  |  FPC won: 0  |  Ties (<5%): 1
On average, PCM is 5.9x faster than FPC (arithmetic mean over 19 functions)
GlobalSink = 2862567869 (prevents dead-code elimination)
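
The GlobalSink line in these listings refers to a common benchmarking pattern: every result is folded into a global value that is printed at the end, so the compiler cannot discard the measured calls. The sketch below only illustrates that pattern; Candidate, MeasureMops and the way results are folded are hypothetical and not taken from the project's benchmark utility.

```pascal
program SinkBenchSketch;
{$mode objfpc}
uses
  SysUtils;

const
  Calls = 1000000;  // illustrative; the runs above used 50000000 calls

var
  GlobalSink: DWord = 0;

// Hypothetical stand-in for the function being measured.
function Candidate(x: Single): Single;
begin
  Result := x * 0.5 + 1.0;
end;

// Times Calls invocations of Candidate and returns millions of calls per second.
// Every result is folded into GlobalSink so the calls have a visible effect.
function MeasureMops: Double;
var
  i: Integer;
  r: Single;
  t0, ms: QWord;
begin
  t0 := GetTickCount64;
  for i := 1 to Calls do
  begin
    r := Candidate(i);
    GlobalSink := GlobalSink xor PDWord(@r)^;  // reinterpret the bits of r
  end;
  ms := GetTickCount64 - t0;
  if ms = 0 then
    Result := 0.0   // run was too short for the millisecond timer
  else
    Result := (Calls / 1.0e6) / (ms / 1000.0);
end;

begin
  WriteLn('Candidate: ', MeasureMops:0:1, ' Mops/s');
  WriteLn('GlobalSink = ', GlobalSink);
end.
```

An unchanged sink value between two builds, as mentioned in 9.3.2, is then a quick hint that both builds produced the same results.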

Thaddy

  • Hero Member
  • *****
  • Posts: 19155
  • Glad to be alive.
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #21 on: April 14, 2026, 01:27:37 pm »
Small question: is logf the same as ln? I have patch code to patch the operators sin, cos, exp and ln in system, but not for math (that does not work yet)

MathMan

  • Hero Member
  • *****
  • Posts: 504
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #22 on: April 14, 2026, 01:54:04 pm »
Quote
Small question: is logf the same as ln? I have patch code to patch the operators sin, cos, exp and ln in system, but not for math (that does not work yet)

It is - single precision is indicated by the trailing 'f'.

Thaddy

  • Hero Member
  • *****
  • Posts: 19155
  • Glad to be alive.
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #23 on: April 14, 2026, 06:06:51 pm »
I can only patch those against extended. I have to run tests to see whether it is still faster.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12851
  • FPC developer.
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #24 on: April 14, 2026, 06:15:17 pm »
Retry win64 with

{$excessprecision off}. Though that might not help for RTL routines, just common arithmetic.

schuler

  • Sr. Member
  • ****
  • Posts: 337
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #25 on: April 14, 2026, 08:31:32 pm »
@mathman,
Quote
9.1.3. Suggestion: at least make the asm variant depend on {$IFDEF AVX2} where the define has to be provided via command line '-dAVX2'.

I've just pushed this change to git.


MathMan

  • Hero Member
  • *****
  • Posts: 504
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #26 on: April 14, 2026, 09:19:50 pm »
@schuler,

I'm looking into bug B from 'tasklist.md' and am a bit at odds with the analysis.

First it states

Quote
Root cause (suspected): pcr_fma is a double-rounding approximation (Double(Extended(x)*Extended(y)+Extended(z))), not a true IEEE FMA. When FPC inlines it, the Double(...) cast may not force a spill to memory, leaving intermediate values at 80-bit extended precision. This excess precision in z_pf (the reduced argument) or h_pf propagates through the exponential polynomial and shifts rr_pf to the wrong side of the float32 midpoint.

But in your environment the asm path is taken in pcr_fma. That one uses correctly rounded FMA3, yet the error persists (at least as per the info in 'tasklist.md').

Then there is

Quote
Option 2: Force intermediate variables to be stored to memory (defeating x87 register caching) by adding volatile-style stores — but FPC has no standard mechanism for this.

This is incorrect - a spill to memory can be provoked in FPC like this:

Code: Pascal
pTypeOfResult( @Result )^ := function( whatever );
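
A concrete instance of that pattern for a Double result, as a sketch. The function MulAddStored is hypothetical; whether the store really removes excess precision on a given target and optimization level would still have to be checked in the generated code.

```pascal
program SpillSketch;
{$mode objfpc}

// Hypothetical example: the result is written through a typed pointer,
// so the value is stored to Result's memory as a Double.
function MulAddStored(x, y, z: Double): Double;
begin
  PDouble(@Result)^ := x * y + z;
end;

begin
  WriteLn(MulAddStored(2.0, 3.0, 1.0):0:1);
end.
```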

Imho the error is introduced somewhere else - do you agree?

Before I continue my analysis - did you do further investigation on your side, that has not been publicised yet?

schuler

  • Sr. Member
  • ****
  • Posts: 337
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #27 on: April 15, 2026, 02:17:10 am »
@mathman,
I agree with you. At least a portion of the compoundf problem is unrelated to FMA.

I was able to fix "powf" and made progress on "compoundf" in the branch "a3", which stands for attempt 3:
https://github.com/joaopauloschuler/pas-core-math/tree/a3

nanobit

  • Full Member
  • ***
  • Posts: 189
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #28 on: April 15, 2026, 06:14:30 am »
Quote
I will commit nanobit's solution.

Minor optimization: It's safe to remove one mask. I've updated my first post.
(Reason: the anticipated maximal native size (64) will not exceed the size (64) of temp2 and target).

Thaddy

  • Hero Member
  • *****
  • Posts: 19155
  • Glad to be alive.
Re: AI assisted translation of CORE-MATH to Free Pascal
« Reply #29 on: April 15, 2026, 06:54:14 am »
I wrote my own interface to the C lib. Maybe the way I solved the generics part by using overloads is of interest, so I attach it. It is recommended to compile the coremath library yourself, because there were some linker issues with the standard binary (COFF related).
Of course I understand that a full native Pascal version is preferred. It is just about the generic part.
