Recent

Author Topic: Single / Double / Float speed on x86_64  (Read 1850 times)

Schmitty2005

  • Jr. Member
  • **
  • Posts: 52
Single / Double / Float speed on x86_64
« on: December 11, 2025, 07:54:10 pm »
Based of off this conversation : https://forum.lazarus.freepascal.org/index.php/topic,72887.0.html

Single precision  takes the same amount of time as double precision when doing Sine, Cos, Tanh, and likley other trig functions.  Using the math lib from C, user tetrastes  sped it  up tremendously ( about 5 times ) .  I was looking at the Math unit code in the repository, and found that the x87 calls are used for the x86_64 CPU.    .  https://github.com/fpc/FPCSource/blob/49c9a1f3230f095cdbe3d88687367bf927389711/rtl/x86_64/math.inc#L233  Is there a better way to handle the x86_64 functions ? Is there a way to make a separate call function for singles to gain that speed for those who want it?

I would think that if someone is using 'singles' they don't care about precision and want speed.  'Singles' are taking just as long to calculate as 'double' precision. At that point, why bother with singles ?

I am just a hobbyist.  Are there technical reasons that this is so slow and has to remain that way?  I understand a change could break old code.  What about a compiler switch ?    Can this be changed to make the FPC compiler better and faster when using single precision  ?

Maybe this is too complex to fix  and would require too much of a rewrite ?

What is the best fast math unit that is maintained well enough and works on linux and windows  ?

BTW : Delphi is slow  with single precision too.



« Last Edit: December 11, 2025, 08:12:01 pm by Schmitty2005 »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: Single / Double / Float speed on x86_64
« Reply #1 on: December 11, 2025, 09:24:14 pm »
You do have {$excessprecision off} near the top of your code?

https://www.freepascal.org/daily/doc/prog/progsu19.html

Schmitty2005

  • Jr. Member
  • **
  • Posts: 52
Re: Single / Double / Float speed on x86_64
« Reply #2 on: December 11, 2025, 10:39:06 pm »
With this code

Code: Pascal  [Select][+][-]
  1. {$MODE DELPHI}
  2. {$EXCESSPRECISION OFF}
  3. Program fpccpuload;
  4.  
  5.     Uses Math;
  6.  
  7.     Const
  8.       MAX = 99999999;
  9.  
  10.     Var
  11.       x : Integer;
  12.       c : single;
  13.     Begin
  14.       x := 0;
  15.       While x < MAX Do
  16.         Begin
  17.           c := sin(cos(tanh(x)));
  18.           inc(x);
  19.         End;
  20.     End.
  21.  

Regardless of single or  double with  {$excessprecision on} or {$excessprecision off}

On my system:

Code: [Select]
real 0m8.245s
user 0m8.241s
sys 0m0.006s

I have also tried various optimizations, and others have as well in the link to the other discussion.  The speed increase from optimizations are negligible.




srvaldez

  • Full Member
  • ***
  • Posts: 190
Re: Single / Double / Float speed on x86_64
« Reply #3 on: December 12, 2025, 12:09:40 am »
hello Schmitty2005
do you have specific task where you need to compute trig and exp functions very fast?
perhaps with limited range?
have a look at this pdf https://shibata.naist.jp/~n-sibata/pdfs/isc10simd.pdf
it has some good info, I may post some code later

srvaldez

  • Full Member
  • ***
  • Posts: 190
Re: Single / Double / Float speed on x86_64
« Reply #4 on: December 12, 2025, 01:50:56 am »
here's the C code for the SinCos function with a test
Code: C  [Select][+][-]
  1. #include <stdlib.h>
  2. #include <stdio.h>
  3. #include <math.h>
  4. #include <time.h>
  5. #include <Windows.h>
  6.  
  7. double my_timer() {
  8.         LARGE_INTEGER la={ {0, 0} };
  9.         double t;
  10.         QueryPerformanceCounter(&la);
  11.         t = (double)la.QuadPart/10000000.0;
  12.         return t;
  13. }
  14.  
  15. typedef struct sincos_t { double s, c; } sincos_t ;
  16.  
  17. #define N 3
  18.  
  19. sincos_t xsincos0(double s) { // 0 ≤ s < π/4
  20.         int i ;
  21.         // Argument reduction
  22.         s = s * pow(2, -N);
  23.         s = s*s; // Evaluating Taylor series
  24.         s = ((((s/1814400 - 1.0/20160)*s + 1.0/360)*s - 1.0/12)*s + 1)*s;
  25.         for ( i=0;i<N; i++) { s = (4-s) * s; } // Applying double angle formula
  26.         s = s / 2;
  27.         sincos_t sc;
  28.         sc.s = sqrt((2-s)*s ); sc.c = 1 - s;
  29.         return sc ;
  30. }
  31.  
  32. #define PI4_A .7853981554508209228515625 // π/4 split into three parts
  33. #define PI4_B .794662735614792836713604629039764404296875e-8
  34. #define PI4_C .306161699786838294306516483068750264552437361480769e-16
  35. // 4/π
  36. #define M_4_PI 1.273239544735162542821171882678754627704620361328125
  37.  
  38. sincos_t xsincos(double d) {
  39.         double s = fabs(d);
  40.         int q = (int)(s * M_4_PI), r = q + (q & 1);
  41.         s -= r * PI4_A; s -= r * PI4_B; s -= r * PI4_C;
  42.         sincos_t sc = xsincos0(s);
  43.         if (((q + 1) & 2) != 0) { s = sc.c; sc.c = sc.s; sc.s = s; }
  44.         if (((q & 4) != 0) != (d < 0)) sc.s = -sc.s;
  45.         if (((q + 2) & 4) != 0) sc.c = -sc.c;
  46.         return sc ;
  47. }
  48.  
  49. double xsin(double d) { sincos_t sc = xsincos(d); return sc.s; }
  50. double xcos(double d) { sincos_t sc = xsincos(d); return sc.c; }
  51. double xtan(double d) { sincos_t sc = xsincos(d); return sc.s / sc.c; }
  52.  
  53. int main(void)
  54. {
  55.         double y, s, t;
  56.         long i, j;
  57.         t=my_timer();
  58.         s=0;
  59.         for (j=1; j<=1000; j++)
  60.         {
  61.                 for (i=-10000; i<=10000; i++)
  62.                 {
  63.                         y=sin(i);
  64.                         s+=y;
  65.                 }
  66.         }
  67.         t=my_timer()-t;
  68.         printf("sum of sin from -10000 to 10000 = %.15g\n", s);
  69.         printf("elapsed time %.15g seconds\n\n", t);
  70.        
  71.         t=my_timer();
  72.         s=0;
  73.         for (j=1; j<=1000; j++)
  74.         {
  75.                 for (i=-10000; i<=10000; i++)
  76.                 {
  77.                         y=xsin(i);
  78.                         s+=y;
  79.                 }
  80.         }
  81.         t=my_timer()-t;
  82.         printf("sum of xsin from -10000 to 10000 = %.15g\n", s);
  83.         printf("elapsed time %.15g seconds\n", t);
  84.         return 0;
  85. }
  86.  

output
sum of sin from -10000 to 10000 = -1.33221211839896e-12
elapsed time 0.654519799980335 seconds

sum of xsin from -10000 to 10000 = -1.77641235055148e-12
elapsed time 0.149988500052132 seconds

unfortunately FP does not fare so well

SinCosCalc.pas
Code: Pascal  [Select][+][-]
  1. unit SinCosCalc;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.   Math;
  9.  
  10. const
  11.   N = 3;
  12.   // π/4 split into three parts
  13.   PI4_A = 0.7853981554508209228515625;
  14.   PI4_B = 0.794662735614792836713604629039764404296875e-8;
  15.   PI4_C = 0.306161699786838294306516483068750264552437361480769e-16;
  16.   // 4/π
  17.   M_4_PI = 1.273239544735162542821171882678754627704620361328125;
  18.  
  19. type
  20.   sincos_t = record
  21.     s: Double;
  22.     c: Double;
  23.   end;
  24.  
  25. function xsincos0(s: Double): sincos_t;
  26. function xsincos(d: Double): sincos_t;
  27. function xsin(d: Double): Double;
  28. function xcos(d: Double): Double;
  29. function xtan(d: Double): Double;
  30.  
  31. implementation
  32.  
  33. function xsincos0(s: Double): sincos_t; // 0 ≤ s < π/4
  34. var
  35.   i: Integer;
  36.   sc: sincos_t;
  37. begin
  38.   // Argument reduction
  39.   s := s * Power(2, -N);
  40.   s := s * s; // Evaluating Taylor series
  41.   s := ((((s / 1814400 - 1.0 / 20160) * s + 1.0 / 360) * s - 1.0 / 12) * s + 1) * s;
  42.   for i := 0 to N - 1 do // Applying double angle formula
  43.   begin
  44.     s := (4 - s) * s;
  45.   end;
  46.   s := s / 2;
  47.   sc.s := Sqrt((2 - s) * s);
  48.   sc.c := 1 - s;
  49.   Result := sc;
  50. end;
  51.  
  52.  
  53. function xsincos(d: Double): sincos_t;
  54. var
  55.   s, temp: Double;
  56.   q, r: Integer;
  57.   sc: sincos_t;
  58. begin
  59.   s := Abs(d);
  60.   q := Trunc(s * M_4_PI);
  61.   r := q + (q and 1);
  62.   s := s - r * PI4_A;
  63.   s := s - r * PI4_B;
  64.   s := s - r * PI4_C;
  65.   sc := xsincos0(s);
  66.   if (((q + 1) and 2) <> 0) then
  67.   begin
  68.     temp := sc.c;
  69.     sc.c := sc.s;
  70.     sc.s := temp;
  71.   end;
  72.   if (((q and 4) <> 0) <> (d < 0)) then
  73.     sc.s := -sc.s;
  74.   if (((q + 2) and 4) <> 0) then
  75.     sc.c := -sc.c;
  76.   Result := sc;
  77. end;
  78.  
  79. function xsin(d: Double): Double;
  80. var
  81.   sc: sincos_t;
  82. begin
  83.   sc := xsincos(d);
  84.   Result := sc.s;
  85. end;
  86.  
  87. function xcos(d: Double): Double;
  88. var
  89.   sc: sincos_t;
  90. begin
  91.   sc := xsincos(d);
  92.   Result := sc.c;
  93. end;
  94.  
  95. function xtan(d: Double): Double;
  96. var
  97.   sc: sincos_t;
  98. begin
  99.   sc := xsincos(d);
  100.   Result := sc.s / sc.c;
  101. end;
  102.  
  103. end.
  104.  

TestSinCos.pas
Code: Pascal  [Select][+][-]
  1. program TestSinCos;
  2. {$mode objfpc}{$H+}
  3.  
  4. uses sysutils, windows, math, SinCosCalc;
  5. var
  6.   y, s: Double;
  7.   time1,time2:Double;
  8.   i, j : int32;
  9. begin
  10.     time1:=now;
  11.     s:=0;
  12.     for j:=1 to 1000 do
  13.     begin
  14.         for i:=-10000 to 10000 do
  15.         begin
  16.             y:=sin(i);
  17.             s:=s+y;
  18.         end;
  19.     end;
  20.     time2:=now;
  21.     WriteLn('Sum of sin in the range of -10000 and 10000 = ', s);
  22.     Writeln('time taken ',((time2-time1)*86400):12:10,' second.');
  23.     Writeln;
  24.     time1:=now;
  25.     s:=0;
  26.     for j:=1 to 1000 do
  27.     begin
  28.         for i:=-10000 to 10000 do
  29.         begin
  30.             y:=xsin(i);
  31.             s:=s+y;
  32.         end;
  33.     end;
  34.     time2:=now;
  35.     WriteLn('Sum of xsin in the range of -10000 and 10000 = ', s);
  36.     Writeln('time taken ',((time2-time1)*86400):12:10,' second.');
  37. end.
  38.  

output
Sum of sin in the range of -10000 and 10000 = -7.1054828687522331E-012
time taken 0.5750000244 second.

Sum of xsin in the range of -10000 and 10000 =  8.8806739739766272E-013
time taken 0.7619998651 second.

The C xsin is over 4 times faster than the Crt lib
« Last Edit: December 12, 2025, 02:08:06 am by srvaldez »

Khrys

  • Sr. Member
  • ****
  • Posts: 391
Re: Single / Double / Float speed on x86_64
« Reply #5 on: December 12, 2025, 07:12:43 am »
At that point, why bother with singles ?

Memory usage!
When doing 3D graphics, you want to minimize the amount of data being sent to the GPU, and it's not like the extra precision would make a visual difference anyways.

Machine learning researchers take this to yet another level by using 16-bit floats, though at that point the format really does become unusable for general-purpose calculations. For example, here are some consecutive (!)  bfloat16  values:

Value        - Exponent Mantissa
1.0          0 01111111 0000000
1.0078125    0 01111111 0000001
1.015625     0 01111111 0000010
1.0234375    0 01111111 0000011
1.03125      0 01111111 0000100

tetrastes

  • Hero Member
  • *****
  • Posts: 739
Re: Single / Double / Float speed on x86_64
« Reply #6 on: December 12, 2025, 11:01:15 am »
The C xsin is over 4 times faster than the Crt lib

What C compiler are you using?

gcc version 14.2.0 (x86_64-win32-seh-rev2, Built by MinGW-Builds project) using ucrt:

without optimization
sum of sin from -10000 to 10000 = -3.55265816764927e-12
elapsed time 0.302194199999576 seconds

sum of xsin from -10000 to 10000 = -1.33232314070142e-12
elapsed time 1.13876749999963 seconds

with -O3
sum of sin from -10000 to 10000 = -3.55265816764927e-12
elapsed time 0.249794700000166 seconds

sum of xsin from -10000 to 10000 = -1.33232314070142e-12
elapsed time 0.278383999999278 seconds

Microsoft (R) C/C++ Optimizing Compiler Version 19.41.34120 for x64 (VS 2022 v.17.11.3):

without optimization
sum of sin from -10000 to 10000 = -3.55265816764927e-12
elapsed time 0.49480290000065 seconds

sum of xsin from -10000 to 10000 = -1.33232314070142e-12
elapsed time 2.84119210000063 seconds

with /O2
sum of sin from -10000 to 10000 = -3.55265816764927e-12
elapsed time 0.253984400000263 seconds

sum of xsin from -10000 to 10000 = -1.33232314070142e-12
elapsed time 1.47731719999865 seconds

srvaldez

  • Full Member
  • ***
  • Posts: 190
Re: Single / Double / Float speed on x86_64
« Reply #7 on: December 12, 2025, 12:50:42 pm »
The C xsin is over 4 times faster than the Crt lib

What C compiler are you using?
gcc version 14.2.0 (MinGW-W64 x86_64-msvcrt-posix-seh, built by Brecht Sanders, r2)
gcc -static -Wall -O3 -march=native TestSinCos.c -o TestSinCos

Processor   Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: Single / Double / Float speed on x86_64
« Reply #8 on: December 12, 2025, 12:54:49 pm »
There are multiple different aspects.

Probably $excessprecision is only for generated code, not for library code, and means that singles are not always promoted to doubles.

However the sin/cos stuff touches libraries not having overloads for single. That might be fixed in trunk, I don't know, it's been a while that I played with such options.

Besides that there is also the normal vs fastmath way of doing things. Fastmath is an optimization common in C compilers to have results less precise but faster.  Afaik precise still means copro use, as per standard, and that is what FPC and Delphi do.

FPC currently has no solution for that. Fastmath  is sometimes enabled with higher -O. But how and with which optimization level it is enabled exactly in C is afaik compiler and architecture dependent. Fastmath is common, but not standarized afaik.

So this whole topic is multi facetted.

  • General promotion of single to double {$excessprecision off}
  • Generally only having double/extended functions in math for non elemental operations (maybe (partially) fixed in trunk)
  • Having fastmath or not (using SSE approx for operations, FPC has an option -Ofastmath, but that is only for generated code, not for library code in assembler)


tetrastes

  • Hero Member
  • *****
  • Posts: 739
Re: Single / Double / Float speed on x86_64
« Reply #9 on: December 12, 2025, 01:56:12 pm »
The C xsin is over 4 times faster than the Crt lib

What C compiler are you using?
gcc version 14.2.0 (MinGW-W64 x86_64-msvcrt-posix-seh, built by Brecht Sanders, r2)
gcc -static -Wall -O3 -march=native TestSinCos.c -o TestSinCos

Processor   Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)

gcc for win64 using MSVCRT as runtime library generates very inefficient x87 code for floating point, instead of using functions from CRT, which are much faster for x64.

LV

  • Sr. Member
  • ****
  • Posts: 411
Re: Single / Double / Float speed on x86_64
« Reply #10 on: December 12, 2025, 03:04:18 pm »
I was passing by 🤔

12th Gen Intel(R) Core(TM) i7-12700H (2.30 GHz)

Code: Text  [Select][+][-]
  1. ------------------------
  2. gcc 14.2.0 -o3 -fopenmp
  3. ------------------------
  4. omp_set_num_threads = 1
  5.                              Time = 0.919000 sec
  6. Sum = -0.000000000017208
  7.  
  8. omp_set_num_threads = 20
  9.                              Time = 0.699000 sec
  10. Sum = -0.000000000038277
  11.  
  12. ------------------------
  13. fpc 3.2.2 -o3
  14. ------------------------
  15. program ParallelSin;
  16. ........................
  17. THREAD_COUNT = 1
  18.                              Time = 4.832000 sec
  19. Sum = -0.000000000071054
  20.  
  21. THREAD_COUNT = 20
  22.                              Time = 0.440000 sec
  23. Sum = 0.000000033305696
  24.  
  25. program ParallelSinFast;
  26. ........................
  27. THREAD_COUNT = 1
  28.                              Time = 1.813000 sec
  29. Sum = -0.000000000106582
  30.  
  31. THREAD_COUNT = 20
  32.                              Time = 0.127000 sec
  33. Sum = -0.000000140812062
  34.  

P.S.
Code: Text  [Select][+][-]
  1.   N1 = -10000;
  2.   N2 = 10000;
  3.   LOOPS = 10000;
  4.  
« Last Edit: December 12, 2025, 03:16:50 pm by LV »

LV

  • Sr. Member
  • ****
  • Posts: 411
Re: Single / Double / Float speed on x86_64
« Reply #11 on: December 12, 2025, 04:38:38 pm »
In multithreaded Pascal code, you can observe a decrease in precision. To address this, I implemented Kahan's algorithm, which adds minor overhead.

Code: Text  [Select][+][-]
  1. omp_set_num_threads = 20
  2. Time = 0.662000 sec
  3. Sum = -3.830e-11
  4.  
  5. THREAD_COUNT = 20
  6. Time = 0.149000 sec
  7. Sum = 4.547E-12
  8.  
  9.          Time_C_omp
  10. ————————————————————————————— = 4.442
  11. Time_Pascal_Parallel_KahanSum
  12.  

srvaldez

  • Full Member
  • ***
  • Posts: 190
Re: Single / Double / Float speed on x86_64
« Reply #12 on: December 12, 2025, 05:13:41 pm »
thanks LV  :)
using threads combined with open-mp is tricky  %) not for the faint of heart
I also compiled it using the Intel C compiler, without threads and without omp the time was about .7 seconds, with /Qiopenmp .1 seconds
but I am not comfortable using threads or omp
« Last Edit: December 12, 2025, 05:34:35 pm by srvaldez »

BrunoK

  • Hero Member
  • *****
  • Posts: 766
  • Retired programmer
Re: Single / Double / Float speed on x86_64
« Reply #13 on: December 12, 2025, 05:48:01 pm »
How can one compare an algorithm written in C giving checking values of
Quote
sum of sin from -10000 to 10000 = -1.33221211839896e-12
sum of xsin from -10000 to 10000 = -1.77641235055148e-12
and an algorithm in FPC giving a checking values of
Quote
Sum of sin in the range of -10000 and 10000 = -7.1054828687522331E-012
Sum of xsin in the range of -10000 and 10000 =  8.8806739739766272E-013

Obviously, the FPC algorithm does not do the same thing hence timing comparison is irrelevant.

srvaldez

  • Full Member
  • ***
  • Posts: 190
Re: Single / Double / Float speed on x86_64
« Reply #14 on: December 12, 2025, 05:59:47 pm »
in the case of the sin function, the sum from -N to N should be zero, so the closer to zero the more accurate

 

TinyPortal © 2005-2018