Recent

Author Topic: FFT performance  (Read 4308 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12536
  • FPC developer.
Re: FFT performance
« Reply #15 on: October 15, 2025, 01:47:15 pm »
32-bit trunk uses system.move()
64-bit trunk uses SSE registers for 16-byte at once moves.

Code: Pascal  [Select][+][-]
  1. # [14] s1 := s;
  2.         movdqu  U_$P$PROGRAM_$$_S(%rip),%xmm1
  3.         movdqu  U_$P$PROGRAM_$$_S+16(%rip),%xmm0
  4.         movdqu  %xmm1,U_$P$PROGRAM_$$_S1(%rip)
  5.         movdqu  %xmm0,U_$P$PROGRAM_$$_S1+16(%rip)
  6.         subl    $1,%eax
  7.         jne     .Lj3
  8.  


Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #16 on: October 15, 2025, 01:54:01 pm »
Ah, interesting. Maybe I'll install trunk and see how it behaves.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12536
  • FPC developer.
Re: FFT performance
« Reply #17 on: October 15, 2025, 01:57:35 pm »
Enabling AVX2 enables even using those:   (fpc -S2 -al -O4 -Cpcoreavx2 -Cfavx2  example.pas)


Code: Pascal  [Select][+][-]
  1. # [14] s1 := s;
  2.         vmovdqu U_$P$PROGRAM_$$_S(%rip),%ymm0
  3.         vmovdqu %ymm0,U_$P$PROGRAM_$$_S1(%rip)
  4.         subl    $1,%eax
  5.         jne     .Lj3
  6.  

jamie

  • Hero Member
  • *****
  • Posts: 7317
Re: FFT performance
« Reply #18 on: October 16, 2025, 02:01:59 am »
If you implement the Class Operator Copy. something like this.
Code: Pascal  [Select][+][-]
  1. Class operator Test.Copy(Constref A:Test;Var B:Test); Inline;
  2. Begin
  3.  B.im := A.im;
  4.  B.Re := A.Re;
  5. end;                      
  6.  

You get all Registers in use to instead of the System.Move.

However, I could not get that to INLINE so there is a stackless simple call to it.

Quote
004297C0 8B4808                   mov ecx,[eax+$08]
004297C3 894A08                   mov [edx+$08],ecx
004297C6 8B480C                   mov ecx,[eax+$0C]
004297C9 894A0C                   mov [edx+$0C],ecx
unit1.pas:36  B.Re := A.Re;
004297CC 8B08                     mov ecx,[eax]
004297CE 890A                     mov [edx],ecx
004297D0 8B4004                   mov eax,[eax+$04]
004297D3 894204                   mov [edx+$04],eax
004297D6 C3                       ret


The only true wisdom is knowing you know nothing

creaothceann

  • Full Member
  • ***
  • Posts: 205
Re: FFT performance
« Reply #19 on: October 16, 2025, 07:44:24 am »
However, I could not get that to inline

And that is where a macro with parameters would be useful...
Quote from: Thaddy
And don't start an argument, I am right.
Quote from: Thaddy
You have a thorough misunderstanding of what I wrote. Can you provide an example this time? I doubt it. (because you never do out of incompentence)

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #20 on: October 16, 2025, 10:22:04 am »
So I installed FPC trunk for win32. The textmode IDE has a few hiccups, but I got it running. I'm surprised - and pleased - to say it seems to generate significantly faster code than 3.2.4-rc1, for all my test cases from above. The dreaded REP MOVSL effect is completely gone. Test results below. Same settings -Mobjfpc -O1 -OpCOREAVX -CpCOREAVX -Si.

ALLIGATOR

  • Sr. Member
  • ****
  • Posts: 302
  • I use FPC [main] 💪🐯💪
Re: FFT performance
« Reply #21 on: October 16, 2025, 07:12:22 pm »
@Nitorami
Thank you for preparing the source code
I was able to reproduce the error. I tried to minimize the example, but I couldn't minimize it very much
And since this error reproduces on 3.2.4/3.2.2 but not on FPC [main], I will stop trying to minimize it, as it seems to have already been fixed (I hope) in FPC [main]
I may seem rude - please don't take it personally

Thaddy

  • Hero Member
  • *****
  • Posts: 18363
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #22 on: October 19, 2025, 06:46:22 pm »
Can't prepare it, but this code, inspired by fftreal, seems to be really fast:
Code: Pascal  [Select][+][-]
  1. program myfft;
  2. {$mode objfpc}{$I-}{$R-}{$R-}{$C-}
  3. uses
  4.   windows, sysutils, math;
  5. type
  6.   TComplex = packed record
  7.     Re, Im: Double;
  8.   end;
  9.   TComplexArray = array of TComplex;
  10.  
  11. function Complex(const Re, Im: Double): TComplex;inline;
  12. begin
  13.   Result.Re := Re;
  14.   Result.Im := Im;
  15. end;
  16.  
  17. operator + (const a,b:TComplex):TComplex;inline;
  18. begin
  19.   Result.Re := A.Re + B.Re;
  20.   Result.Im := A.Im + B.Im;
  21. end;
  22.  
  23.  
  24. operator - (const A,B:TComplex):TComplex;inline;
  25. begin
  26.   Result.Re := A.Re - B.Re;
  27.   Result.Im := A.Im - B.Im;
  28. end;
  29.  
  30.  
  31. operator * (const A,B:TComplex):TComplex;inline;
  32. begin
  33.   Result.Re := A.Re * B.Re - A.Im * B.Im;
  34.   Result.Im := A.Re * B.Im + A.Im * B.Re;
  35. end;
  36.  
  37.  
  38. procedure FFT(var Data: TComplexArray; Inverse: Boolean);inline;
  39. var
  40.   i, j, k, n, m: Integer;
  41.   w, wm, t, u: TComplex;
  42.   theta, sign: Double;
  43. begin
  44.   n := Length(Data);
  45.   if n <= 1 then Exit;
  46.  
  47.   { Bit-reversal permutation }
  48.   j := 0;
  49.   for i := 0 to n - 1 do
  50.   begin
  51.     if i < j then
  52.     begin
  53.       t := Data[i];
  54.       Data[i] := Data[j];
  55.       Data[j] := t;
  56.     end;
  57.     k := n shr 1;
  58.     while (j >= k) and (k > 0) do
  59.     begin
  60.       Dec(j, k);
  61.       k := k shr 1;
  62.     end;
  63.     Inc(j, k);
  64.   end;
  65.  
  66.   { Set sign for forward/inverse FFT }
  67.   if Inverse then
  68.     sign := -1.0
  69.   else
  70.     sign := 1.0;
  71.  
  72.   { Iterative FFT }
  73.   m := 2;
  74.   while m <= n do
  75.   begin
  76.     theta := sign * 2.0 * Pi / m;
  77.     wm := Complex(Cos(theta), Sin(theta));
  78.    
  79.     for k := 0 to (n div m) - 1 do
  80.     begin
  81.       w := Complex(1.0, 0.0);
  82.       for j := 0 to (m div 2) - 1 do
  83.       begin
  84.         u := Data[k*m + j];
  85.         t := w * Data[k*m + j + m div 2];
  86.         Data[k*m+j]:= u + t;
  87.         Data[k*m + j + m div 2] := u - t;
  88.         w := w * wm;
  89.       end;
  90.     end;
  91.     m := m shl 1;
  92.   end;
  93.  
  94.   { Scale for inverse FFT }
  95.   if Inverse then
  96.     for i := 0 to n - 1 do
  97.     begin
  98.       Data[i].Re := Data[i].Re / n;
  99.       Data[i].Im := Data[i].Im / n;
  100.     end;
  101. end;
  102.  
  103. type
  104.   { must be power of two }
  105.   range = 0..16383;
  106.  
  107. var
  108.   Data: TComplexArray = nil;
  109.   i: Integer;
  110.   time0, time1, time2 : int64;
  111.   timereso    : int64;
  112.   t0, t1      : double;
  113.  
  114. begin
  115.   SetLength(Data, high(range)+1);
  116.   (*
  117.   { Test signal: noise }
  118.   for i in range do
  119.     Data[i].re := Random($7fff) - ($7fff shr 1);
  120.  
  121.   { Test signal }
  122.   for i in range do
  123.   begin
  124.     Data[i].re := -1 + sin (3*2*PI*i/ (high(range)+1))
  125.                 + cos (5*2*PI*i/(high(range)+1)) * 2
  126.                 - sin (7*2*PI*i/(high(range)+1)) * 3
  127.                 + cos (8*2*PI*i/(high(range)+1)) * 5;
  128.   end;
  129.   *)
  130.   { Test signal (sine wave + noise) }
  131.   for i in range do
  132.     data[i].re:=sin(2*Pi*i/(high(range)+1))+((Random(100)/100.0)-0.5);
  133. (*
  134.   for i in range do
  135.     WriteLn(Format('%.2f + %.2fi', [Data[i].Re, Data[i].Im]));
  136.   writeln;
  137. *)
  138.   // timing
  139.   QueryPerformanceFrequency(timereso);
  140.   QueryPerformanceCounter(time0);
  141.  
  142.   { forward FFT }
  143.   FFT(Data,false);  
  144.   QueryPerformanceCounter(time1);
  145.  
  146.   { inverse FFT }
  147.   FFT(Data,true);  
  148.   QueryPerformanceCounter(time2);
  149.  
  150.   t0 := ((time1-time0) / timereso) / (high(range) + 1);
  151.   t1 := ((time2-time1) / timereso) / (high(range) + 1);
  152.  
  153.    
  154.   WriteLn(Format('%d-points FFT in nanoseconds: %.0f ns.', [high(range)+1, t0 * 1000000000]));
  155.   WriteLn(Format('%d-points IFFT + scaling    : %.0f ns.', [high(range)+1, t1 * 1000000000]));
  156. (*  
  157.   { accuracy check }
  158.   for i in range do
  159.     WriteLn(Format('%.2f + %.2fi', [Data[i].Re, Data[i].Im]));
  160. *)
  161.  
  162. end.
Resolution is ns, not us (10^9 instead of 10^6);

The code is based on a mix of the above code and what I learned in about 2002 from fftreal.pas by Laurent de Soras and - Pascal version - Frederic vanMol (Fruity loops/fl studio).
This is fast enough to perform the filter over more than 100 mono channels @ 44100hz in real-time. (slow intel hardware)

In this code, the filter is not applied yet, but the wave is the same.
(I get over 450 mono channels on a Ryzen 7000 series laptop)
Commented out check, but passes.

I basically removed the recursion. It also is very simple to understand.
(at least, much better than the planes in fftw  :D )

Also note input/output is double for the signal provided. im is only relevant for the frequency part, not for the time domain.
Have to get back to the early 2000's and test it on my VST's.

But this is a really good noise remover if the noise is over the wave. I would first apply a IIR high-pass at ~50hz and/or a DC over the signal, though. That's just a 1 pole HP filter (for -3 Db).
Oh, times go by: not necessary to escape FIR filters nowadays for real-time, it seems.
« Last Edit: October 20, 2025, 07:41:06 am by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

440bx

  • Hero Member
  • *****
  • Posts: 5820
Re: FFT performance
« Reply #23 on: October 19, 2025, 10:57:21 pm »
And since this error reproduces on 3.2.4/3.2.2 but not on FPC [main], I will stop trying to minimize it, as it seems to have already been fixed (I hope) in FPC [main]

OFF TOPIC (and to anyone who wishes to answer - and knows the real answer):

I believe this question has already been answered but, I've lost track of things... the question is, why is it that the upcoming v3.2.4 doesn't include everything in the fixes version of FPC ?  (seems logical to include all fixes in whatever the next release is...)

Thank you for your help.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

ALLIGATOR

  • Sr. Member
  • ****
  • Posts: 302
  • I use FPC [main] 💪🐯💪
Re: FFT performance
« Reply #24 on: October 20, 2025, 04:58:44 am »
why

I think there are several reasons for this:

 - the complexity of the transfer,
 - they didn't notice a useful fix that could have been transferred,
 - the team is small, and what they have is their hobby in their free time,
 - few people outside the team are actively involved,
 - as if the fixes themselves could break something if it's something non-trivial,
 - maybe because of the long absence of intermediate releases - people have simply tasted that great feeling of being at the center of events with FPC[main] 😁,
 - change of VCS,
 - move to Gitlab,
 - change of release manager,
 - complexity of FPC code/architecture (that's what people say, I can't judge the quality of the code myself),
 - something else

Which of these is most common in cases of unported fixes - I can't say

It would be interesting to hear other opinions ☺️
I may seem rude - please don't take it personally

440bx

  • Hero Member
  • *****
  • Posts: 5820
Re: FFT performance
« Reply #25 on: October 20, 2025, 07:11:27 am »
It would be interesting to hear other opinions ☺️
Yes, it would certainly be nice to hear the reason(s) from one of more of the developers.

I'm sure I'm not the only one who'd be interested in "hearing" what they have to say.
FPC v3.2.2 and Lazarus v4.0rc3 on Windows 7 SP1 64bit.

Thaddy

  • Hero Member
  • *****
  • Posts: 18363
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #26 on: October 20, 2025, 07:29:35 am »
@440bx
On the off-topic:
The reason is simply that imo an error was made: fixes was not branched to 3.2.5 when the RC1 was frozen.
If you freeze a RC, you also  branch its fixes.(not!) and close its development branch(3.2.3)
Hence fixes 3.2.3 is ahead of 3.2.4. Where the fixes that matter should have been back-ported from 3.2.5 to subsequent RC's and the release.
I warned for that at the time and this is the result.
« Last Edit: October 20, 2025, 07:37:20 am by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #27 on: October 20, 2025, 10:25:46 am »
Back on topic:
Thaddy, your version is similar to my own one, except I added a number of optimisations like avoiding function calls and assignments which use REPMOVSL, tabulating the sin/cos, and stripping the algorithm down to only process a real-valued input vector (which is all I ever need). I also found that packing the butterfly operation into an inlined routine is faster than doing it directly within the code. Jonas once explained to me the reason is that the compiler cannot (yet) optimise away all array index calculations.

Anyway, my result is 10us for 1024 points, and I would be really impressed if you can do it faster, without using external libraries or assembler, particularly where you don't even bother with any optimisations.
Long story short, your time measurement via QueryPerformanceCounter goes wrong somewhere. Measured correcly in a loop, your algorithm takes 300µs for a 1024 points FFT (500µs for the iFFT) with FPC3.2.4. This is credible and as expected, and within the range of the recursive algorithm.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12536
  • FPC developer.
Re: FFT performance
« Reply #28 on: October 20, 2025, 10:18:28 pm »
I dove up my old code. Todays trunk with AVX2 options on windows 11 on a Ryzen 5700X. But I believe AVX1 will be enough too as it is floating point

With ASM 19-23 us, without about 60us for 1024 points in single math. So that would be 4 times slower than 10us for 1024 double math.

1024 is factored as 4*4*8*8

_BUT_ this is generic code and can also do other non power of two sample lengths. Maybe it can be accelerated somewhat by not doing the factor analysis for every samplesize again each time.

Do you have your most recent code somewhere that I can benchmark on the same machine?

« Last Edit: October 20, 2025, 10:27:33 pm by marcov »

Thaddy

  • Hero Member
  • *****
  • Posts: 18363
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #29 on: October 21, 2025, 07:54:36 am »
I cross-posted this, so I moved it to "standart function" subject.
It is relevant here too.
« Last Edit: October 21, 2025, 09:24:09 am by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

 

TinyPortal © 2005-2018