Recent

Author Topic: FFT performance  (Read 4214 times)

Nitorami

  • Hero Member
  • *****
  • Posts: 598
FFT performance
« on: October 14, 2025, 03:14:55 pm »
Just FYI on FFT performance. I refer to this topic https://forum.lazarus.freepascal.org/index.php/topic,72461.msg567445 which discusses a recursive implementation of the FFT algorithm, translated from C code.

The recursive FFT is very elegant but not the most efficient. Since I am using FFTs a lot, I was interested in the performance, and benchmarked a few variations. Instead of unit ucomplex, I used my own complex unit because it supports procedural calls instead of operators, see below; other than that, it is fairly equivalent.

Results see the attached table. What it means:

"dynamic arrays" = using dynamic arrays as in the original code
"GetMem" = using manual memory management instead (GetMem/FreeMem)
"PooledMem" = using a very simple memory pool in static memory

"Almost Original" is the original code, except that instead of cexp() which expects a complex argument, I calculate sine / cosine directly, avoiding the unnecessary exp() calculation in cexp().
"math.sincos": use sincos() from unit math instead of separate sine and cosine
"Tabulated sincos": use precalculated sine / cosine tables instead
"fast complex copy": Complex numbers are records, and when copying C1 := C2, FPC uses  REP MOVSW which is absurdely slow with small structures on x86. Copying the record fields individually is much faster.
"procedural style": Similarly, functions and operators use REP MOVSW to fill in the result. Procedures like mul (C1,C2,C3) are much faster than C3 := C1*C2.
 
The results are for a n=1024 FFT, IEEE double, FPC3.2.4-rc1 for 32 bit on windows 11, optimisation Level O1 (O2 causes AVs), FPUtype x87. For comparison:
- the original code takes 950µs
- my own optimized non-recursive FFT (no assembler) takes 10µs.
« Last Edit: October 14, 2025, 03:45:11 pm by Nitorami »

ALLIGATOR

  • Sr. Member
  • ****
  • Posts: 281
  • I use FPC [main] 💪🐯💪
Re: FFT performance
« Reply #1 on: October 14, 2025, 06:49:15 pm »
What caught my attention in all this was this part:
O2 causes AVs), FPUtype x87. For comparison:
Can you share this part of the code? It might be a bug in the compiler - then I could create a bug report
I may seem rude - please don't take it personally

Thaddy

  • Hero Member
  • *****
  • Posts: 18303
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #2 on: October 14, 2025, 07:00:17 pm »
I leave that to the hexagonal wheels, the re-inventors. Good night.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #3 on: October 14, 2025, 07:05:41 pm »
This IS a compiler bug, and I reported a bug with O2 earlier- can't remember the issue - and seem to remember I am not the only one. This is difficult to trace, as every small change in the code somewhere may make the error disappear, even the order how local variables are declared. Problem is the line number reported by the compiler points to a completely unsuspicious part of codeso you need to look into assembler.

ALLIGATOR

  • Sr. Member
  • ****
  • Posts: 281
  • I use FPC [main] 💪🐯💪
Re: FFT performance
« Reply #4 on: October 14, 2025, 07:43:38 pm »
I would take a look, but I don't have the code )

UPD: However, if you have already reported these errors, I have already found the error reports you created, ok.
« Last Edit: October 14, 2025, 07:49:25 pm by ALLIGATOR »
I may seem rude - please don't take it personally

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #5 on: October 14, 2025, 07:48:19 pm »
I'll try to prepare an example crashing with -O2 tomorrow.

jamie

  • Hero Member
  • *****
  • Posts: 7300
Re: FFT performance
« Reply #6 on: October 15, 2025, 12:18:11 am »
I have a project that uses this, it's an old project but it still works.

The only true wisdom is knowing you know nothing

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: FFT performance
« Reply #7 on: October 15, 2025, 08:47:39 am »
I use Nils Haeck's version that can also do non power of two, my typical samplesize is 400.


Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #8 on: October 15, 2025, 10:33:06 am »
@jamie, marcov: Thank you. I mainly wanted to show how the FPC is really inefficient at handling small records like complex numbers. We get a factor 3 in performance merely by manually shaping the code to avoid ":=" record assignments and operators. I think if the compiler would avoid REP MOVSW on small structures, this would be much more worthwhile than many micro-optimizations proposed in the forum or on the bug tracker. But I have no idea what effort that would be. I may just issue a bug report and see.

@ALLIGATOR
I am not sure if the bug is the same as that reported earlier. It is related to optimisation Level 2 and inlining. If you want to have a look, here is the code. It will crash with an AV when compiled with  -Mobjfpc -O2 -OpCOREAVX -CpCOREAVX -Si.

Initially I thought it was my fault, maybe a range or memory violation, or stack overflow. But no error is reported with all checks ON, heaptrc reports no error, and the code runs ok when compiled with even slightly different settings than above. So I believe this is the O2 + inlining bug, but I cannot read assembler and tell where exactly the problem is.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: FFT performance
« Reply #9 on: October 15, 2025, 10:43:26 am »
@jamie, marcov: Thank you. I mainly wanted to show how the FPC is really inefficient at handling small records like complex numbers. We get a factor 3 in performance merely by manually shaping the code to avoid ":=" record assignments and operators. I think if the compiler would avoid REP MOVSW on small structures, this would be much more worthwhile than many micro-optimizations proposed in the forum or on the bug tracker. But I have no idea what effort that would be. I may just issue a bug report and see.

Efficiency of string primitives is afaik processor dependent. On the intel side, post Skylake based CPUs should be better. (so 10th or 11th generation with a letter in the name, or 12th generation Alderlake or later). On the AMD side I don't know precisely

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #10 on: October 15, 2025, 11:01:41 am »
I'm on a 12th Gen Intel Core(TM) i5-1245U and it is abysmal. I tested REP MOVSW performance a while ago and don't remember the details but remember that copying larger structures like 500 bytes or so is, in absolute times, faster than copying a complex double. So I need to copy the records elementwise, to force the compiler to use individual MOVs.

Edit: Test results, copying a static array of dword of various length on 12th Gen Intel Core(TM) i5-1245U:

array length          time to copy, ns   
1                      0.3   
2                      0.3   
4                      23 <- this is equivalent to type complex (double)
8                      23   
16                        8   
32                        8   
64                        8   
128                      13   
256                      14   
512                      16   
1024              22   

Absurd.
« Last Edit: October 15, 2025, 11:46:49 am by Nitorami »

creaothceann

  • Full Member
  • ***
  • Posts: 196
Re: FFT performance
« Reply #11 on: October 15, 2025, 12:54:33 pm »
The results are for [...] 32 bit

What about 64-bit?
And don't start an argument, I am right.

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #12 on: October 15, 2025, 01:18:14 pm »
These are the times for 64bit target (Lazarus 2.2.6)

array of dword,    time to copy, ns (target 64 bit)
length   

1                          1.8
2                          1.8
4                          1.8
8                          22
16                         9
32                         9
64                         9
128                      13
256                      14
512                      16
1024                    24

As you can see, a complex (double) record would just still be in the fast region, and consequently, the "fastcopy" and "procedural style" mods from above have no benefit here. From a record size of 8 dwords and above, we have the same oddities as in 32 bit.
« Last Edit: October 15, 2025, 02:03:14 pm by Nitorami »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12523
  • FPC developer.
Re: FFT performance
« Reply #13 on: October 15, 2025, 01:29:52 pm »
I compiled with trunk, but didn't see rep movsw's.

Nitorami

  • Hero Member
  • *****
  • Posts: 598
Re: FFT performance
« Reply #14 on: October 15, 2025, 01:36:41 pm »
Code (FPC 3.2.4-rc1, 32 bit, Textmode IDE... yes I know... I am an old fart):

Code: Pascal  [Select][+][-]
  1. uses sysutils;
  2.  
  3. const size = 2*2;
  4.       iter = 100000*1000;
  5. type TRec = array [1..size] of qword;
  6.  
  7. var i: integer;
  8.     T0: TDateTime;
  9.     s, s1: TRec;
  10.  
  11. begin
  12.   T0 := Now;
  13.   for i := 1 to iter do begin
  14.     s1 := s;
  15.   end;
  16.   writeln ((now-T0)*SecsPerDay/iter*1e9:2:2);
  17.   readln;
  18. end.
  19.  

Assembler:
Code: Pascal  [Select][+][-]
  1. 14          s1 := s;
  2. $004016E5:      mov    $0x41b040,%edi
  3. $004016EA:      mov    $0x41b020,%esi
  4. $004016EF:      mov    $0x8,%ecx
  5. $004016F4:      rep movsl %ds:(%esi),%es:(%edi)
  6. $004016F6:      cmp    $0x5f5e100,%eax
  7. $004016FB:      jge    0x4016ff <main+63 at c:/users/e09neum0/documents/pas/bugs/repmovsw.pas:16>
  8. $004016FD:      jmp    0x4016e0 <main+32 at c:/users/e09neum0/documents/pas/bugs/repmovsw.pas:13>
  9.  

I just see that I used qword here, while my original test was with dword. Sorry, relict of tinkering around. But does not make a principal difference.
« Last Edit: October 15, 2025, 01:40:21 pm by Nitorami »

 

TinyPortal © 2005-2018