Just an FYI on FFT performance. I am referring to this topic
https://forum.lazarus.freepascal.org/index.php/topic,72461.msg567445, which discusses a recursive implementation of the FFT algorithm, translated from C code.
The recursive FFT is very elegant, but not the most efficient. Since I use FFTs a lot, I was interested in the performance and benchmarked a few variations. Instead of unit ucomplex I used my own complex unit, because it supports procedural calls instead of operators (see below); other than that, it is fairly equivalent.
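For orientation, a generic textbook recursive radix-2 FFT (decimation in time) looks roughly like this. This is my own minimal sketch, not the exact code from the linked topic, and the names (TComplex, FFT) are my own:

```pascal
uses
  Math;

type
  TComplex = record
    re, im: Double;
  end;
  TComplexArray = array of TComplex;

// Textbook recursive radix-2 FFT (decimation in time).
// Length(A) must be a power of two.
procedure FFT(var A: TComplexArray);
var
  n, k: Integer;
  Even, Odds: TComplexArray;
  s, c, tr, ti: Double;
begin
  n := Length(A);
  if n <= 1 then Exit;
  // Split into even- and odd-indexed halves
  SetLength(Even, n div 2);
  SetLength(Odds, n div 2);
  for k := 0 to n div 2 - 1 do
  begin
    Even[k] := A[2 * k];
    Odds[k] := A[2 * k + 1];
  end;
  FFT(Even);
  FFT(Odds);
  // Combine with twiddle factors W_k = exp(-2*pi*i*k/n)
  for k := 0 to n div 2 - 1 do
  begin
    SinCos(-2.0 * Pi * k / n, s, c);
    tr := c * Odds[k].re - s * Odds[k].im;
    ti := c * Odds[k].im + s * Odds[k].re;
    A[k].re := Even[k].re + tr;
    A[k].im := Even[k].im + ti;
    A[k + n div 2].re := Even[k].re - tr;
    A[k + n div 2].im := Even[k].im - ti;
  end;
end;
```

Note the two dynamic array allocations per call level, which is exactly where the "GetMem" and "PooledMem" variations below come in.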
The results are in the attached table. What the rows mean:
"dynamic arrays": using dynamic arrays, as in the original code
"GetMem": using manual memory management (GetMem/FreeMem) instead
"PooledMem": using a very simple memory pool in static memory
"Almost Original": the original code, except that instead of cexp(), which expects a complex argument, I calculate sine and cosine directly, avoiding the unnecessary exp() call inside cexp()
"math.sincos": using sincos() from unit math instead of separate sine and cosine calls
"Tabulated sincos": using precalculated sine/cosine tables instead
"fast complex copy": complex numbers are records, and for a copy C1 := C2 FPC emits REP MOVSW, which is absurdly slow for small structures on x86; copying the record fields individually is much faster
"procedural style": similarly, functions and operators use REP MOVSW to fill in the result; procedures like mul(C1, C2, C3) are much faster than C3 := C1 * C2
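To illustrate the "Tabulated sincos" idea: the twiddle factors W_k = exp(-2*pi*i*k/N) are precalculated once and then only indexed inside the FFT, so no trigonometric function is called in the hot path. A minimal sketch (InitTwiddles and Twiddle are my own names, not from the benchmarked code):

```pascal
uses
  Math;

type
  TComplex = record
    re, im: Double;
  end;

const
  N = 1024;

var
  Twiddle: array[0..N div 2 - 1] of TComplex;

// Precompute W_k = exp(-2*pi*i*k/N) once, outside the FFT.
// SinCos() from unit Math delivers both values in one call,
// which on x87 maps to a single FSINCOS instruction.
procedure InitTwiddles;
var
  k: Integer;
  s, c: Double;
begin
  for k := 0 to N div 2 - 1 do
  begin
    SinCos(-2.0 * Pi * k / N, s, c);
    Twiddle[k].re := c;
    Twiddle[k].im := s;
  end;
end;
```

For a fixed N this trades a small, cache-friendly table (8 KB here) for all the per-butterfly sincos() calls.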
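The last two rows can be sketched like this (a minimal illustration of the technique, not the author's actual unit): a procedure that writes into an out parameter avoids the hidden result record and its block copy, and an explicit field-by-field copy avoids the REP MOVSW block move FPC generates for C1 := C2.

```pascal
type
  TComplex = record
    re, im: Double;
  end;

// Procedural multiply: the result is written directly into R,
// so no temporary record has to be returned and block-copied.
// Assumes R is not the same variable as A or B.
procedure Mul(const A, B: TComplex; out R: TComplex);
begin
  R.re := A.re * B.re - A.im * B.im;
  R.im := A.re * B.im + A.im * B.re;
end;

// Field-by-field copy: two scalar moves instead of the
// REP MOVSW block copy FPC emits for Dst := Src.
procedure FastCopy(const Src: TComplex; out Dst: TComplex);
begin
  Dst.re := Src.re;
  Dst.im := Src.im;
end;
```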
The results are for an n=1024 FFT, IEEE double, FPC 3.2.4-rc1 for 32 bit on Windows 11, optimisation level -O1 (-O2 causes access violations), FPU type x87. For comparison:
- the original code takes 950µs
- my own optimized non-recursive FFT (no assembler) takes 10µs.