Recent

Author Topic: FFT performance  (Read 4538 times)

Josh

  • Hero Member
  • *****
  • Posts: 1428
Re: FFT performance
« Reply #30 on: October 21, 2025, 11:20:40 am »
Hi Thaddy,
Been twiddling with your code, not finished yet.
Result
Code: Pascal  [Select][+][-]
  1. 16384-point FFT Forward (avg 100 runs, unroll=128): 181.3 ns/point
  2. 16384-point FFT Inverse (avg 100 runs, unroll=128): 180.4 ns/point

I so far allowing unrolling in blocks ie 16,8,4 , have made a few changes so far pi2=pi*2, as its used in side loops, created a BitReverse32 using bitshifting etc, experimenting with reverse lookup table
« Last Edit: October 21, 2025, 11:23:26 am by Josh »
The best way to get accurate information on the forum is to post something wrong and wait for corrections.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #31 on: October 21, 2025, 12:30:26 pm »
As Thaddy requested, attached my old FFT routine including a simple textmode test program. I fiddled with the FFT for quite a while, because I use it often - and because it was fun.
My own working version uses dynamic arrays and is based on several libraries, but I stripped it down to keep it simple. The FFT_core routine attached takes a pointer to complex, a length parameter and a bit for the direction. The optimisation to pure REAL valued input vectors, as sufficient for most applications, requires separate pre- and postprocessing steps, and can additionally reduce the runtime by almost factor 2, but is NOT contained here.

It takes 12us for a 1024 points forward FFT on my machine, FPC 3.2.4, optimisation level O1, FPUTYPE SSE2, COREAVX, inlining ON.

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #32 on: October 21, 2025, 12:53:03 pm »
Interesting.
I find the swap unnecessary on 64bit, there's no rep.
Code: Pascal  [Select][+][-]
  1. procedure myswap (var a,b: complex); inline;  //swap 2 complex numbers. No need to avoid REPMOVSL
  2. begin
  3.   c := a;
  4.   a := b;
  5.   b := c;
  6. end;

Code: ASM  [Select][+][-]
  1.  #c := a;
  2.         vmovdqu (%rcx),%xmm0
  3.         vmovdqa %xmm0,(%rsp)
  4. # [40] a := b;
  5.         vmovdqu (%rdx),%xmm0
  6.         vmovdqu %xmm0,(%rcx)
  7. # [41] b := c;
  8.         vmovdqa (%rsp),%xmm0
  9.         vmovdqu %xmm0,(%rdx)

Another optimization might be my new version of bsr. That most definitely make a difference:
https://forum.lazarus.freepascal.org/index.php/topic,72503.msg568073.html#msg568073

I can't compile your code because of too many missing units.
« Last Edit: October 21, 2025, 01:10:44 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #33 on: October 21, 2025, 01:07:47 pm »
Quote
I find the swap unnecessary on 64bit, there's no rep.

Good ! Same with trunk. I just usually work on 32bit, and there we have the REPMOVSL at least up to FPC 3.2.4.

I will have a look at your BSR, but I don't expect a lot to be gained here. I tried a precalculated reverse indexing table before, and it made no measureable difference.

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #34 on: October 21, 2025, 01:18:18 pm »
The bsr is more like an intrinsic except it's one call one jump. there is another instruction on modern cpu's that I am now testing, lzcnt.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #35 on: October 21, 2025, 01:30:46 pm »
Hm. I don't know assembler but your BSR (cardinal) routines from bsrbsf.pas in FPC3.2.4 32bit always return zero.... ?

Code: Pascal  [Select][+][-]
  1. uses bsrbsf;
  2.  
  3. var i,x: cardinal;
  4.  
  5. begin
  6.   for i := 1 to 5 do begin
  7.     x := 1 shl i;
  8.     writeln (x);
  9.     writeln ('BSR intrinsic : ',bsrdword(x), '    BSR macro: ',bsr (cardinal(x)));
  10.     writeln;
  11.   end;
  12. end.

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #36 on: October 21, 2025, 02:43:30 pm »
Nope,  your code:
Code: Text  [Select][+][-]
  1. 2
  2. BSR intrinsic : 1    BSR macro: 1
  3.  
  4. 4
  5. BSR intrinsic : 2    BSR macro: 2
  6.  
  7. 8
  8. BSR intrinsic : 3    BSR macro: 3
  9.  
  10. 16
  11. BSR intrinsic : 4    BSR macro: 4
  12.  
  13. 32
  14. BSR intrinsic : 5    BSR macro: 5
  15.  
  16.  
  17.  
  18. ------------------
  19. (program exited with code: 0)
  20.  
You must be doing something wrong, but I will check the version of the unit I posted.
BTW the cast is not necessary. There is an overload.
[edit] checked win and linux. your code outputs as expected.
« Last Edit: October 21, 2025, 02:51:48 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #37 on: October 21, 2025, 03:36:14 pm »
Odd. In 32bit, with FPC3.2.2 and 3.2.4, I always get zero, regardless what settings. Even when writing the function out as below, I get zero.

Code: Pascal  [Select][+][-]
  1. function bsr (n:cardinal): cardinal; nostackframe; assembler;
  2. asm
  3.   xor result, result
  4.   bsr result, n //always zero. Even mov result, n delivers zero.
  5. end;

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #38 on: October 21, 2025, 03:40:24 pm »
I tested to be sure it was not an intel/amd issue, but even on my 8+ years old dual core intel (ok, 64 bit) it works as intended.
It is only supposed to return 0 on 0 input and it does on my machines, modern Ryzen and old Intel.
You left out nostackframe, which is important. Did you use my unmodified code?
« Last Edit: October 21, 2025, 03:49:17 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #39 on: October 21, 2025, 03:47:00 pm »
Is this an issue of how parameters are passed to the function.

Code: Pascal  [Select][+][-]
  1. mov result, 3
  returns 3

Code: Pascal  [Select][+][-]
  1. mov result, n
  returns zero, regardless of what n is

Why ?

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #40 on: October 21, 2025, 03:50:10 pm »
Dunno, but it is wrong. What is your CPU?

Anyway, the first is an immediate: result,3
and the second is reg/reg: result, n
The inline assembler is smart enough to know which register n is in. See remark from PascalDragon.
Code: Pascal  [Select][+][-]
  1. {$mode objfpc}{$asmmode intel}
  2. function one(const n:dword):dword;assembler;nostackframe;
  3. asm
  4.   mov result, n
  5. end;
  6. // and
  7. function two:dword;assembler;nostackframe;
  8. asm
  9.   mov result, 3
  10. end;
  11. begin
  12.   writeln(one(3));
  13.   writeln(two());
  14. end.
both work
« Last Edit: October 21, 2025, 04:02:37 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #41 on: October 21, 2025, 03:52:54 pm »
12th Gen Intel Core i5-1245U.
I just found that it works if I remove the xor result, result.

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #42 on: October 21, 2025, 04:04:17 pm »
replace it by
Code: Pascal  [Select][+][-]
  1.  mov result,0  //reg imm
  2. // or
  3. sub result,result //reg,reg
but the xor should simply work. Most used for register clearance.
sub passes all the same tests.
« Last Edit: October 21, 2025, 04:09:52 pm by Thaddy »
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

Nitorami

  • Hero Member
  • *****
  • Posts: 604
Re: FFT performance
« Reply #43 on: October 21, 2025, 04:15:19 pm »
Same result: Zero, when using sub result, result.
Why do we need to clear result, anyway ? Won't it be overwritten then by bsr result, n?

EDIT: Isn't n passed to the function in EAX, and result is also EAX ? Then xor result, result would simply set the argument to zero.

I just read this - "Important to note, however, that the registers used for a 32-bit resp. a 64-bit target are not the same! On a 32-bit platform, the registers used for the arguments are EAX and EDX, the function return value having to be in EAX. On a 64-bit platform, the registers used for the arguments are RCX, RDX, R8 and R9, the function return value having to be in RAX."
That would explain why the code does not work in 32 bit mode.
« Last Edit: October 21, 2025, 04:24:49 pm by Nitorami »

Thaddy

  • Hero Member
  • *****
  • Posts: 18524
  • Here stood a man who saw the Elbe and jumped it.
Re: FFT performance
« Reply #44 on: October 21, 2025, 04:26:36 pm »
No, that's fine. But I can replicate it now for 32 bit, which means I need to adapt the code a little.
Also note that clearing the register is only to avoid false results when value = 0;
32 bit win > value is in eax, assuming dword
64 bit win > value is in ecx
in both result = in eax
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

 

TinyPortal © 2005-2018