Does your FFT routine use complex numbers? I have a routine that works in place on a 1024-sample window, no complex numbers needed.
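For anyone following along, here is a rough sketch of what an in-place FFT without a complex type can look like. It is not the routine mentioned above. It is a textbook radix-2 decimation-in-time transform that keeps the real and imaginary parts in two separate float lists. The name `fft_inplace_split` and the split-array layout are assumptions made for illustration, and the length must be a power of two (1024 works).

```python
import math

def fft_inplace_split(re, im):
    """In-place radix-2 FFT on two parallel float lists (hypothetical helper).

    re and im hold the real and imaginary parts; both are overwritten.
    The length must be a power of two.
    """
    n = len(re)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]

    # Butterfly passes: block size doubles each pass.
    size = 2
    while size <= n:
        half = size // 2
        step = -2.0 * math.pi / size
        for start in range(0, n, size):
            for k in range(half):
                wr = math.cos(step * k)   # twiddle factor, real part
                wi = math.sin(step * k)   # twiddle factor, imaginary part
                a = start + k
                b = a + half
                # Complex multiply written out with real arithmetic only.
                tr = re[b] * wr - im[b] * wi
                ti = re[b] * wi + im[b] * wr
                re[b] = re[a] - tr
                im[b] = im[a] - ti
                re[a] += tr
                im[a] += ti
        size *= 2
```

Calling it with `re` set to the 1024 samples and `im` set to 1024 zeros leaves the spectrum in the same two lists. A real implementation would precompute the twiddle factors instead of calling `cos` and `sin` inside the inner loop.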

Yes. I need mixed radix, though, because I have a 400-sample window. Still, it is always good to see a different solution.
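To illustrate why 400 samples calls for mixed radix: 400 = 2^4 * 5^2, so a pure radix-2 transform does not fit, but the length can be split by its prime factors. The sketch below is a plain recursive decimation-in-time version in Python, not the assembler routine discussed here. The names `smallest_factor` and `mixed_radix_fft` are made up for the example, and it favours clarity over speed.

```python
import cmath

def smallest_factor(n):
    """Return the smallest prime factor of n (n >= 2)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def mixed_radix_fft(x):
    """Recursive decimation-in-time FFT for any length (illustrative sketch).

    Splits the input by its smallest prime factor p, transforms the p
    interleaved sub-sequences, then recombines them with twiddle factors.
    For a prime length this degenerates into the direct DFT.
    """
    n = len(x)
    if n == 1:
        return list(x)
    p = smallest_factor(n)
    m = n // p
    # Sub-sequence r holds x[r], x[r + p], x[r + 2p], ...
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    out = [0j] * n
    for k in range(n):
        acc = 0j
        for r in range(p):
            # Each sub-transform is periodic with period m, hence k % m.
            twiddle = cmath.exp(-2j * cmath.pi * r * k / n)
            acc += twiddle * subs[r][k % m]
        out[k] = acc
    return out
```

For a 400-sample window the recursion peels off the factors 2, 2, 2, 2, 5, 5 in that order. An optimized version would use precomputed twiddle tables and hand-written radix-4 and radix-5 butterflies instead of the generic inner loop.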

The assembler version now parallelizes most re:im operations, and about half the time it even handles two complex values per instruction. It is not production-ready yet, though; it is just preparation for a planned move to 64-bit of our only remaining 32-bit application.

This application does a lot of floating-point calculation and is 140-200% slower in 32-bit than in 64-bit (using Delphi, btw). Worse, the number of calculations is only going to increase, so buying our way out of trouble with newer hardware would be costly.

That said, I benchmark mostly with Delphi, and Delphi converts every single load to double and back and does all main operations in double.

Most of the calculations are neither repeated often nor easy to parallelize, so I mostly started by analysing (creating a good benchmark) and tackling a few simple but common primitives, FFT included.