I finally got my audio plugin limping along in Pascal to the point where I can do some preliminary tests. Once I realized that I was debugging my knowledge and understanding of Pascal rather than my program, things started falling into place. The code being tested is more or less a direct translation of the C++ version.
Here are the results just from eyeballing the DAW's CPU meter while idling and then running a loop using a more demanding patch:
| | C++ | Opt. C++ | Pascal | Opt. Pascal |
|---|---|---|---|---|
| Idle | 6% | 3% | 9% | 7% |
| Run | 8% | 4.5% | 14% | 12% |
As you can see, the C++ code improves by 40-50% with optimization. Pascal improves by only about 15-20% and, even optimized, is still slower than the unoptimized C++.
Here are the compiler switches that I've used for the test, which are basically on par with the clang settings, targeting 64-bit macOS 10.6+:
fpc -O4 -OoAUTOINLINE -Xs -XX -Sv -Si -CfSSSE3 -CpCOREI
If I can improve the optimized Pascal to running at about 6%, then I would be satisfied with adopting it over C++ for cross-platform audio development. The advantages I've found so far using just Geany+FPC over Mac+Xcode+Clang/LLDB + PC+CodeBlocks+GCC/GDB are huge! (64-bit Cocoa Lazarus just isn't there yet for me.)
Short of hand-tuning assembly, are there any other tricks I can use to lure the compiler into generating better code? I have converted all the C++ const parameters to constref, made sure all records and arrays are passed by reference, tried Move and Fill instead of plain for loops, tried -OoUnrollLoops and -OoFastMath, tried pointer math instead of array indexing, etc. All the little things that used to work on older, less intelligent compilers. I got about 0.25-0.5% improvement with Move, which is used heavily, but FillQWord, if anything, seemed slower than plain for loops. Nothing else seems to work. Using plain "fpc -O2" with nothing else gives me the exact same result!
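
To make it concrete, here is a minimal sketch of the kind of substitution I mean. It is not taken from the plugin: the names (TBlock, BlockSize, CopyLoop, CopyMove, ClearLoop, ClearFill) and the block length are made up for illustration, and it only checks that the variants behave the same, not which one is faster.

```pascal
program CopyDemo;
{$mode objfpc}

const
  BlockSize = 512;  // hypothetical audio block length, not from the plugin

type
  TBlock = array[0..BlockSize - 1] of Single;

// constref: the array is passed by reference and may not be modified
procedure CopyLoop(constref Src: TBlock; out Dst: TBlock);
var
  i: Integer;
begin
  for i := 0 to BlockSize - 1 do
    Dst[i] := Src[i];
end;

// Same copy using the RTL's Move (count is in bytes)
procedure CopyMove(constref Src: TBlock; out Dst: TBlock);
begin
  Move(Src, Dst, SizeOf(TBlock));
end;

// Clear with a plain loop
procedure ClearLoop(out Dst: TBlock);
var
  i: Integer;
begin
  for i := 0 to BlockSize - 1 do
    Dst[i] := 0.0;
end;

// Clear with FillQWord (count is in 8-byte units; an all-zero bit
// pattern is 0.0 for Single, and SizeOf(TBlock) is a multiple of 8 here)
procedure ClearFill(out Dst: TBlock);
begin
  FillQWord(Dst, SizeOf(TBlock) div 8, 0);
end;

var
  A, B, C: TBlock;
  i: Integer;
  Same: Boolean;
begin
  for i := 0 to BlockSize - 1 do
    A[i] := i * 0.5;

  CopyLoop(A, B);
  CopyMove(A, C);
  Same := True;
  for i := 0 to BlockSize - 1 do
    if B[i] <> C[i] then
      Same := False;
  WriteLn('copies match: ', Same);

  ClearLoop(B);
  ClearFill(C);
  Same := True;
  for i := 0 to BlockSize - 1 do
    if (B[i] <> 0.0) or (C[i] <> 0.0) then
      Same := False;
  WriteLn('clears match: ', Same);
end.
```

Wrapping each pair in a timing loop would show whether the RTL call or the plain loop wins for a given block size and set of switches.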

I'm not looking for miracles here, but my preliminary tests showed that the mathy bits were on par with C code, so there's probably other housekeeping stuff going on that could be minimized.