FillQWord is likely the least optimized of the Fill* routines. I'd suggest you use FillChar instead, which is usually the best optimized one.
Haha! I guess I shouldn't believe everything I read! Per the manual for FillByte:
When the size of the memory location to be filled out is a multiple of 2 bytes, it is better
to use Fillword, and if it is a multiple of 4 bytes it is better to use FillDWord, these routines are
optimized for their respective sizes.
I assume that applies to FillQWord as well. However, the FillQWord did slow things, but I'll give FillByte a whirl.
Well, optimized for the size does not necessarily mean optimized overall as well.

In fact, for FillQWord usually no assembly implementation exists. Even FillDWord and FillWord don't have an assembly implementation on most targets either. Only FillByte is usually implemented in assembly.
FillQWord is likely the least optimized of the Fill* routines. I'd suggest you use FillChar instead, which is usually the best optimized one.
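A minimal sketch of the swap, assuming Free Pascal in objfpc mode. The buffer name, its size and the AllEqual helper are made up for illustration. The thing to watch: FillChar takes a count in bytes and writes the same byte everywhere, so it only replaces FillQWord when every byte of the 64-bit value is identical (zero being the usual case).

```pascal
program FillDemo;
{$mode objfpc}

const
  N = 1024; // hypothetical buffer length

var
  Buf: array[0..N - 1] of QWord;

// Hypothetical helper: checks that every element of Buf equals Value.
function AllEqual(Value: QWord): Boolean;
var
  k: Integer;
begin
  Result := True;
  for k := 0 to N - 1 do
    if Buf[k] <> Value then
      Exit(False);
end;

begin
  // Element-wise fill: the count is in QWords, any 64-bit pattern works.
  FillQWord(Buf, N, QWord($0123456789ABCDEF));
  WriteLn('FillQWord pattern ok: ', AllEqual(QWord($0123456789ABCDEF)));

  // Byte-wise fill: the count is in bytes, every byte gets the same value.
  FillChar(Buf, SizeOf(Buf), 0);
  WriteLn('FillChar zero ok: ', AllEqual(0));
end.
```

For a pattern whose bytes differ, FillChar is not a drop-in replacement and FillQWord (or a hand-written loop) is still needed.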
Wow! I just tested it. FillByte/Char is about 1.5x faster!
Who do I call to complain? 
Mere complaining probably won't help (unless someone then happens to feel motivated to look at it). Either try to speed up the generic FillQWord yourself or provide an assembly implementation for it.

To be honest, abusing variant records/unions to escape typing in Pascal is at least 40 years old. If not 50.
Keep in mind that to use such tricks the result first has to be stored into memory (into the union record). This is already a bottleneck. A good intrinsic that operates on registers should easily beat it.
In this specific case a typecast would work as well, since the value already resides in memory:
if TDoubleRec(WaveL[i][j]).B[63] then WaveL[i][j] := Sign(WaveL[i][j]) * 2 - WaveL[i][j];
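To make the typecast idea concrete, here is a small sketch, assuming Free Pascal in objfpc mode. TDoubleBits and SignBitSet are hypothetical names and are not the TDoubleRec used in the line above; the overlay just exposes the raw 64 bits so the sign bit (bit 63 of an IEEE 754 double) can be read without copying the value into a separate union variable first.

```pascal
program SignBitDemo;
{$mode objfpc}

type
  // Hypothetical overlay record: same size as Double, so a typecast is allowed.
  TDoubleBits = packed record
    case Boolean of
      False: (D: Double);
      True:  (Q: QWord);
  end;

// Hypothetical helper: True when the IEEE 754 sign bit (bit 63) is set.
function SignBitSet(const X: Double): Boolean;
begin
  Result := (TDoubleBits(X).Q shr 63) <> 0;
end;

var
  Wave: array[0..2] of Double = (1.5, -2.25, 0.0);
  i: Integer;

begin
  // The array elements already live in memory, so the cast reinterprets them in place.
  for i := Low(Wave) to High(Wave) do
    WriteLn(Wave[i]:8:2, ' sign bit set: ', SignBitSet(Wave[i]));
end.
```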
As an aside: Would this be considered a bug? In my test, Trunc was 2x faster where they should be more or less identical in speed.
The problem is that on all the x86 targets except Win64 the floating point related functions (e.g. Trunc, Frac, etc.; Floor is implemented using Trunc and Frac) are implemented using the x87 FPU instead of SSE, because the functions work for the highest available precision, which is Extended, and that requires the x87 FPU. How to implement this in a satisfying way that allows the use of SSE for these functions as well is still open to investigation.
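To illustrate why Floor inherits the cost of both calls, here is a rough sketch of a floor built from Trunc and Frac, assuming Free Pascal in objfpc mode. FloorViaTrunc is a hypothetical name for illustration, not the RTL's actual source.

```pascal
program FloorDemo;
{$mode objfpc}

// Hypothetical helper showing the Trunc + Frac approach.
function FloorViaTrunc(const X: Double): Int64;
begin
  Result := Trunc(X);
  // Trunc rounds toward zero, so negative non-integers need one step down.
  if Frac(X) < 0 then
    Dec(Result);
end;

begin
  WriteLn(FloorViaTrunc(2.7));   // prints 2
  WriteLn(FloorViaTrunc(-2.7));  // prints -3
  WriteLn(FloorViaTrunc(-3.0));  // prints -3
end.
```

Each call does two floating point operations, so on targets where Trunc and Frac go through the x87 FPU, a Floor of this shape pays for that twice.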