I only see that st0..st7 are saved across calls. Am I missing something where x87 is specified? All float examples seem to use SSE2 XMM. It is a long article, so maybe I misread, but could you quote the relevant bits?
All float examples seem to use SSE2 XMM. That is because the Windows x64 ABI does not use the x87 FPU for passing floating-point values.
The alignment, if it were required for a call to Windows as I described, derives from:
func4(__m64 a, __m128 b, struct c, float d, __m128 e, __m128 f);
// a in RCX, ptr to b in RDX, ptr to c in R8, d in XMM3,
// ptr to f pushed on stack, then ptr to e pushed on stack
__m128 is passed by reference.
Somewhere else in the MS docs it is stated that things are aligned on 64 bits. Some should be aligned on 128 bits, but the fun part here is that cmem loses the 16-byte alignment because the memsize is stored at the beginning of the allocated block ... That causes absolutely no trouble.
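To make the alignment point concrete, here is a rough sketch in C. It is not the actual cmem code: the names sized_alloc, sized_len and misalignment16, the fixed arena, and the 8-byte header are all assumptions made up for illustration. It only shows how storing the size at the start of a 16-byte-aligned block leaves the returned pointer 8 bytes past a 16-byte boundary.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the allocator discussed above (not the real
   cmem code): one 16-byte-aligned arena from which a block is carved. */
static alignas(16) unsigned char arena[256];

/* Store the requested size in an assumed 8-byte header at the start of the
   block and hand the caller the address just past that header. */
static void *sized_alloc(uint64_t memsize)
{
    if (memsize > sizeof arena - sizeof(uint64_t))
        return NULL;                            /* does not fit in the arena */
    memcpy(arena, &memsize, sizeof memsize);    /* header at block start */
    return arena + sizeof(uint64_t);            /* payload starts 8 bytes in */
}

/* Read the size back from the header that precedes the payload. */
static uint64_t sized_len(const void *p)
{
    uint64_t memsize;
    assert(p != NULL);
    memcpy(&memsize, (const unsigned char *)p - sizeof memsize, sizeof memsize);
    return memsize;
}

/* Offset of p from the previous 16-byte boundary (0 means 16-byte aligned). */
static unsigned misalignment16(const void *p)
{
    return (unsigned)((uintptr_t)p % 16u);
}
```

With this layout, misalignment16(sized_alloc(100)) is 8 while misalignment16(arena) is 0: the block itself is 16-byte aligned, but the pointer the caller gets is only 8-byte aligned.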
Anyway, the Windows ABI has nothing to say about how the compiler should lay out its generated code. x87 FPU instructions are inlined by the compiler in the generated code (as is probably done for floats on Linux x86_64 or i386).