One thing I'm not understanding well is your trick with "movhlps xmm1,xmm0". It's an issue with the stack, but something escapes me. Can you explain it again?
OK, this is all to do with the return conventions on 64-bit Linux (SysV x86_64 to be exact), just as Win64 has its own (4 registers, the rest on the stack, etc.).
The spec was kindly sourced by CuriousKit in this post:
That's a little confusing with Linux, because the way it's behaving implies that it's splitting the 128-bit into two, classing the lower half as SSE and the upper half as SSEUP (see pages 15-17 here: http://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf ), but then converting SSEUP to SSE because it thinks it isn't preceded by an SSE argument (which it does... the lower two floats). Maybe my interpretation is wrong, but it shouldn't need to split it across 2 registers like that. Can someone with more experience of the Linux ABI shed some light on that?
There are two type identifiers for SSE values as parameters.
X86_64_SSE_CLASS labels the first (low) 64 bits of a 128-bit SSE value.
X86_64_SSEUP_CLASS labels the next (upper) 64 bits of a 128-bit SSE value.
There can be more than one SSEUP entry for a 256-bit value.
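To make the two classes a bit more concrete, here is a rough sketch in plain C of how I read the spec's rule for handing out xmm registers to the 64-bit pieces of a returned value: an SSE piece takes the next free register, an SSEUP piece goes into the upper part of the last register used, and an SSEUP with nothing before it is treated as SSE. The enum and function names are made up for illustration, only the two classes discussed here are modelled, and this is not fpc's actual code.

```c
#include <assert.h>

/* Only the two classes discussed in this post; the ABI defines more. */
enum eightbyte_class { CLASS_SSE, CLASS_SSEUP };

/* For each 64-bit piece of a returned value, record which xmm register
   carries it (0 = xmm0, 1 = xmm1). Returns the number of registers used. */
int assign_return_regs(const enum eightbyte_class *cls, int n, int *reg_out)
{
    int next = 0;   /* next free xmm register */
    int last = -1;  /* last register handed out, -1 if none yet */
    for (int i = 0; i < n; i++) {
        /* SSE takes a fresh register. An SSEUP with no register before
           it is converted to SSE, so it takes a fresh register too. */
        if (cls[i] == CLASS_SSE || last < 0)
            last = next++;
        /* Otherwise SSEUP rides in the upper part of the last register. */
        reg_out[i] = last;
    }
    return next;
}

/* Small usage check of the two cases relevant to this thread. */
void demo_classes(void)
{
    int regs[2];
    enum eightbyte_class sse_sseup[2] = { CLASS_SSE, CLASS_SSEUP };
    enum eightbyte_class sse_sse[2]   = { CLASS_SSE, CLASS_SSE };

    assert(assign_return_regs(sse_sseup, 2, regs) == 1); /* both in xmm0 */
    assert(regs[0] == 0 && regs[1] == 0);

    assert(assign_return_regs(sse_sse, 2, regs) == 2);   /* xmm0 and xmm1 */
    assert(regs[0] == 0 && regs[1] == 1);
}
```

If that reading is right, a value returned as low xmm0 plus low xmm1 is behaving like two SSE pieces rather than SSE followed by SSEUP, which is the oddity CuriousKit points out above.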
One thing you have to keep in the back of your mind in any Unix environment when writing code at this level is that you have to take endianness into account and not write code that assumes a single architecture. So this seems to be the way the System V ABI deals with it, and thus gcc does, and therefore everyone else does. (You would want your libs to link, wouldn't you?)
Anyway as seen from the fpc compiler code:
    s128real:
      begin
        classes[0].typ:=X86_64_SSE_CLASS;
        classes[0].def:=carraydef.getreusable_no_free(s32floattype,2);
        classes[1].typ:=X86_64_SSEUP_CLASS;
        classes[1].def:=carraydef.getreusable_no_free(s32floattype,2);
        result:=2;
      end;
This is exactly how the compiler sees a 128-bit real. So in general terms, if we did not use nostackframe, then at times the fpc compiler placed (or expected) the result on the stack. Unlike on other platforms, the stack did not contain a pointer; the compiler had allocated 128 bits on the stack for the contents of the xmm register to be copied to.
After the return from the assembler code, the compiler then did a movq on each of the two qwords and placed the X86_64_SSE_CLASS half in low xmm0 and the X86_64_SSEUP_CLASS half in low xmm1. It does this for routines it generates itself. Here is the postamble of the native Pascal version of operator + that does not even use SIMD code for its calculations.
# Var $result located at rsp+0, size=OS_128
.........
# [158] End;
movq (%rsp),%xmm0
# Register xmm1 allocated
movq 8(%rsp),%xmm1
leaq 24(%rsp),%rsp
# Register rsp released
ret
# Register xmm0,xmm1 released
.Lc12:
When we use nostackframe, the above postamble does not occur. Therefore we got good values for x and y (low xmm0) but garbage for z and w: the caller was taking whatever happened to be in low xmm1.
movhlps xmm1,xmm0 copies the high 64 bits of xmm0 (z and w) into the low 64 bits of xmm1. So using it as the last instruction, after whatever you would do to leave the result in xmm0, ensures the Unix ABI is conformed to, and we get the right values back.
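If it helps to see it outside of assembler, here is a small model of the situation in plain C. It is only a sketch: each xmm register is modelled as an array of four floats, the names (xmm_t, return_vec4 and so on) are invented for illustration, and the -99 values simply stand in for whatever stale data happens to be in xmm1.

```c
#include <assert.h>

/* Model of one xmm register: four float lanes, lane 0 is the lowest. */
typedef struct { float lane[4]; } xmm_t;

typedef struct { float x, y, z, w; } vec4;

/* Model of "movhlps dst,src": copy the high 64 bits of src (lanes 2,3)
   into the low 64 bits of dst (lanes 0,1). High half of dst is untouched. */
void movhlps(xmm_t *dst, const xmm_t *src)
{
    dst->lane[0] = src->lane[2];
    dst->lane[1] = src->lane[3];
}

/* What the caller does after the call, as described in this post:
   x,y come from low xmm0 and z,w come from low xmm1. */
vec4 caller_reads_result(const xmm_t *xmm0, const xmm_t *xmm1)
{
    vec4 r = { xmm0->lane[0], xmm0->lane[1], xmm1->lane[0], xmm1->lane[1] };
    return r;
}

/* Hypothetical nostackframe routine that leaves its whole result in xmm0.
   apply_fix selects whether the final movhlps is executed. */
vec4 return_vec4(int apply_fix)
{
    xmm_t xmm0 = { { 1.0f, 2.0f, 3.0f, 4.0f } };         /* the real result */
    xmm_t xmm1 = { { -99.0f, -99.0f, -99.0f, -99.0f } }; /* stale contents  */
    if (apply_fix)
        movhlps(&xmm1, &xmm0);
    return caller_reads_result(&xmm0, &xmm1);
}

/* Usage check: without the fix z,w are stale, with it they are correct. */
void demo_movhlps(void)
{
    vec4 bad = return_vec4(0);
    assert(bad.x == 1.0f && bad.y == 2.0f);
    assert(bad.z == -99.0f && bad.w == -99.0f);

    vec4 good = return_vec4(1);
    assert(good.x == 1.0f && good.y == 2.0f);
    assert(good.z == 3.0f && good.w == 4.0f);
}
```

Without the movhlps the caller still gets x and y right, which is why the bug looked like "half the vector is garbage" rather than a crash.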
Phew... long post. I hope this makes sense to you, Jerome.
Peter