-O<x> Optimizations:
-O- Disable optimizations
-O1 Level 1 optimizations (quick and debugger friendly)
-O2 Level 2 optimizations (-O1 + quick optimizations)
-O3 Level 3 optimizations (-O2 + slow optimizations)
-O4 Level 4 optimizations (-O3 + optimizations which might have unexpected side effects)
:) Hello :)
I have exactly the same problem on FPC 3.0.2 32-bit/Windows. Tried with -Cp and -Op COREAVX/COREAVX2 and PENTIUMM.
:( I can't figure out my own login info at bugs.freepascal :(
"The math functions of the D3DX utility library are deprecated for Windows 8. We recommend that you use DirectXMath instead."
class operator TXMFLOAT4.Add(a: TXMFLOAT4; b: TXMFLOAT4): TXMFLOAT4;
begin
  result.x := a.x + b.x;
  result.y := a.y + b.y;
  result.z := a.z + b.z;
  result.w := a.w + b.w;
end;
Hello Sonny.
You may find (or not - not sure) some inspiration with AVX + FPC here:
https://sourceforge.net/p/cai/svncode/HEAD/tree/trunk/lazarus/libs/uvolume.pas
Plus some details here:
https://www.youtube.com/watch?v=qGnfwpKUTIQ
-al
-O3
-Sv
-OoFASTMATH
-CfAVX
-CpCOREAVX
-OpCOREAVX
-CPPACKRECORD=8
-CfAVX
-CpCOREAVX
-OpCOREAVX
SSE Operations :
----------------------------------------
V3 = V1 + V2 = (X: 7.50000, Y: 7.50000, Z: 7.50000, W: 1.00000) OK
V3 = V1 - V2 = (X: 2.50000, Y: 2.50000, Z: 2.50000, W: 0.00000) OK
V3 = V1 * V2 = (X: 12.50000, Y: 12.50000, Z: 12.50000, W: 0.25000) OK
V3 = V1 / V2 = (X: 2.00000, Y: 2.00000, Z: 2.00000, W: 1.00000) OK
----------------------------------------
V3 = V1 + Float = (X: 6.50000, Y: 6.50000, Z: 6.50000, W: 2.00000) OK
V3 = V1 - Float = (X: 3.50000, Y: 3.50000, Z: 3.50000, W: -1.00000) OK
V3 = V1 * Float = (X: 7.50000, Y: 7.50000, Z: 7.50000, W: 0.75000) OK
V3 = V1 / Float = (X: 3.33333, Y: 3.33333, Z: 3.33333, W: 0.33333) OK
AVX Operations :
----------------------------------------
V3 = V1 + V2 = (X: 7.50000, Y: 7.50000, Z: 7.50000, W: 1.00000) OK
V3 = V1 - V2 = (X: -2.50000, Y: -2.50000, Z: -2.50000, W: 0.00000) NOK
V3 = V1 * V2 = (X: 12.50000, Y: 12.50000, Z: 12.50000, W: 0.25000) OK
V3 = V1 / V2 = (X: 0.50000, Y: 0.50000, Z: 0.50000, W: 1.00000) NOK
----------------------------------------
V3 = V1 + Float = (X: 6.50000, Y: 6.50000, Z: 6.50000, W: 2.00000) OK
V3 = V1 - Float = (X: -3.50000, Y: -3.50000, Z: -3.50000, W: 1.00000) NOK
V3 = V1 * Float = (X: 7.50000, Y: 7.50000, Z: 7.50000, W: 0.75000) OK
V3 = V1 / Float = (X: 0.30000, Y: 0.30000, Z: 0.30000, W: 3.00000) NOK
Quote
"Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case."

As far as you can tell?... Always use packed when you want to defeat the compiler by using inline assembler. Otherwise you are in trouble before you know it.
Well, apart from the inconsequential command line..... >:D
Your problem is with V3.
It would be really helpful if you examined the actual assembler output. So: use the -a option and examine the .s file.
Also note you need to pack both arrays and records...
Quote
"You just need to invert A and B in the assembler for subtraction and division. Basically it should look like this:"

Yes, it was the problem, the inversion, but your trick is not the real solution (it doesn't work with the 2nd overloaded operator (V: TheVector; F: Single)).
class operator TXMFloat4.-(constref A, B: TXMFloat4): TXMFloat4; assembler;
asm
  VMOVAPS XMM0, [B]
  VMOVAPS XMM1, [A]
  VSUBPS  XMM0, XMM1, XMM0
  VMOVAPS [RESULT], XMM0
end;
Note how B goes first in the second two operators. Tested both of these and again, they work fine. Nothing declared as packed, also, because like I said before as far as I can tell it has absolutely no effect on anything in this case.
Yes, it was the problem, the inversion, but your trick is not the real solution (it doesn't work with the 2nd overloaded operator (V: TheVector; F: Single)).
I've done some research on the instructions and tested. So the correct form is:
VSUBPS XMM0, XMM0, XMM1
where the 1st param is the result and not the 3rd as I thought.
Now the results are ok
Thanks
Hi, thanks Akira, I did not think about it, but it doesn't resolve the problem: on my PC I still get a SIGSEGV and this message:
project1.lpr(22,0) Warning: Object file "unit1.o" contains 32-bit absolute relocation to symbol ".data.n_tc_$unit1_$$_nullvector4f".
Something is wrong with my configuration
Try
MOVAPS XMM0,[RIP+NullVector4F]
Hi Jerome,
Nice work with the Vector lib, that's the cleanest Pascal code I have come across for vector math; it demonstrates what advanced records can really do.
Tested on Win7 64 on both my AMD and Intel desktops with no problems; plugging the number ranges I use into some of the vectors, thankfully I see no loss of precision using the rsqrtps in normalize.
Loaded it into a Linux VM and I have made your nice neat code all messy with some Unix defines (I selected Unix for now, as I am about to upgrade my FreeBSD boxen to test there). This is still for 64-bit Linux; not tested on 32-bit, as I have no 32-bit OSes any more.
Peter
@Peter: a macro is a good solution I think too, but it doesn't work for me >:(
Ok I did a quick google around and it would seem there is no guaranteed way to force the compiler to put the parameters into a register.
Could someone confirm this please? It would be really good if I was wrong.
.section .text.n_glzvectormath_new$_$tglznativevector4f_$__$$_combine2$crcf2601943,"x"
.balign 16,0x90
.globl GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943
GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943:
.Lc126:
# Temps allocated between rbp-16 and rbp+0
.seh_proc GLZVECTORMATH_NEW$_$TGLZNATIVEVECTOR4F_$__$$_COMBINE2$crcF2601943
# Register rbp allocated
.Ll257:
# [960] begin
pushq %rbp
.seh_pushreg %rbp
.Lc128:
.Lc129:
movq %rsp,%rbp
.Lc130:
leaq -48(%rsp),%rsp
.seh_stackalloc 48
# Var V2 located in register r8
# Var F1 located in register r9
# Var F2 located in register rcx
# Var $self located in register rax
# Var $result located in register rdx
# Temp -16,16 allocated
.seh_endprologue
# Register rcx,rdx,r8,r9,rax allocated
.section .text.n_glzvectormath_new$_$tglzssevector4f_$__$$_combine2$tglzssevector4f$single$single$$tglzssevector4f,"x"
.balign 16,0x90
.globl GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F:
.Lc403:
.seh_proc GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
# Register rbp allocated
.Ll832:
# [2233] asm
pushq %rbp
.seh_pushreg %rbp
.Lc405:
.Lc406:
movq %rsp,%rbp
.Lc407:
leaq -32(%rsp),%rsp
.seh_stackalloc 32
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64
# Register rax,rcx,rdx,r8,r9,r10,r11 allocated
.section .text.n_glzvectormath_new$_$tglzssevector4f_$__$$_combine2$tglzssevector4f$single$single$$tglzssevector4f,"x"
.balign 16,0x90
.globl GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F
GLZVECTORMATH_NEW$_$TGLZSSEVECTOR4F_$__$$_COMBINE2$TGLZSSEVECTOR4F$SINGLE$SINGLE$$TGLZSSEVECTOR4F:
.Lc403:
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
# Var F2 located at rbp+48, size=OS_64
# [2233] asm
# Register rax,rcx,rdx,r8,r9,r10,r11 allocated
macros do not expand inside asm blocks
Quote
"note we'll see the stack size is the problem"

don't quite understand that bit.
You move the value to the stack. You should probably move to the pointer that is ON the stack.
so
mov rax, 48(%rbp)   // or whatever free register
movss (rax), %xmm2
Hi Jerome,
I have been playing around with this some more in Linux. I can get the SSECombine2 down to just the following (from your little test code)
whereas the optimum for windows would be
function SSECombine2(constref V1, V2: TGLZVector4f; Const F1, F2: Single): TGLZVector4f; assembler;
asm
  movups xmm0, [V1]
  movups xmm1, [V2]
  movss  xmm2, [F2{%H-}]
  shufps xmm3, xmm3, $00            // replicate
  shufps xmm2, xmm2, $00            // replicate
  mulps  xmm0, xmm3                 // Self * F1
  mulps  xmm1, xmm2                 // V2 * F2
  addps  xmm0, xmm1                 // (Self * F1) + (V2 * F2)
  andps  xmm0, [RIP+cSSE_MASK_NO_W]
  movups [RESULT], xmm0
end;
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64
Might it not be better to have two inc files for the implementation which are linux and win specific and can be optimized according to their respective abis?
FWIW, while this thread was running, I've been playing with SSE (albeit in Delphi, since for work) too in the past two weeks, so I thought I post some code.
It is more of an integer SSSE3 routine, rotating a block of 8x8 bytes with a loop around it for a bit of loop tiling. See rot 8x8 here (http://www.stack.nl/~marcov/rot8x8.txt).
The related stackoverflow thread is at why does haswell+ suck? (https://stackoverflow.com/questions/47478010/sse2-8x8-byte-matrix-transpose-code-twice-as-slow-on-haswell-then-on-ivy-bridge)
On the subject, I wrote a load of SSE, AVX and FMA routines primarily for graphics programming, namely taking an array of vectors and transforming them by a 4x4 matrix, for example. Would any of those be useful for your collection or for Lazarus in general? There's still some room for improvement though, since I don't take advantage of memory alignment.
I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets.
It works, but not in the advanced record:
Quote
# Var V2 located in register r8
# Var F1 located in register r9
# Var $self located in register rcx
# Var $result located in register rdx
.seh_endprologue
# Var F2 located at rbp+48, size=OS_64
Actually the only thing that solves the problem is surrounding the Asm..End block with a Begin..End :'(
...I'll see what I can do.

Quote
"On the subject, I wrote a load of SSE, AVX and FMA routines primarily for graphics programming, namely taking an array of vectors and transforming them by a 4x4 matrix, for example. Would any of those be useful for your collection or for Lazarus in general? There's still some room for improvement though, since I don't take advantage of memory alignment.
I know the topic is mostly on compiler support and optimisation, but is it worth having some standardised vector and matrix functions that make use of SSE and AVX if available? I know FPC has some 2, 3 and 4-component vector and matrix functions, but they're very generalised and not particularly fast when dealing with large datasets."
Yes, it's welcome, your code could help a lot (and not only me, I'm sure). Perhaps, if you agree, I can try to implement your functions in GLScene and my own project (a new GLScene, with its own fast bitmap management, which will support OpenGL Core and Vulkan 8) )
Cheers
Ok getting near to beer o'clock, so here is the latest it works in win32 win64, linux32 linux64.
I have cleaned up the starting defs but not the return defs; 32-bit was not that bad and has not added too.... much crud, not as much as I thought it might.
The only numbers that are wrong are those that have always been wrong i.e. Perpendicular, looks like it needs negating but I will leave that to you Jerome to make a call on that.
Peter
Edit: put the right file there :-[
Quote
"The only numbers that are wrong are those that have always been wrong i.e. Perpendicular, looks like it needs negating but I will leave that to you Jerome to make a call on that."

Corrected, simple operand inversion in the last operation (subps).
|        | RELEASE | RELEASE_SSE | RELEASE_SSE3 | RELEASE_SSE4 | RELEASE_AVX | RELEASE_AVX2 |
| NATIVE | 4,769   | 4,500       | 4,575        | 4,563        | 4,793       | 4,759        |
| SSE    | 2,042   | 2,038       | 2,038        | 2,038        | 1,997       | 1,973        |
| SSE 3  | 2,009   | 1,998       | 1,999        | 2,005        | 1,992       | 1,977        |
| SSE 4  | 1,991   | 1,982       | 1,987        | 1,979        | 1,965       | 1,961        |
| AVX    | 2,308   | 2,298       | 2,293        | 2,291        | 2,200       | 2,181        |
vectormath_sse_imp.inc(9,3) Error: This function's result location cannot be encoded directly in a single operand when "nostackframe" is used
vectormath_sse_imp.inc(9,3) Error: Asm: [movups reg64,xmmreg] invalid combination of opcode and operands
MOVUPS XMM0, [A]
MOVUPS XMM1, [B]
SUBPS  XMM0, XMM1
MOVUPS [RESULT], XMM0   <-- It does not like this line
Works fine in win64 will try to compare some asm output and see what it is trying to do.
Ok, more testing, and this time not good news :( It worried me that we might be getting correct results only because of registers being in a certain state from the last calc, so I added
StartTimer;
For cnt := 1 to 20000000 do
begin
  v3 := v1 + v2;
  v4 := v1 + v1;
end;
StopTimer;
With StringGrid1 do
begin
  Cells[1,1] := v3.ToString + v4.ToString;
  Cells[2,1] := WriteTimer;
End;
# Var A located in register rdx
# Var B located in register r8
# Var $result located in register rcx
# Var $self located in register rcx
# Var $result located in register xmm0
Add the -a<X> compiler option to the project and check the generated .s file; it will give you some clues about how registers and the stack are used with linux64.
# Temp -16,16 allocated
# Var $result located at rbp-16, size=OS_128
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
Quote
"I'll have a look, we already had to do that for the consts as we hit bad alignments, it may change the behaviour of the calls, interesting question ;D"

It's a thought, because if such memory alignment is forced (such vectors should be aligned that way anyway, because they're 16 bytes in length overall), you can potentially replace your MOVUPS calls with MOVAPS calls for an extra speed gain.
so I do not know how Jerome is getting good returns in Win10?
Request to BeanzMaster - as well as timing checks, can you also implement some verification in your benchmark program? I have a feeling that some functions return incorrect results. Failing that, I can possibly design something a little more in-depth once I've finished my current task.
I've tested 32-bit with Lazarus 1.8rc3, but some errors occurred:
1st, the clamp functions work but raise a SIGSEGV just after.
2nd, the functions with a Single result: the result is stored in the ST register; I tried to set it with the FSTP instruction, but without success.
That is usually a sign of stack corruption, such as moving a whole 128-bit xmm reg when there is only space for 32 or 64 bits. Usually I have found that if the variable is on the stack in 32-bit, the stack contains a pointer and not the variable, so you need a

mov eax, stackedvar
mov [eax], xmm reg
.globl GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE
GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE:
# Register ebp allocated
# [258] Asm
pushl %ebp
movl %esp,%ebp
leal -4(%esp),%esp
# Var A located in register edx
# Var $self located in register eax
# Temp -4,4 allocated
# Var $result located at ebp-4, size=OS_F32
# Register eax,ecx,edx allocated
# Var A located in register eax
# Var B located in register edx
# Var $result located in register ecx
# Register eax,ecx,edx allocated
Those errors are boring >:D So perhaps making an external object library with MASM or NASM/YASM would be better than using the internal asm???
mov ecx, RESULT
mov [ecx], xmm0
not working : vectormath_vector_win32_sse_imp.inc(269,5) Error: Asm: [mov mem??,xmmreg] invalid combination of opcode and operands
and this is what I have in the .s file:

Quote
.globl GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE
GLZVECTORMATH$_$TGLZVECTOR4F_$__$$_DISTANCE$TGLZVECTOR4F$$SINGLE:
# Register ebp allocated
# [258] Asm
pushl %ebp
movl %esp,%ebp
leal -4(%esp),%esp
# Var A located in register edx
# Var $self located in register eax
# Temp -4,4 allocated
# Var $result located at ebp-4, size=OS_F32
# Register eax,ecx,edx allocated
Another example; this does not work either:
Quote
"Those errors are boring >:D So perhaps making an external object library with MASM or NASM/YASM would be better than using the internal asm???"
You still have to conform to Pascal calling conventions, so there's not much gain in doing so; you'd probably spend more time trying to get your params to your lib correctly.
I am writing some test cases; mark what is bad, carry on coding, and I'll try to sort out the 'annoying' errors.
As for this I have got this in unix64 should work for win64 I think from previous testing.
class operator TGLZVector4f.+(constref A: TGLZVector4f; constref B: Single): TGLZVector4f; assembler; nostackframe; register;
asm
  movaps  xmm0, [A]
  movss   xmm1, [B]
  shufps  xmm1, xmm1, $00
  addps   xmm0, xmm1
  movhlps xmm1, xmm0
end;
Re comparison operators, in the pure pascal code as I read it every element must pass the comparison test, that was not happening in the case that one element failed in the asm. So it passed my tests with the following which also avoids branching. Comments please before I change a lot of code.
cmpps xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL
movmskps eax, xmm0   // copies a 4 bit mask to eax
xor eax, $f          // only 1111 should be correct for anded compares
setz al              // true if zero
Edit 1: Negate fails tests; that mask is doing a multiply by -1, not setting all items negative as the Pascal code does. Though I suspect the Pascal code is wrong. Never had a use for setting all negative, whereas *-1 is vector reverse.
I'm having some difficulty compiling the latest version of the unit from BeanzMaster - the GLZTypes unit has an awkward dependency on GLZVectorMath and others, since TGLZVector and TGLZVector2i are not defined. It's easy enough to fix, but it means that GLZTypes is not self-contained.
For negate you are right; under 64-bit the result is wrong. Normally, in our sample, the sign of the Y value should change. Under 32-bit the function returns the correct result.
For X * -1 is equal to 0 - X, so I've chosen the latter; Sub is normally faster than Mul.
I've tested; it works, but the result is wrong:
if v1 = v2 then Cells[1,25] := 'TRUE' else Cells[1,25] := 'FALSE';
the ZERO flag is not set under 64-bit so it always returns TRUE, but with 32-bit your function is OK and returns the right result
Vector Reflects do not match : Native = (X: 171.54222 ,Y: 677.06671 ,Z: 489.74261 ,W: 107.84930) --> SSE = (X: 171.54224 ,Y: 677.06677 ,Z: 489.74265 ,W: 107.84931)
As you see, the result is very, very near. For AngleBetween, SSE returns me NaN :'(
So are you saying the Native Pascal code is wrong?
Yes, really; for me the native code should be like Invert or class operator -.
These two pieces of code give me the same result now:
procedure TNativeGLZVector4f.pNegate;
begin
  //if Self.X > 0 then Self.X := -Self.X;
  //if Self.Y > 0 then Self.Y := -Self.Y;
  //if Self.Z > 0 then Self.Z := -Self.Z;
  //if Self.W > 0 then Self.W := -Self.W;
end;

procedure TGLZVector4f.pNegate; assembler; nostackframe; register;
asm
  movaps xmm0, [RCX]
  xorps  xmm0, [RIP+cSSE_MASK_NEGATE]
  movaps [RCX], xmm0
End;
but I'm a little disturbed by this, because with my previous test the result matched the native, like we see here http://forum.lazarus.freepascal.org/index.php/topic,32741.msg267332.html#msg267332 (http://forum.lazarus.freepascal.org/index.php/topic,32741.msg267332.html#msg267332)
on the 2nd screenshot (on the 1st screenshot the results are different) :o so now I can't say exactly what the real correct result is
I've also synchronized the EQUAL function with your UNIX64_SSE, and this works (not tested with SSE4, but it should work too):
class operator TGLZVector4f.= (constref A, B: TGLZVector4f): boolean; assembler; nostackframe; register;
asm
  movaps xmm1, [A]
  movaps xmm0, [B]
{$IFDEF USE_ASM_SSE_4}
  cmpps xmm0, xmm1, cSSE_OPERATOR_EQUAL
  ptest xmm0, xmm1
  jnz @no_differences
  mov [RESULT], FALSE
  jmp @END_SSE
{$ELSE}
  cmpps xmm0, xmm1, cSSE_OPERATOR_EQUAL   // Yes: $FFFFFFFF, No: $00000000 ; 0 = Operator Equal
  movmskps eax, xmm0
  test eax, eax
  setnz al
{$ENDIF}
end;
Tomorrow, if I have the time, I'll make the test unit with Win32.
Many thanks and great work Peter, as always 8-)
Vector Reflects do not match : Native = (X: 171.54222 ,Y: 677.06671 ,Z: 489.74261 ,W: 107.84930) --> SSE = (X: 171.54224 ,Y: 677.06677 ,Z: 489.74265 ,W: 107.84931)
Ok I just tested the code I provided before on win64 and it works for me.
cmpps xmm0, xmm1, cSSE_OPERATOR_LESS_OR_EQUAL   // Yes: $FFFFFFFF, No: $00000000 ; 2 = Operator Less or Equal
movmskps eax, xmm0
xor eax, $F
setz al
What gets returned in EAX is a mask of matched tests. So you could get 1010 in EAX which means x and z are less or equal but y and w are greater.
Though the test runner is so SLOW in windows.
22:58:04 - Running All Tests
22:58:16 - Number of executed tests: 61 Time elapsed: 00:00:12.436
compared to linux
12:13:25 - Running All Tests
12:13:26 - Number of executed tests: 61 Time elapsed: 00:00:00.149
But always 1 failure --> Vector AngleBetween does not match: 1.932 --> NaN
-CfSSE3
-Sv
-dUSE_ASM
Just one thing I'm not understanding well: your trick with "movhlps xmm1,xmm0". It's an issue with the stack, but something escapes me. Can you explain it to me again?
That's a little confusing with Linux, because the way it's behaving implies that it's splitting the 128-bit into two, classing the lower half as SSE and the upper half as SSEUP (see pages 15-17 here: http://refspecs.linuxbase.org/elf/x86_64-abi-0.21.pdf ), but then converting SSEUP to SSE because it thinks it isn't preceded by an SSE argument (which it does... the lower two floats). Maybe my interpretation is wrong, but it shouldn't need to split it across 2 registers like that. Can someone with more experience of the Linux ABI shed some light on that?
EDIT: I've also tried to make the compare with the SSE4 PTEST instruction, but I can't see how to do it without a jump.
...
Whether you need a jump or not depends on the code. If you just need to set a result based on the zero flag, then you can use SETZ or SETNZ. There's no straight answer.
...
This may be a case for simple and quick routines vs pedantic routines; allow choice. TBH, personally I would always go for simple and quick, and test for edge cases before main calcs where it is needed.
Ok, this is all to do with return conventions in linux 64 (SysV x86_64 to be exact), just as win64 has its 4 registers, rest on stack, etc.
...
When we use nostackframe, the above postamble does not occur. Therefore we got good values for x and y [low xmm0] but garbage for z and w; the calling convention was taking whatever was in low xmm1.
So using a movhlps xmm1,xmm0 as the last instruction, after whatever you would do if you coded to leave the result in xmm0, then ensures the Unix ABI is conformed to, and we get the right values back ;)
Phew.. long post, I hope this makes sense to you Jerome.
Test | Native | Assembler |
Vector Op Add Vector | 0.239001 | 0.066999 |
Vector Op Add Single | 0.553000 | 0.070000 |
Add Vector To Self | 0.105000 | 0.101000 |
Add Single To Self | 0.101000 | 0.099000 |
Test | Native | Assembler |
Vector Length | 0.086000 | 0.233000 |
Test | Native | Assembler |
Vector Length | 0.086000 | 0.101000 |
Test | Native | Assembler |
Vector Length | 0.081000 | 0.095000 |
Test | Native | Assembler |
Vector Length | 0.083000 | 0.081000 |
| Test | Native | Assembler | Gain in % |
| Vector Op Subtract Vector | 0.114000 | 0.048000 | 57.895 % |
| Vector Op Add Vector | 0.118000 | 0.050000 | 57.627 % |
| Vector Op Multiply Vector | 0.116000 | 0.049000 | 57.758 % |
| Vector Op Divide Vector | 0.136000 | 0.055000 | 59.559 % |
| Vector Op Add Single | 0.118000 | 0.050000 | 57.627 % |
| Vector Op Subtract Single | 0.114000 | 0.051000 | 55.263 % |
| Vector Op Multiply Single | 0.118000 | 0.051000 | 56.780 % |
| Vector Op Divide Single | 0.136000 | 0.055000 | 59.559 % |
| Vector Op Negative | 0.119000 | 0.048000 | 59.664 % |
| Vector Op Equal | 0.047000 | 0.042000 | 10.637 % |
| Vector Op GT or Equal | 0.049000 | 0.050000 | -2.042 % |
| Vector Op LT or Equal | 0.047000 | 0.043000 | 8.511 % |
| Vector Op Greater | 0.051000 | 0.050000 | 1.960 % |
| Vector Op Less | 0.048000 | 0.042000 | 12.501 % |
| Vector Op Not Equal | 0.120000 | 0.050000 | 58.334 % |
| Add Vector To Self | 0.090000 | 0.088000 | 2.222 % |
| Sub Vector from Self | 0.088000 | 0.088000 | 0.000 % |
| Multiply Vector with Self | 0.088000 | 0.090000 | -2.273 % |
| Divide Self by Vector | 0.105000 | 0.107000 | -1.905 % |
| Add Single To Self | 0.091000 | 0.090000 | 1.098 % |
| Sub Single from Self | 0.088000 | 0.088000 | 0.000 % |
| Multiply Self with single | 0.088000 | 0.089000 | -1.137 % |
| Divide Self by single | 0.105000 | 0.105000 | 0.001 % |
| Invert Self | 0.068000 | 0.066999 | 1.472 % |
| Negate Self | 0.068000 | 0.066999 | 1.472 % |
| Self Abs | 0.067000 | 0.068000 | -1.493 % |
| Self Normalize | 0.410000 | 0.339000 | 17.317 % |
| Self Divideby2 | 0.113000 | 0.093000 | 17.699 % |
| Self CrossProduct Vector | 0.275000 | 0.188000 | 31.636 % |
| Self Min Vector | 0.078000 | 0.068000 | 12.821 % |
| Self Min Single | 0.069000 | 0.068000 | 1.450 % |
| Self Max Vector | 0.080000 | 0.069000 | 13.749 % |
| Self Max Single | 0.067000 | 0.069000 | -2.985 % |
For AVX, these are the right functions for Distance and Length; your code is based on SSE3 instructions, so it's slower.
Anyway, my priority is to finish the tests. I see you have added one of the features I have planned (Gain in %); the other I want to add is to report the accuracy to however many decimal places. Probably more important with larger routines than we are doing now.
I have a question about the comparison operators. Should they return true only if all of the elements are true? (e.g. Input1 = Input2 only if all the elements match, and Input1 < Input2 only if all the elements in Input1 are smaller than Input2) Or are they designed to return true if at least one of the elements are equal, for example.
It is now getting a bit complicated to use the forum as a source sharing device, do you have any sort of github or other source server where collaboration would be a little easier?
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$TXMVECTOR:
.Lc128:
.Ll314:
# [2903] g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
pushl %ebp
.Lc130:
.Lc131:
movl %esp,%ebp
.Lc132:
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
.Ll315:
# [2907] movd xmm0,dword ptr [C3]
movd 12(%ebp),%xmm0
.Ll316:
# [2908] movd xmm1,dword ptr[C2]
movd (%ecx),%xmm1
.Ll317:
# [2909] movd xmm2,dword ptr[C1]
movd (%edx),%xmm2
.Ll318:
# [2910] movd xmm3,dword ptr[C0]
movd (%eax),%xmm3
.Ll319:
# [2911] punpckldq xmm3,xmm1
punpckldq %xmm1,%xmm3
.Ll320:
# [2912] punpckldq xmm2,xmm0
punpckldq %xmm0,%xmm2
.Ll321:
# [2913] punpckldq xmm3,xmm2 // XMM3 = vTemp
punpckldq %xmm2,%xmm3
.Ll322:
# [2915] PAND XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
pand TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll323:
# [2917] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
pcmpeqd TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll324:
# [2919] PAND XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
pand TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2920] MOVUPS TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
movups %xmm3,8(%ebp)
.Ll326:
# [2921] end;
leave
ret $8
.Lc129:
.Lt14:
.Ll327:
As you see, the output is the same. But most of all, the value in XMM0 is now valid.

DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$TXMVECTOR:
.Lc128:
.Ll314:
# [2903] g_vMask1: TXMVECTORU32 = (u: (1, 1, 1, 1));
pushl %ebp
.Lc130:
.Lc131:
movl %esp,%ebp
.Lc132:
# Var C0 located in register eax
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
.Ll315:
# [2907] movd xmm0,dword ptr [C3]
movd 12(%ebp),%xmm0
.Ll316:
# [2908] movd xmm1,dword ptr[C2]
movd (%ecx),%xmm1
.Ll317:
# [2909] movd xmm2,dword ptr[C1]
movd (%edx),%xmm2
.Ll318:
# [2910] movd xmm3,dword ptr[C0]
movd (%eax),%xmm3
.Ll319:
# [2911] punpckldq xmm3,xmm1
punpckldq %xmm1,%xmm3
.Ll320:
# [2912] punpckldq xmm2,xmm0
punpckldq %xmm0,%xmm2
.Ll321:
# [2913] punpckldq xmm3,xmm2 // XMM3 = vTemp
punpckldq %xmm2,%xmm3
.Ll322:
# [2915] PAND XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
pand TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll323:
# [2917] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
pcmpeqd TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$crcD1D7FBA5_$$_G_VMASK1,%xmm3
.Ll324:
# [2919] PAND XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
pand TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2920] MOVUPS TXMVECTOR([result]), XMM3// return _mm_castsi128_ps(vTemp);
movups %xmm3,8(%ebp)
.Ll326:
# [2921] end;
leave
ret $8
.Lc129:
.Lt14:
.Ll327:
DIRECTX.MATH_$$_XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR:
.Lc128:
.Ll314:
# [2925] begin
pushl %ebp
.Lc130:
.Lc131:
movl %esp,%ebp
.Lc132:
leal -80(%esp),%esp
# Var C0 located at ebp-16, size=OS_32
# Var C1 located at ebp-32, size=OS_32
# Var C2 located at ebp-48, size=OS_32
# Var C3 located at ebp+8, size=OS_32
# Var $result located at ebp-64, size=OS_32
# Var x located at ebp-80, size=OS_NO
movl %eax,-16(%ebp)
movl %edx,-32(%ebp)
movl %ecx,-48(%ebp)
# CPU PENTIUM
.Ll315:
# [2929] movd xmm0, [C3]
movd 8(%ebp),%xmm0
.Ll316:
# [2930] movd xmm1, [C2]
movd -48(%ebp),%xmm1
.Ll317:
# [2931] movd xmm2, [C1]
movd -32(%ebp),%xmm2
.Ll318:
# [2932] movd xmm3, [C0]
movd -16(%ebp),%xmm3
.Ll319:
# [2933] punpckldq xmm3,xmm1
punpckldq %xmm1,%xmm3
.Ll320:
# [2934] punpckldq xmm2,xmm0
punpckldq %xmm0,%xmm2
.Ll321:
# [2935] punpckldq xmm3,xmm2 // XMM3 = vTemp
punpckldq %xmm2,%xmm3
.Ll322:
# [2937] PAND XMM3, [g_vMask1] // vTemp = _mm_and_si128(vTemp,g_vMask1);
pand TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR_$$_G_VMASK1,%xmm3
.Ll323:
# [2939] PCMPEQD XMM3, [g_vMask1] // vTemp = _mm_cmpeq_epi32(vTemp,g_vMask1);
pcmpeqd TC_$DIRECTX.MATH$_$XMVECTORSETBINARYCONSTANT$LONGWORD$LONGWORD$LONGWORD$LONGWORD$$PXMVECTOR_$$_G_VMASK1,%xmm3
.Ll324:
# [2941] PAND XMM3, [g_XMOne] // vTemp = _mm_and_si128(vTemp,g_XMOne);
pand TC_$DIRECTX.MATH_$$_G_XMONE,%xmm3
.Ll325:
# [2942] MOVUPS [x], XMM3 // return _mm_castsi128_ps(vTemp);
movups %xmm3,-80(%ebp)
# CPU PENTIUM
.Ll326:
# [2944] result:=@x;
leal -80(%ebp),%eax
movl %eax,-64(%ebp)
.Ll327:
# [2945] end;
movl %ebp,%esp
popl %ebp
ret $4
.Lc129:
.Lt14:
.Ll328:
DIRECTX.MATH_$$_XMVECTORSET$SINGLE$SINGLE$SINGLE$SINGLE$$TXMVECTOR:
.Lc261:
.Ll822:
# [5426] asm
pushl %ebp
.Lc263:
.Lc264:
movl %esp,%ebp
.Lc265:
# Var $result located in register eax
# Var x located at ebp+20, size=OS_F32
# Var y located at ebp+16, size=OS_F32
# Var z located at ebp+12, size=OS_F32
# Var w located at ebp+8, size=OS_F32
.Ll823:
# [5427] MOVD XMM0, [w]
movd 8(%ebp),%xmm0
.Ll824:
# [5428] MOVD XMM1, [z]
movd 12(%ebp),%xmm1
.Ll825:
# [5429] MOVD XMM2, [y]
movd 16(%ebp),%xmm2
.Ll826:
# [5430] MOVD XMM3, [x]
movd 20(%ebp),%xmm3
.Ll827:
# [5431] PUNPCKLDQ XMM3,XMM1
punpckldq %xmm1,%xmm3
.Ll828:
# [5432] PUNPCKLDQ XMM2,XMM0
punpckldq %xmm0,%xmm2
.Ll829:
# [5433] PUNPCKLDQ XMM3,XMM2
punpckldq %xmm2,%xmm3
.Ll830:
# [5434] MOVUPS [result], XMM3 // _mm_set_ps( w, z, y, x );
movups %xmm3,(%eax)
.Ll831:
# [5435] end;
leave
ret $16
# Var C0 located in register eax --> not working directly; the address of the result is on the stack and must be loaded into a register
# Var C1 located in register edx
# Var C2 located in register ecx
# Var C3 located at ebp+12, size=OS_32
# Var $result located at ebp+8, size=OS_32
# Var $result located in register eax --> working, because the address of the result is already in a register
# Var C0 located at ebp+20, size=OS_F32
# Var C1 located at ebp+16, size=OS_F32
# Var C2 located at ebp+12, size=OS_F32
# Var C3 located at ebp+8, size=OS_F32
Note that I also fixed the System V ABI support to use the SSE registers properly, so the code that passes the result into the low halves of XMM0 and XMM1 will have to be reworked a bit.
What internal error are you getting? I'll see if I can track it down. "vectorcall" should be ignored on *nix64.
It will only pass parameters into the full XMM0 register etc if they are aligned to 16-byte boundaries.
Sorry that it's not quite going to plan.

No problem there, I'm very willing to help in any way I can.
I can explain why the last example is getting passed as a pointer... I only posted simplified code; it was actually surrounded by the usual codealign directives etc.
To see which parameters are passed in which registers, you'll have to compile a Pascal function that uses vectorcall with a number of vector-like parameters and see how they interact.
I'm not sure if I can do anything about Self always being passed by reference, though.
I stand corrected on one thing... the System V ABI does support unaligned vectors, unlike vectorcall. I'll see if I can correct that and hence fix your library!
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_plus$tglzvector2d$tglzvector2d$$tglzvector2d
.balign 16,0x90
.globl GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D
.type GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D:
.Lc213:
# Var A located in register rdi
# Var B located in register rsi
# [vectormath_vector2d_unix64_sse_imp.inc]
# [4] asm
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [5] movapd xmm0, [A]
movapd (%rdi),%xmm0
# [6] movapd xmm1, [B]
movapd (%rsi),%xmm1
# [7] addpd xmm0, xmm1
addpd %xmm1,%xmm0
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [8] end;
ret
# Register xmm0,xmm1 released
.Lc214:
.Le94:
.size GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D, .Le94 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D
GLZVECTORMATH$_$TGLZVECTOR2F_$__$$_plus$TGLZVECTOR2F$TGLZVECTOR2F$$TGLZVECTOR2F:
.Lc102:
# Var A located in register rdi
# Var B located in register rsi
# Var $result located in register xmm0
# [vectormath_vector2f_unix64_sse_imp.inc]
# [4] asm
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [5] movq xmm0, [A]
movq (%rdi),%xmm0
# [6] movq xmm1, [B]
movq (%rsi),%xmm1
# [7] addps xmm0, xmm1
addps %xmm1,%xmm0
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [8] end;
ret
# Register xmm0 released
I don't see any fault with the code in this instance, or am I missing something?

I checked my code, all seems OK.
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_length$$double
.balign 16,0x90
.globl GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE
.type GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE:
.Lc247:
# Var $self located in register rdi
# Var $result located in register xmm0
# [181] asm
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [182] movapd xmm0, [RDI]
movapd (%rdi),%xmm0
# [183] mulpd xmm0, xmm0
mulpd %xmm0,%xmm0
# [184] haddpd xmm0, xmm0
haddpd %xmm0,%xmm0
# [187] sqrtsd xmm0, xmm0
sqrtsd %xmm0,%xmm0
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [188] end;
ret
# Register xmm0 released
.Lc248:
.Le111:
.size GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE, .Le111 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_round$$tglzvector2i
.balign 16,0x90
.globl GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I
.type GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I:
.Lc257:
# Var $self located in register rdi
# Var $result located in register rax
# [234] asm
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [236] movapd xmm0, [RDI]
movapd (%rdi),%xmm0
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_normalize$$tglzvector2d
.balign 16,0x90
.globl GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D
.type GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D:
.Lc255:
# Var $self located in register rdi
# [223] asm
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [224] movapd xmm2, [RDI]
movapd (%rdi),%xmm2
# [225] movapd xmm0, xmm2
movapd %xmm2,%xmm0
# [226] mulpd xmm2, xmm2
mulpd %xmm2,%xmm2
# [227] haddpd xmm2, xmm2
haddpd %xmm2,%xmm2
# [228] sqrtpd xmm2, xmm2
sqrtpd %xmm2,%xmm2
# [229] divpd xmm0, xmm2
divpd %xmm2,%xmm0
# Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [230] end;
ret
# Register xmm0,xmm1 released
.Lc256:
.Le115:
.size GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D, .Le115 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D
In other news, I have finally fixed the "vectorcall" bug where Self was put into RCX instead of RDI on Linux; the Windows-only calling convention is now silently ignored there. Patch is here: https://bugs.freepascal.org/view.php?id=33542 - sorry it took so long, especially for a surprisingly simple fix.
Note: In FPC 3.0.4, it is definitely split between XMM0 and XMM1 for Linux 64-bit. When FPC 3.1.1 is released, the result for a vector of 2 doubles will likely just be contained within XMM0 and hence your code will require updating.