* * *

Author Topic: AVX and SSE support question  (Read 42005 times)

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #195 on: February 16, 2018, 06:26:23 pm »
Quote
Sorry that it's not quite going to plan.
No problem there, I'm very willing to help any way I can.

Trying to get some tests for you, I started really simple:

Code: Pascal
program vectorcall_pd_test1;

{$IFNDEF CPUX86_64}
  {$FATAL This test program can only be compiled on Windows or Linux 64-bit with an Intel processor }
{$ENDIF}
{$MODESWITCH ADVANCEDRECORDS}
{$ASMMODE Intel}
type
  { TM128 }
  {$push}
  {$CODEALIGN RECORDMIN=16}
  {$PACKRECORDS C}
  TM128 = record
    public
    class operator +(A, B: TM128): TM128; vectorcall;
    case Byte of
      0: (M128_F32: array[0..3] of Single);
      1: (M128_F64: array[0..1] of Double);
  end;
  {$pop}

{ TM128 }

class operator TM128.+(A, B: TM128): TM128; vectorcall; assembler; nostackframe;
asm
  addps xmm0, xmm1
end;

var
  xm1, xm2, xm3: TM128;

begin
  xm3 := xm1 + xm2;

end.

And the assembler produced was as good as it could get, with the exception of movdqa  %xmm0,%xmm0

Code: Pascal
.section .text.n_p$vectorcall_pd_test1$_$tm128_$__$$_plus$tm128$tm128$$tm128
        .balign 16,0x90
.globl  P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128
        .type   P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128,@function
P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128:
.Lc1:
# [vectorcall_hva_test2.pas]
# [29] asm
#  CPU ATHLON64
.Ll1:
# [30] addps xmm0, xmm1
        addps   %xmm1,%xmm0
#  CPU ATHLON64
.Ll2:
# [31] end;
        ret
        # Register xmm0 released
.Lc2:
.Lt2:
.Le0:

# [37] xm3 := xm1 + xm2;
        movdqa  U_$P$VECTORCALL_PD_TEST1_$$_XM2(%rip),%xmm1
        # Register xmm0 allocated
        movdqa  U_$P$VECTORCALL_PD_TEST1_$$_XM1(%rip),%xmm0
        call    P$VECTORCALL_PD_TEST1$_$TM128_$__$$_plus$TM128$TM128$$TM128@PLT
        movdqa  %xmm0,%xmm0
        movaps  %xmm0,U_$P$VECTORCALL_PD_TEST1_$$_XM3(%rip)
        # Register xmm0 released

So it looks like I will have my work cut out trying to get a simple test that shows the errors. I will let you know when I have something simple enough to submit as a bug.
 Note, though, that there is no reference to Self here.
« Last Edit: February 16, 2018, 07:15:46 pm by dicepd »
Lazarus 1.8rc5 Win64 / Linux gtk2 64 / FreeBSD qt4

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #196 on: February 16, 2018, 07:05:01 pm »
Next, a simple test to see what happens to Self.
Code: Pascal
    function Add(A: TM128): TM128; vectorcall;

Code: Pascal
# [33] xm3 := xm1.Add(xm2);
        movdqa  U_$P$VECTORCALL_PD_TEST1_$$_XM2(%rip),%xmm0
        leaq    U_$P$VECTORCALL_PD_TEST1_$$_XM1(%rip),%rcx
        movaps  %xmm0,%xmm1
        call    P$VECTORCALL_PD_TEST1$_$TM128_$__$$_ADD$TM128$$TM128@PLT
        movdqa  %xmm0,%xmm0
        movaps  %xmm0,U_$P$VECTORCALL_PD_TEST1_$$_XM3(%rip)

So it would appear that Self is passed as a pointer in RCX. However, the Unix convention is that Self is passed as a pointer in RDI.
The first parameter is passed in xmm1, although it goes via xmm0 for some reason.
There appears to be a redundant movdqa after the call.
There is still no information in the body of the function to indicate which parameters are passed in which registers; this was ascertained only by looking at the call site, not from the function definition where we usually find this information.
The return value is good, as the 128 bits are not split into two 64-bit halves as in FPC 3.0.4.

I hope this helps a bit.
« Last Edit: February 16, 2018, 07:12:58 pm by dicepd »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #197 on: February 16, 2018, 08:57:01 pm »
Small progress on working through possible issues with our code base.

If I declare it like this:

Code: Pascal
  TVector4fType = packed array[0..3] of Single;
  TM128 = record
    public
    class operator +(A, B: TM128): TM128; vectorcall;
    case Byte of
      0: (M128_F32: TVector4fType);
  end;

It is not recognised as a vector, and the parameters are passed as pointers in standard registers. The packed keyword makes no difference whether it is present or not. The generated code then expects 4 singles in xmm0-3, which it uses to populate the result via a series of 4 movss.

This is possibly why it is getting confused by our codebase.

CuriousKit

  • Jr. Member
  • **
  • Posts: 74
Re: AVX and SSE support question
« Reply #198 on: February 17, 2018, 01:40:50 am »
I can explain why the last example is getting passed as a pointer... there's nothing to restrict it to a 16-byte boundary, so the compiler has to assume that every variable of that type is unaligned even if it does happen to fall on a 16-byte boundary, hence it's treated as a complex record type.  That's intended behaviour.

I noticed the "movdqa %xmm0,%xmm0" myself during development and wasn't sure what was causing it, but when I submit my next batch of improvements to the peephole optimizer, I'll look out for that one (it already removes references to "mov %eax,%eax", for example).  I'll have to double-check though that the matching ymm or zmm isn't being used, because "movdqa %xmm0,%xmm0" has the effect of zeroing the upper 128 bits of %ymm0 and the upper 384 bits of %zmm0, and hence isn't a null operation.

Passing Self into RCX when it should be RDI is indeed a bug, and I would recommend posting this as an actual bug report.  I'm not sure if I can do anything about Self always being passed by reference though - I think the compiler treats it like an object - what's the generated assembly for a record containing a single integer field?  Moving the 2nd parameter into XMM0 and then into XMM1 looks like a compiler inefficiency in regards to how it allocates temporary registers (do the debug messages say anything about the registers being allocated and released?), and can either be corrected in the peephole optimizer (which does similar things already) or with more advanced Data Flow Analysis (something I'm working on which I named the "Deep Optimizer" before I discovered the official term) -  such a feature will also help to correct the mixing of 'movdqa' and 'movaps', since using the wrong one will incur a performance penalty (you should only use 'movdqa' if you're using the relevant registers for integer operations).

To note what parameters are passed into what registers, you'll have to compile a Pascal function that uses vectorcall with a number of vector-like parameters and see how they interact.  Vectorcall dictates that XMM0 to XMM5 are used for vector/float inputs, HFAs and HVAs, although if there aren't enough free registers to fully contain a homogeneous aggregate (basically, an array of 1 to 4 aligned vectors or floats of the same type), it is wholly passed on the stack, but any vector/float parameters that follow will go into the registers that are left.  Return values are passed through XMM0 to XMM3, although XMM1 to XMM3 are only used if the return type is a homogeneous aggregate.  Integer parameters are passed in the same way as the regular Win64 calling convention dictates (or on Win32, following the rules of 'fastcall').
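As a rough illustration of the experiment described above, here is a minimal sketch of such a probe. The names `Probe`, `V1` to `V3`, `R` and the program name are made up for this example; it assumes an FPC trunk build that accepts `vectorcall` and reuses the alignment directives from the TM128 test earlier in the thread. Compiling with something like `fpc -al -ar` should keep the .s file with source lines and register allocation comments, from which the mapping of parameters to registers can be read.

```pascal
program vectorcall_probe_sketch;

{$MODE OBJFPC}

type
  {$push}
  {$CODEALIGN RECORDMIN=16}
  {$PACKRECORDS C}
  { Same layout as the TM128 test type: one aligned 128-bit vector }
  TM128 = record
    case Byte of
      0: (M128_F32: array[0..3] of Single);
      1: (M128_F64: array[0..1] of Double);
  end;
  {$pop}

{ Hypothetical probe: mixes an integer, a scalar float and three vectors
  so the generated assembly shows where each kind of parameter lands. }
function Probe(I: Integer; A, B: TM128; Scale: Single; C: TM128): TM128; vectorcall;
var
  N: Integer;
begin
  for N := 0 to 3 do
    Result.M128_F32[N] :=
      (A.M128_F32[N] + B.M128_F32[N]) * Scale + C.M128_F32[N] + I;
end;

var
  V1, V2, V3, R: TM128;
  N: Integer;

begin
  { Fill the inputs with recognisable values }
  for N := 0 to 3 do
  begin
    V1.M128_F32[N] := N;
    V2.M128_F32[N] := 2 * N;
    V3.M128_F32[N] := 1.0;
  end;

  { The call site is where the register assignment becomes visible }
  R := Probe(10, V1, V2, 0.5, V3);
  WriteLn(R.M128_F32[0]:0:2, ' ', R.M128_F32[3]:0:2);
end.
```

The body is plain Pascal on purpose: the interesting part is the code the compiler emits around the call and at function entry, not the arithmetic. On Linux the output would reflect whatever the compiler maps `vectorcall` to there, so the register assignment may differ from Win64.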
« Last Edit: February 17, 2018, 01:52:50 am by CuriousKit »

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #199 on: February 17, 2018, 06:57:09 am »
Calling convention bug reported.
https://bugs.freepascal.org/view.php?id=33184

Quote
I can explain why the last example is getting passed as a pointer...
I only posted simple code, this was surrounded by the usual codealign etc.
I was assuming that the bug lay in the possibility that the type declared was not of array of 4 and the compiler did not look at the underlying type to see if it was typecast compatible with array of 4 singles.

One thing I noted when I roughed what I had learnt from these tests into the main code base was that the use of {$PACKRECORDS C} in your tests breaks the alignment of any consts declared using this type. Removing {$PACKRECORDS C} fixed a whole slew of problems in code such as
Code: Pascal
movaps    xmm1, XMMWORD PTR [RIP+cOneVector4f]
which would segfault with it in (the usual indication that a non-aligned memory access had occurred).
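One way to catch this kind of misalignment before `movaps` faults is to check the address of the constant at start-up. This is only a minimal sketch: the type `TVec4Sketch`, the constant `cOneSketch` and the helper `IsAligned16` are hypothetical names, not taken from the code base discussed here, and `{$CODEALIGN CONSTMIN=16}` is used on the assumption that it raises the minimum alignment of typed constants.

```pascal
program align_check_sketch;

{$MODE OBJFPC}

type
  {$push}
  {$CODEALIGN RECORDMIN=16}
  { Hypothetical 4 x Single vector type, for illustration only }
  TVec4Sketch = record
    V: array[0..3] of Single;
  end;
  {$pop}

const
  {$push}
  {$CODEALIGN CONSTMIN=16}
  { Assumption: CONSTMIN requests at least 16-byte alignment for this constant }
  cOneSketch: TVec4Sketch = (V: (1.0, 1.0, 1.0, 1.0));
  {$pop}

{ Returns True when P sits on a 16-byte boundary, as movaps/movapd require }
function IsAligned16(P: Pointer): Boolean;
begin
  Result := (PtrUInt(P) and 15) = 0;
end;

begin
  if IsAligned16(@cOneSketch) then
    WriteLn('cOneSketch is 16-byte aligned')
  else
    WriteLn('cOneSketch is NOT 16-byte aligned; movaps on it would fault');
end.
```

Running the same check with and without `{$PACKRECORDS C}` around the type declaration should show whether that directive is what changes the alignment of the constants.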

Quote
To note what parameters are passed into what registers, you'll have to compile a Pascal function that uses vectorcall with a number of vector-like parameters and see how they interact.

Hmm... this just makes life more difficult than it was before; doable, but not ideal by any means. As shown by the various questions on calling parameter usage already within this thread, it is difficult enough working this stuff out without having to trawl through generated code to find the usage.

It is not such a real problem for us, as we have a well-structured test harness where we can work this out, but for other users it is maybe not such a good plan.

Quote
I'm not sure if I can do anything about Self always being passed by reference though

I doubt you could do anything about that without the fpc devs accepting a pure fpc calling convention for this, which I seriously doubt will ever happen. Too many problems in shared libs etc.

So I will park any further testing until we get the RCX/RDI issue resolved and get back to testing our stuff. But in general this is looking promising.

« Last Edit: February 17, 2018, 08:28:32 am by dicepd »

CuriousKit

  • Jr. Member
  • **
  • Posts: 74
Re: AVX and SSE support question
« Reply #200 on: February 18, 2018, 12:40:23 am »
I stand corrected on one thing... the System V ABI does support unaligned vectors, unlike vectorcall.  I'll see if I can correct that and hence fix your library!

dicepd

  • Full Member
  • ***
  • Posts: 163
Re: AVX and SSE support question
« Reply #201 on: February 18, 2018, 06:32:19 am »
Quote
I stand corrected on one thing... the System V ABI does support unaligned vectors, unlike vectorcall.  I'll see if I can correct that and hence fix your library!

??? Nothing in our library uses unaligned vectors, at least not in 64-bit. We have been quite strict in making sure that all accesses are aligned for performance reasons. Only 32-bit uses unaligned assembler variants, and to be honest 32-bit is not the priority, as it is much slower because of the fewer registers available and the fact that we are forced to use unaligned assembler calls.

32-bit is a limitation of the FPC Pascal calling convention. It should be possible to make vectorcall work for all 32-bit Intel platforms, as the calling convention in 32-bit is a Pascal-defined calling convention, if I remember correctly.
« Last Edit: February 18, 2018, 07:00:45 am by dicepd »

BeanzMaster

  • Full Member
  • ***
  • Posts: 145
Re: AVX and SSE support question
« Reply #202 on: March 31, 2018, 08:20:23 pm »
Hi all, I'm currently updating the vectormath lib (https://github.com/jdelauney/SIMD-VectorMath-UnitTest)

I'm working with Double.

Here is a piece of code:
Code: Pascal
class operator TGLZVector2d.+(constref A, B: TGLZVector2d): TGLZVector2d; assembler; nostackframe; register;
asm
  movapd xmm0, [A]
  movapd xmm1, [B]
  addpd  xmm0, xmm1
end;

This code works well under Windows but not under Linux.

The strange thing is in the .s file:

Quote
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_plus$tglzvector2d$tglzvector2d$$tglzvector2d
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D:
.Lc213:
# Var A located in register rdi
# Var B located in register rsi
# [vectormath_vector2d_unix64_sse_imp.inc]
# [4] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [5] movapd xmm0, [A]
   movapd   (%rdi),%xmm0
# [6] movapd xmm1, [B]
   movapd   (%rsi),%xmm1
# [7] addpd  xmm0, xmm1
   addpd   %xmm1,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [8] end;
   ret
   # Register xmm0,xmm1 released
.Lc214:
.Le94:
   .size   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D, .Le94 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_plus$TGLZVECTOR2D$TGLZVECTOR2D$$TGLZVECTOR2D

The same with the Single type:
Quote
GLZVECTORMATH$_$TGLZVECTOR2F_$__$$_plus$TGLZVECTOR2F$TGLZVECTOR2F$$TGLZVECTOR2F:
.Lc102:
# Var A located in register rdi
# Var B located in register rsi
# Var $result located in register xmm0
# [vectormath_vector2f_unix64_sse_imp.inc]
# [4] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [5] movq  xmm0, [A]
   movq   (%rdi),%xmm0
# [6] movq  xmm1, [B]
   movq   (%rsi),%xmm1
# [7] addps xmm0, xmm1
   addps   %xmm1,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [8] end;
   ret
   # Register xmm0 released

As we can see, no result is allocated for Double. Do you have any idea, or is it a bug in the FPC compiler?


CuriousKit

  • Jr. Member
  • **
  • Posts: 74
Re: AVX and SSE support question
« Reply #203 on: April 01, 2018, 03:13:31 am »
Under the System V ABI that 64-bit Linux uses, floating-point results of type Single or Double are passed via XMM0.  I don't see any fault with the code in this instance, or am I missing something?

In other news, I have finally fixed the bug with "vectorcall" where it puts Self into RCX instead of RDI on Linux, instead of silently ignoring the Windows-only calling convention.  Patch is here: https://bugs.freepascal.org/view.php?id=33542 - sorry it took so long, especially for a surprisingly simple fix.

BeanzMaster

  • Full Member
  • ***
  • Posts: 145
Re: AVX and SSE support question
« Reply #204 on: April 03, 2018, 10:14:04 pm »
Hi, CuriousKit
Quote
I don't see any fault with the code in this instance, or am I missing something?
I checked my code and all seems OK.

The strange thing is that with functions like Length, which returns a Double, or Round, which returns a TGLZVector2I, all is OK. But for all functions with a return type of TGLZVector2d, the result is not allocated:

Quote
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_length$$double
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE:
.Lc247:
# Var $self located in register rdi
# Var $result located in register xmm0
# [181] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [182] movapd xmm0, [RDI]
   movapd   (%rdi),%xmm0
# [183] mulpd  xmm0, xmm0
   mulpd   %xmm0,%xmm0
# [184] haddpd xmm0, xmm0
   haddpd   %xmm0,%xmm0
# [187] sqrtsd   xmm0, xmm0
   sqrtsd   %xmm0,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [188] end;
   ret
   # Register xmm0 released
.Lc248:
.Le111:
   .size   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE, .Le111 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_LENGTH$$DOUBLE

.section .text.n_glzvectormath$_$tglzvector2d_$__$$_round$$tglzvector2i
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_ROUND$$TGLZVECTOR2I:
.Lc257:
# Var $self located in register rdi
# Var $result located in register rax
# [234] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [236] movapd   xmm0, [RDI]
   movapd   (%rdi),%xmm0

And, for example, the Normalize function:

Quote
.section .text.n_glzvectormath$_$tglzvector2d_$__$$_normalize$$tglzvector2d
   .balign 16,0x90
.globl   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D
   .type   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D,@function
GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D:
.Lc255:
# Var $self located in register rdi
# [223] asm
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 allocated
# [224] movapd xmm2, [RDI]
   movapd   (%rdi),%xmm2
# [225] movapd xmm0, xmm2
   movapd   %xmm2,%xmm0
# [226] mulpd  xmm2, xmm2
   mulpd   %xmm2,%xmm2
# [227] haddpd xmm2, xmm2
   haddpd   %xmm2,%xmm2
# [228] sqrtpd xmm2, xmm2
   sqrtpd   %xmm2,%xmm2
# [229] divpd  xmm0, xmm2
   divpd   %xmm2,%xmm0
   # Register rax,rcx,rdx,rsi,rdi,r8,r9,r10,r11 released
# [230] end;
   ret
   # Register xmm0,xmm1 released
.Lc256:
.Le115:
   .size   GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D, .Le115 - GLZVECTORMATH$_$TGLZVECTOR2D_$__$$_NORMALIZE$$TGLZVECTOR2D

It's not a problem with alignment; all seems correct. It's silly behaviour. Perhaps I'll have to open a bug report.

If someone can test on a Linux distro other than mine (Manjaro), it would help.

Quote
In other news, I have finally fixed the bug with "vectorcall" where it puts Self into RCX instead of RDI on Linux, instead of silently ignoring the Windows-only calling convention.  Patch is here: https://bugs.freepascal.org/view.php?id=33542 - sorry it took so long, especially for a surprisingly simple fix.

No problem, you've already done an awesome job with that ;)

CuriousKit

  • Jr. Member
  • **
  • Posts: 74
Re: AVX and SSE support question
« Reply #205 on: April 06, 2018, 05:16:37 pm »
It might just be a missing comment in the .s file, but under Linux 64-bit and vectorcall, a return vector of 2 doubles is wholly contained within XMM0 (it might be split between XMM0 and XMM1 though if it's not aligned, which is technically incorrect for the System V ABI).  What does the disassembly show when you try to call Normalize and assign the result?
« Last Edit: April 06, 2018, 05:18:30 pm by CuriousKit »

BeanzMaster

  • Full Member
  • ***
  • Posts: 145
Re: AVX and SSE support question
« Reply #206 on: April 06, 2018, 06:38:15 pm »
Hi CK, thanks, that is surely the solution. I'll need to add a movhlps xmm1, xmm0. I hadn't taken that into account, as I don't use Linux often. I checked the "TGLZVector4f" code and it has the same issue. I'm currently not at home; I'll check tonight and let you know 👍
Thanks

BeanzMaster

  • Full Member
  • ***
  • Posts: 145
Re: AVX and SSE support question
« Reply #207 on: April 06, 2018, 10:05:57 pm »
All tests are green now, thanks again CuriousKit 8-)

CuriousKit

  • Jr. Member
  • **
  • Posts: 74
Re: AVX and SSE support question
« Reply #208 on: April 06, 2018, 10:50:34 pm »
No problem at all.

Note: In FPC 3.0.4, it is definitely split between XMM0 and XMM1 for Linux 64-bit.  When FPC 3.1.1 is released, the result for a vector of 2 doubles will likely just be contained within XMM0 and hence your code will require updating.
« Last Edit: April 06, 2018, 10:53:31 pm by CuriousKit »

BeanzMaster

  • Full Member
  • ***
  • Posts: 145
Re: AVX and SSE support question
« Reply #209 on: April 06, 2018, 11:45:52 pm »
Quote
Note: In FPC 3.0.4, it is definitely split between XMM0 and XMM1 for Linux 64-bit.  When FPC 3.1.1 is released, the result for a vector of 2 doubles will likely just be contained within XMM0 and hence your code will require updating.

Yes, I've run some tests with trunk under Windows and it's promising (not with vectorcall yet). FPC 3.1.1 is clearly better from a performance point of view.

 
