Author Topic: Vectorcall and records  (Read 1664 times)

Madoc

  • Jr. Member
  • **
  • Posts: 52
Vectorcall and records
« on: November 25, 2023, 01:40:50 pm »
It appears that if I define a vector type like this:


Code: Pascal
  TVec4f = record
    case integer of
      0: (v: array[0..3] of single);
      1: (xyz: array[0..2] of single);
      2: (xy: TA2f; zw: array[0..1] of single);
      3: (x,y,z,w: single);
  end;

The vectorcall convention does not work with this as a parameter. Am I doing something wrong? Is there a workaround? Is this behaviour due to change?

Keep in mind that the above is just an example to illustrate the problem. The actual type has a bunch of operators and functions, and the embedded vector array types are predefined with their own mechanics, so simply passing "v" to a function won't do.

Thanks

Madoc

Re: Vectorcall and records
« Reply #1 on: November 25, 2023, 02:00:09 pm »
I've noticed another possibly more serious issue. If I declare a simple function like this:
Code: Pascal
  type
    TA3f = array[0..2] of single;

  function Sub(const a, b: TA3f): TA3f; assembler; nostackframe; vectorcall;
  asm
    subps    xmm0, xmm1
  end;

I would expect a and b to be passed in xmm0 and xmm1, and the result to be returned in xmm0. Instead of being loaded into these registers, this is what seems to happen:

Code: Pascal
  movq xmm3, [rax]
  movq xmm4, [rax+8]
  ...
  movq xmm1, [rax]
  movq xmm2, [rax+8]

And the result is not using xmm registers.

Obviously a 3-component vector is not recognised, so I'm seeing this weird 2-1, 2-1 split and no result in xmm0. The split also uses registers 1-4 instead of 0-3.

Is this intended behaviour?

jamie

  • Hero Member
  • *****
  • Posts: 6787
Re: Vectorcall and records
« Reply #2 on: November 25, 2023, 05:10:52 pm »
Using 3.2.2, I don't see the same.

The call order here is:
xmm0, xmm4, xmm1, xmm2.

And xmm0 is moved to xmm3 before the call.

RCX is also being loaded with a stack address. I'm not sure about that; maybe it is used as the address for the return value?


EDIT:
As for the record issue, you are correct: the compiler simply passes the addresses of the records via a standard register.
However, on return, the XMM registers are being used to set a record return type.

Strange, maybe this should be reported?


« Last Edit: November 25, 2023, 05:40:37 pm by jamie »
The only true wisdom is knowing you know nothing

Madoc

Re: Vectorcall and records
« Reply #3 on: November 25, 2023, 05:40:39 pm »
Sounds the same to me. You're also getting 2 vectors in xmm1-4 instead of 0-1. The code before the call should look something like this:
Code: Pascal
  movq     xmm0, [rax]
  movss    xmm1, [rax+8]
  movlhps  xmm0, xmm1
  ...
  movq     xmm1, [rax]
  movss    xmm2, [rax+8]
  movlhps  xmm1, xmm2


And obviously the result should be expected in xmm0. If the vectors don't remain resident in registers between calls then the calling convention is pointless. That's the optimisation.

Once the register is done with, it should be written back to memory with the same split addresses. Using a different load and store pattern won't hit the same cache lines and can incur a massive performance hit.
Code: Pascal
  movhlps  xmm1, xmm0
  movq     [rax], xmm0
  movss    [rax+8], xmm1


jamie

Re: Vectorcall and records
« Reply #4 on: November 25, 2023, 05:49:20 pm »
I don't think the compiler is fully compatible with Delphi, or at least it is buggy in some cases.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11984
  • FPC developer.
Re: Vectorcall and records
« Reply #5 on: November 25, 2023, 05:57:27 pm »
Do you have an example of partial XMM vector use in Delphi? If so, which version?

I'm still trying to accelerate the well-known Nils Haeck FFT unit.

Madoc

Re: Vectorcall and records
« Reply #6 on: November 25, 2023, 06:46:15 pm »
What does this have to do with Delphi?

So I also tried the code from this example:

https://gitlab.com/freepascal.org/fpc/source/-/blob/main/tests/test/cg/tvectorcall3.pp

My compilation refuses to use the vectorcall convention at all, so the result of AddVectorsAsm is garbage. My colleague however is able to compile an apparently functional version of this on his machine. We can't find any real difference in how our projects or Lazarus are set up. If I remove {$CODEALIGN RECORDMIN=16} I get the same weird split vector behaviour I did in my own code. Either way nothing is working on my end.

What could be affecting these things?

jamie

Re: Vectorcall and records
« Reply #7 on: November 25, 2023, 07:02:44 pm »

Quote
What does this have to do with Delphi?


This
https://forum.lazarus.freepascal.org/index.php/topic,65336.0.html


Madoc

Re: Vectorcall and records
« Reply #8 on: November 25, 2023, 07:26:58 pm »
Quote
What does this have to do with Delphi?
This
https://forum.lazarus.freepascal.org/index.php/topic,65336.0.html

I still don't understand. I'm currently using Lazarus and FPC, not Delphi.

Anyway, I've done a bunch more tests and what I've found is that only this exact configuration seems to work (for me):
Code: Pascal
  {$push}
  {$CODEALIGN RECORDMIN=16}
  {$PACKRECORDS C}
  type
    TM128 = record
      case Byte of
        0: (M128_F32: array[0..3] of Single);
        1: (M128_F64: array[0..1] of Double);
    end;
  {$pop}

    TVector4f = packed record
      case Byte of
        0: (M128: TM128);
        1: (X, Y, Z, W: Single);
    end;

If I deviate from this exact setup in any way it either stops using vectorcall entirely or does the weird split. For example:

Code: Pascal
  {$push}
  {$CODEALIGN RECORDMIN=16}
  {$PACKRECORDS C}
  type
    TM128 = array[0..3] of Single;
  {$pop}


Will stop using vectorcall at all.

And this:
Code: Pascal
  TVector4f = packed record
    case Byte of
      0: (M128: TM128);
      1: (X, Y, Z, W: Single);
      2: (v: array[0..3] of Single);
  end;


Will result in the weird split register behaviour from before.

For some reason my colleague was able to use the first variation, substituting an array for TM128, but got similar results with everything else.

Once again these are just examples. The point is that I can't figure out how to make real, practical record types work. I'm not sure what's going on, and I can't find any documentation on this, but it seems too fragile and just impossible to use.

Madoc

Re: Vectorcall and records
« Reply #9 on: November 26, 2023, 10:51:41 pm »
Quote
Do you have an example of partial XMM vector use in Delphi? If so, which version?

I'm still trying to accelerate the well-known Nils Haeck FFT unit.

I'm sorry, I didn't realise you're a developer and I didn't understand your question. Last I knew, Delphi still doesn't support vectorcall. I've been eyeing FPC for a while, but support for vectorcall is what made me jump in. I get around Delphi's lack of modern features and optimisation by writing a whole bunch of assembler. Vectorcall could really simplify my code while boosting performance at the same time. If you guys manage to add inlining of assembler functions too (I got a hint this might be in sight), FPC would become truly top tier for my kind of usage.

I'm still not sure what your question was, but if I happen to know anything that might help you I'm more than happy to share.

runewalsh

  • Jr. Member
  • **
  • Posts: 85
Re: Vectorcall and records
« Reply #10 on: November 27, 2023, 06:09:02 am »
Off-topic, but simply don’t use vectorcall. Your vectorized routines should be large and non-trivial enough to be unaffected by things like transferring parameters to and from memory; otherwise your vectorization won’t do much even with vectorcall. E.g., instead of Add(vec4, vec4) → vec4, have BatchAdd(pvec4, pvec4, pvec4, count), and store your data accordingly.
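
For illustration, a rough sketch of what such a batch-style interface could look like in Free Pascal. The names TVec4, PVec4 and BatchAdd are hypothetical and not taken from any library, and the loop body is plain Pascal rather than SSE; a real version would presumably replace the inner loop with vector instructions.

```pascal
program BatchAddSketch;
{$mode objfpc}
{$pointermath on}  // allow indexing typed pointers like arrays

type
  TVec4 = array[0..3] of Single;  // hypothetical 4-component vector
  PVec4 = ^TVec4;

{ Adds Count vectors component-wise: Dst[i] := A[i] + B[i].
  One call covers the whole batch, so parameter passing happens once. }
procedure BatchAdd(A, B, Dst: PVec4; Count: SizeInt);
var
  i, k: SizeInt;
begin
  for i := 0 to Count - 1 do
    for k := 0 to 3 do
      Dst[i][k] := A[i][k] + B[i][k];
end;

const
  A: array[0..1] of TVec4 = ((1, 2, 3, 4), (5, 6, 7, 8));
  B: array[0..1] of TVec4 = ((10, 20, 30, 40), (50, 60, 70, 80));
var
  R: array[0..1] of TVec4;
begin
  BatchAdd(@A[0], @B[0], @R[0], 2);
  // Print the first component of the first result and the last of the second.
  WriteLn(R[0][0]:0:1, ' ', R[1][3]:0:1);
end.
```

The data is stored as contiguous arrays of vectors so that one call processes all of them, which is the layout the suggestion above relies on.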

Madoc

Re: Vectorcall and records
« Reply #11 on: November 27, 2023, 10:04:19 am »
Quote
Off-topic, but simply don’t use vectorcall. Your vectorized routines should be large and non-trivial enough to be unaffected by things like transferring parameters to and from memory; otherwise your vectorization won’t do much even with vectorcall. E.g., instead of Add(vec4, vec4) → vec4, have BatchAdd(pvec4, pvec4, pvec4, count), and store your data accordingly.

Of course I write larger routines when I can, but my general case isn't crunching lots of data with simple operations, but doing lots of complex operations on small amounts of data. Ironically in your example you'd be reducing call overhead, not memory access, and vectorcall is not relevant at all.

Writing hand-optimised vector operators and functions (stuff like matrix multiplies, cross products etc.) gives me a massive increase in performance regardless. This was true even in the FPU days, but with SSE it's even more valid, especially in Delphi or FPC, which don't do automatic vectorisation like most C++ compilers these days. For Pascal today, you're much better off writing vector functions for even the most basic operations.

Vectorcall in these situations can provide a very big increase in performance and partly makes up for the lack of automatic vectorisation too, as long as the compiler is smart enough to keep things in xmm registers rather than moving to memory between calls.

 
