
Author Topic: Better Vectorcall Support  (Read 2204 times)

Madoc

  • Jr. Member
  • **
  • Posts: 52
Better Vectorcall Support
« on: November 30, 2023, 08:58:36 am »
Hello, I'm new here. I hope it's okay for me to make this kind of post on this forum.

One morning I stumble across FPC release notes and I see "vectorcall". Within 5 minutes I'm informing my colleagues that we're porting everything over to FPC / Lazarus. Yes.

I've been tinkering with FPC for about a week now and everything looks good, but vectorcall doesn't actually work in practice. There are a couple of reasons for this:


1) Support for 3 component vectors

Regardless of everything else, this is the really critical one. Real world applications tend to use 3D vectors, whether it's XYZ or RGB. These are currently not supported by FPC's implementation of vectorcall.

These should be loaded and stored like this:
Code: Pascal
        // Load from memory
        movq    xmm0, qword [rcx]         // Load XY00
        movss   xmm1, dword [rcx+8]       // Load Z000
        movlhps xmm0, xmm1                // Combine to XYZ0

        // Write to memory
        movhlps  xmm1, xmm0               // Copy Z0
        movq     qword [rcx], xmm0        // Store XY
        movss    dword [rcx+8], xmm1      // Store Z

Any combination of xmm registers can be used of course. The resulting register will contain XYZ0.

Once a 3D vector is loaded into a register, it can be considered a 4D vector until it needs to be written back to "real memory" and faster 4 component moves should be used:
Code: Pascal
        movaps  xmm1, xmm0              // Copy from xmm0 to xmm1

        movaps  dqword [rsp], xmm0      // 16 byte aligned copy to stack (preferable)

        movups  dqword [rsp], xmm0      // Unaligned copy to stack


3D vectors are not only extremely common, but need the most help from vectorcall as loading and storing them is particularly slow.
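
To make the XYZ0 idea concrete without assembly, here is a minimal plain Pascal sketch. It only mimics the register layout in memory and is not what the compiler generates. `Widen`, `Narrow` and `TArray4f` are hypothetical names made up for this illustration.

```pascal
program Vec3Widen;
{$mode objfpc}

type
  TArray3f = packed array [0..2] of single;
  // Hypothetical 4-lane stand-in for the contents of an xmm register
  TArray4f = packed array [0..3] of single;

// Mirrors the load sequence: X, Y and Z are copied, the fourth lane is zeroed (XYZ0)
function Widen(const v: TArray3f): TArray4f;
begin
  Result[0] := v[0];
  Result[1] := v[1];
  Result[2] := v[2];
  Result[3] := 0.0;
end;

// Mirrors the store sequence: only the first three lanes (12 bytes) are written back
function Narrow(const v: TArray4f): TArray3f;
begin
  Result[0] := v[0];
  Result[1] := v[1];
  Result[2] := v[2];
end;

var
  a, b: TArray3f;
  r: TArray4f;
begin
  a[0] := 1.0; a[1] := 2.0; a[2] := 3.0;
  r := Widen(a);   // treated as a 4D vector while it is "in the register"
  b := Narrow(r);  // back to 12 bytes of "real memory"
  WriteLn(r[0]:0:1, ' ', r[1]:0:1, ' ', r[2]:0:1, ' ', r[3]:0:1);
  WriteLn(b[0]:0:1, ' ', b[1]:0:1, ' ', b[2]:0:1);
  WriteLn(SizeOf(TArray3f), ' ', SizeOf(TArray4f));
end.
```

The point is that the fourth lane only exists while the value is in flight. The 12 byte layout in memory never changes.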

Consider a line like this (vectors with class operators):

   intersection := a + (b - a) * ((threshold - ta) / (tb - ta));

This generates a whole lot of very slow code. With vectorcall this would become much faster and more efficient.
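
For anyone who wants to try that line, here is a minimal self-contained sketch of the kind of record I mean. It assumes a simplified `TVec3f` (plain x, y, z fields, no variant part) with only the three operators the expression needs. The test values are made up.

```pascal
program IntersectDemo;
{$mode objfpc}
{$modeswitch advancedrecords}

type
  // Simplified stand-in for a 3-component vector with class operators
  TVec3f = packed record
    x, y, z: single;
    class operator +(const a, b: TVec3f): TVec3f;
    class operator -(const a, b: TVec3f): TVec3f;
    class operator *(const a: TVec3f; s: single): TVec3f;
  end;

class operator TVec3f.+(const a, b: TVec3f): TVec3f;
begin
  Result.x := a.x + b.x;
  Result.y := a.y + b.y;
  Result.z := a.z + b.z;
end;

class operator TVec3f.-(const a, b: TVec3f): TVec3f;
begin
  Result.x := a.x - b.x;
  Result.y := a.y - b.y;
  Result.z := a.z - b.z;
end;

class operator TVec3f.*(const a: TVec3f; s: single): TVec3f;
begin
  Result.x := a.x * s;
  Result.y := a.y * s;
  Result.z := a.z * s;
end;

var
  a, b, intersection: TVec3f;
  ta, tb, threshold: single;
begin
  // Made-up inputs: the threshold sits halfway between ta and tb
  a.x := 0; a.y := 0; a.z := 0;
  b.x := 2; b.y := 4; b.z := 8;
  ta := 0; tb := 1; threshold := 0.5;

  intersection := a + (b - a) * ((threshold - ta) / (tb - ta));

  // prints 1.00 2.00 4.00
  WriteLn(intersection.x:0:2, ' ', intersection.y:0:2, ' ', intersection.z:0:2);
end.
```

Each operator here is a separate call that passes and returns a 12 byte record, which is exactly where passing the vectors in registers would pay off.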



2) Record compatibility

Even if only requirement 1 is met, I'm ready to go (albeit with far clunkier code), but it would obviously be very good if vectorcall could be used with record parameters and operators. Here's an example of such a data layout:
Code: Pascal
type
  TArray3f = packed array [0..2] of single;

  // TVec2f (the 2-component equivalent) is assumed to be declared earlier; it is not shown in this post
  TVec3f = packed record
    case integer of
      0: (v: TArray3f);
      1: (xy: TVec2f);
      2: (x,y,z: single);
    end;

This should be recognised as a vector. As it stands, even a 2 or 4 component equivalent is not, and vectorcall will not work with it: code attempting to use it produces errors.

I'm not sure what your design philosophy would be for something like this, or if there's something I haven't considered, but presumably vectorcall should just always work based on size of data alone (8, 12, 16 bytes). Alternatively a compiler directive applied to the record could simplify things.
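
To show what "based on size of data alone" would mean, here is a small sketch. The `LooksLikeVector` rule is purely hypothetical, just my suggestion written out. It is not how FPC or MSVC classify arguments, and the record types are simplified stand-ins.

```pascal
program VecSizes;
{$mode objfpc}

type
  TVec2f = packed record x, y: single; end;
  TVec3f = packed record x, y, z: single; end;
  TVec4f = packed record x, y, z, w: single; end;
  // Not a float vector at all, but still 8 bytes in size
  TPair  = packed record a, b: longint; end;

// Hypothetical size-only rule: 8, 12 or 16 bytes counts as a vector
function LooksLikeVector(size: SizeInt): boolean;
begin
  Result := (size = 8) or (size = 12) or (size = 16);
end;

begin
  WriteLn('TVec2f: ', SizeOf(TVec2f), ' ', LooksLikeVector(SizeOf(TVec2f)));
  WriteLn('TVec3f: ', SizeOf(TVec3f), ' ', LooksLikeVector(SizeOf(TVec3f)));
  WriteLn('TVec4f: ', SizeOf(TVec4f), ' ', LooksLikeVector(SizeOf(TVec4f)));
  WriteLn('TPair:  ', SizeOf(TPair), ' ', LooksLikeVector(SizeOf(TPair)));
end.
```

Note that `TPair` also passes the size test even though it holds integers. That may be an argument for the directive approach, where the record is explicitly marked as a vector, rather than a pure size check.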



That's the important stuff. But if inlining vectorcall functions could also be a thing, we'd essentially have the ability to write our own high level optimised intrinsics with no overhead, and have blazingly fast class operators. This would be incredible and really put FPC on the map.

I think that's all I have for now. I really hope at least some of this can be implemented. I would be implementing and maintaining exhaustive and highly optimised vector functions (I know some tricks) which I'd be happy to share with this community. I'm not sure if I can otherwise be of help, but I can try.

Thanks!

Thaddy

  • Hero Member
  • *****
  • Posts: 16200
  • Censorship about opinions does not belong here.
Re: Better Vectorcall Support
« Reply #1 on: November 30, 2023, 11:25:51 am »
That is all a bit Intel-centered and is not applicable to other CPU types like ARM, which also supports vector math natively. Actually better than Intel....
If I smell bad code it usually is bad code and that includes my own code.

Madoc

  • Jr. Member
  • **
  • Posts: 52
Re: Better Vectorcall Support
« Reply #2 on: November 30, 2023, 02:09:07 pm »
Quote from: Thaddy on November 30, 2023, 11:25:51 am
That is all a bit Intel-centered and is not applicable to other CPU types like ARM, which also supports vector math natively. Actually better than Intel....

I don't have experience with ARM so can't help you there.

Personally I can't see myself taking an interest in ARM in the near future and getting involved in that aspect of this. Apple represents the widest adoption and they decided to drop support for Vulkan, so I decided I'm going to completely ignore that their platform exists.

Either way, there is supposedly support for vectorcall in FPC for x86, but in its current state it's not really usable. I would assume the same issues apply to ARM. I don't know what the machine code implementation there would be, but it looks to me as though only a fairly small change is needed to how FPC currently handles vector registers.

Nitorami

  • Hero Member
  • *****
  • Posts: 507
Re: Better Vectorcall Support
« Reply #3 on: November 30, 2023, 08:38:46 pm »
Maybe you would be better off asking such questions on the mailing list? You'll probably reach more developers there.

WayneSherman

  • Sr. Member
  • ****
  • Posts: 250
Re: Better Vectorcall Support
« Reply #4 on: December 11, 2023, 02:52:53 am »
Quote from: Nitorami on November 30, 2023, 08:38:46 pm
Maybe you would be better off asking such questions on the mailing list? You'll probably reach more developers there.

fpc-devel -- FPC developers' list
Sign up and see archived messages here:  https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Previous threads discussing vectorcall:
https://www.google.com/search?q=site%3Alists.freepascal.org%2Fpipermail%2Ffpc-devel%2F+vectorcall

Code, Issues, Merge Requests, Commits, and Comments which make reference to "vectorcall"
https://gitlab.com/search?group_id=12463123&project_id=28644964&repository_ref=main&search=vectorcall&scope=notes
(open the side bar using the search icon on the top left to see each results category)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11950
  • FPC developer.
Re: Better Vectorcall Support
« Reply #5 on: December 11, 2023, 09:39:31 am »
If I look at the Microsoft explanation of vectorcall, I see arrays of vector types, but NOT of primitive types.

https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170

Madoc

  • Jr. Member
  • **
  • Posts: 52
Re: Better Vectorcall Support
« Reply #6 on: December 11, 2023, 07:55:27 pm »
Quote from: marcov on December 11, 2023, 09:39:31 am
If I look at the Microsoft explanation of vectorcall, I see arrays of vector types, but NOT of primitive types.

https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170

Considering the state of unions and strict aliasing in C++ I'm not surprised, but for Pascal it seems like a really great fit. This kind of structure and its applications with operators is what would benefit from vectorcall the most. If we had some way to identify or specify these structures as vectorcall compatible we could write very fast and very elegant code.

I'm keeping my expectations in check, and even if we only get vectorcall support for 3 component vectors I'll be very very happy given the potential performance improvements.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5764
  • Compiler Developer
Re: Better Vectorcall Support
« Reply #7 on: December 14, 2023, 10:05:47 pm »
Quote from: Madoc on December 11, 2023, 07:55:27 pm
I'm keeping my expectations in check, and even if we only get vectorcall support for 3 component vectors I'll be very very happy given the potential performance improvements.

The vectorcall calling convention is supposed to be compatible with MSVC. So unless MSVC passes something differently than FPC (which would be a bug that should be reported with an example) there will be no changes/extensions.

Madoc

  • Jr. Member
  • **
  • Posts: 52
Re: Better Vectorcall Support
« Reply #8 on: December 15, 2023, 05:23:59 pm »
Quote from: PascalDragon on December 14, 2023, 10:05:47 pm
The vectorcall calling convention is supposed to be compatible with MSVC. So unless MSVC passes something differently than FPC (which would be a bug that should be reported with an example) there will be no changes/extensions.

I'm not sure what the goal here is. A calling convention is only concerned with what registers it uses; it doesn't care about their contents and it most certainly isn't concerned with what language specific syntax was originally used to describe the data. There is no direct type compatibility here between the languages. MSVC only seems to work with __m types which don't exist in FPC, and FPC only works with static array types which don't exist in C++.

I've seen FPC functionally passing 2 and 4 component vectors (defined as array[0..x] of single), but with 3 component vectors it splits them into a 2D vector and a value (xy00, z000), which I'm pretty sure isn't documented or used by anyone anywhere. I would put such vectors into one register as xyz0; the compiler is currently literally just missing the one movlhps to do this. It makes perfect sense and clearly isn't going to break anything.


The rest is whatever really, but FPC is its own language with its own way of describing types. Simple register compatibility could benefit it enormously in an area where it is currently falling far behind, given its lack of automatic vectorisation and intrinsics. In my opinion it would in fact make it superior to either of those solutions and play to the language's strengths.

It seems to me that someone needs to decide what the role of vectorcall in FPC should be, because as it is it's neither here nor there and doesn't have much benefit. My suggestions wouldn't break anything, and the advantages are huge and obvious. This would be a small change allowing for extremely fast and extremely elegant code.

 
