
Better Vectorcall Support


Hello, I'm new here. I hope it's okay for me to make this kind of post on this forum.

One morning I stumbled across the FPC release notes and saw "vectorcall". Within 5 minutes I was informing my colleagues that we're porting everything over to FPC / Lazarus. Yes.

I've been tinkering with FPC for about a week now and everything looks good, but vectorcall doesn't actually work in practice. There are a couple of reasons for this:

1) Support for 3 component vectors

Regardless of everything else, this is the really critical one. Real world applications tend to use 3D vectors, whether it's XYZ or RGB. These are currently not supported by FPC's implementation of vectorcall.

These should be loaded and stored like this:

--- Code: Pascal ---
        // Load from memory
        movq    xmm0, qword [rcx]         // Load XY00
        movss   xmm1, dword [rcx+8]       // Load Z000
        movlhps xmm0, xmm1                // Combine to XYZ0

        // Write to memory
        movhlps  xmm1, xmm0               // Copy Z0
        movq     qword [rcx], xmm0        // Store XY
        movss    dword [rcx+8], xmm1      // Store Z
--- End code ---

Any combination of xmm registers can be used, of course. The resulting register will contain XYZ0.

Once a 3D vector is loaded into a register, it can be treated as a 4D vector until it needs to be written back to "real memory", so the faster 4-component moves should be used in the meantime:

--- Code: Pascal ---
        movaps  xmm1, xmm0              // Copy from xmm0 to xmm1

        movaps  dqword [rsp], xmm0      // 16 byte aligned copy to stack (preferable)

        movups  dqword [rsp], xmm0      // Unaligned copy to stack
--- End code ---

3D vectors are not only extremely common, but need the most help from vectorcall as loading and storing them is particularly slow.
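
To make the load/store pattern concrete without assembler, here is a rough plain-Pascal sketch of the same idea: read exactly 12 bytes into a 4-lane value with the last lane zeroed, work on it as a 4D vector, then write exactly 12 bytes back. TVec4f, Load3 and Store3 are hypothetical names made up for this illustration, not anything FPC provides, and the sketch says nothing about what code the compiler actually generates.

```pascal
program Vec3PadDemo;

{$mode delphi}

type
  // 12 bytes in memory: the 3-component case discussed above
  TVec3f = packed record
    x, y, z: single;
  end;

  // 16 bytes: stands in for the contents of one xmm register (hypothetical helper type)
  TVec4f = packed record
    x, y, z, w: single;
  end;

// Mirrors the movq/movss/movlhps load: read 12 bytes, zero the 4th lane.
function Load3(const v: TVec3f): TVec4f;
begin
  Result.x := v.x;
  Result.y := v.y;
  Result.z := v.z;
  Result.w := 0.0;
end;

// Mirrors the movq/movss store: write 12 bytes, drop the 4th lane.
procedure Store3(const r: TVec4f; var v: TVec3f);
begin
  v.x := r.x;
  v.y := r.y;
  v.z := r.z;
end;

var
  buf: array[0..1] of TVec3f;  // tightly packed neighbours, 24 bytes in total
  r: TVec4f;
begin
  buf[0].x := 1; buf[0].y := 2; buf[0].z := 3;
  buf[1].x := 7; buf[1].y := 8; buf[1].z := 9;

  r := Load3(buf[0]);          // r is (1, 2, 3, 0) while "in the register"
  r.x := r.x * 2;
  r.y := r.y * 2;
  r.z := r.z * 2;
  Store3(r, buf[0]);           // only 12 bytes are written back

  WriteLn(SizeOf(TVec3f), ' ', SizeOf(TVec4f));
  WriteLn(buf[0].x:0:1, ' ', buf[0].y:0:1, ' ', buf[0].z:0:1);
  WriteLn(buf[1].x:0:1);       // the neighbouring vector is untouched
end.
```

The second array element is there to show why a full 16-byte store cannot simply be used on packed 3D data: it would overwrite the x component of the next vector.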

Consider a line like this (vectors with class operators):

intersection := a + (b - a) * ((threshold - ta) / (tb - ta));

This generates a whole lot of very slow code. With vectorcall this would become much faster and more efficient.
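
For anyone who wants to reproduce this, a minimal sketch of the kind of type that line assumes. The name TVec3f matches the record used elsewhere in this post; the particular operator set, the field layout and the sample values are assumptions chosen for illustration. It uses {$mode delphi} purely so the operator names are unambiguous.

```pascal
program IntersectDemo;

{$mode delphi}

type
  TVec3f = record
    x, y, z: single;
    class operator Add(const a, b: TVec3f): TVec3f;
    class operator Subtract(const a, b: TVec3f): TVec3f;
    class operator Multiply(const a: TVec3f; s: single): TVec3f;
  end;

class operator TVec3f.Add(const a, b: TVec3f): TVec3f;
begin
  Result.x := a.x + b.x;
  Result.y := a.y + b.y;
  Result.z := a.z + b.z;
end;

class operator TVec3f.Subtract(const a, b: TVec3f): TVec3f;
begin
  Result.x := a.x - b.x;
  Result.y := a.y - b.y;
  Result.z := a.z - b.z;
end;

class operator TVec3f.Multiply(const a: TVec3f; s: single): TVec3f;
begin
  Result.x := a.x * s;
  Result.y := a.y * s;
  Result.z := a.z * s;
end;

var
  a, b, intersection: TVec3f;
  ta, tb, threshold: single;
begin
  // Sample values, chosen so the interpolation factor is exactly 0.5
  a.x := 0; a.y := 0; a.z := 0;
  b.x := 2; b.y := 4; b.z := 6;
  ta := 0.0;
  tb := 1.0;
  threshold := 0.5;

  // Three operator calls, each passing and returning a 12-byte record
  intersection := a + (b - a) * ((threshold - ta) / (tb - ta));

  WriteLn(intersection.x:0:2, ' ', intersection.y:0:2, ' ', intersection.z:0:2);
end.
```

Each of the three operator calls passes and returns a 12-byte record, which is exactly the traffic that keeping the values in xmm registers would be meant to avoid.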

2) Record compatibility

Even if only requirement 1 is met, I'm ready to go (albeit with far clunkier code), but it would obviously be very good if vectorcall could be used with record parameters and operators. Here's an example of such a data layout:

--- Code: Pascal ---
  TArray3f = packed array [0..2] of single;

  TVec3f = packed record
    case integer of
      0: (v: TArray3f);
      1: (xy: TVec2f);
      2: (x,y,z: single);
  end;
--- End code ---

This should be recognised as a vector, but as it stands even a 2- or 4-component equivalent is not, and vectorcall will not work: code attempting to use it produces errors.

I'm not sure what your design philosophy would be for something like this, or if there's something I haven't considered, but presumably vectorcall should just always work based on size of data alone (8, 12, 16 bytes). Alternatively a compiler directive applied to the record could simplify things.
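
As a rough illustration of the "size alone" idea, the sketch below checks the sizes of 2, 3 and 4 component records against 8, 12 and 16 bytes. IsVectorSized is a hypothetical helper written for this post, not a compiler feature, and the TVec2f and TVec4f definitions are assumed since they are not shown above. A real rule would presumably also have to check that every field is a single.

```pascal
program VecSizeDemo;

{$mode delphi}

type
  TArray3f = packed array [0..2] of single;

  // Assumed definition; not shown in the post above
  TVec2f = packed record
    x, y: single;
  end;

  TVec3f = packed record
    case integer of
      0: (v: TArray3f);
      1: (xy: TVec2f);
      2: (x, y, z: single);
  end;

  // Assumed definition, for the 4-component case
  TVec4f = packed record
    x, y, z, w: single;
  end;

// Hypothetical size-only test: 2, 3 or 4 packed singles.
function IsVectorSized(size: SizeInt): boolean;
begin
  Result := (size = 8) or (size = 12) or (size = 16);
end;

begin
  WriteLn(SizeOf(TVec2f), ' ', SizeOf(TVec3f), ' ', SizeOf(TVec4f));
  WriteLn(IsVectorSized(SizeOf(TVec2f)));
  WriteLn(IsVectorSized(SizeOf(TVec3f)));
  WriteLn(IsVectorSized(SizeOf(TVec4f)));
  WriteLn(IsVectorSized(SizeOf(double) * 3));  // 24 bytes: not vector-sized
end.
```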

That's the important stuff. But if inlining vectorcall functions could also be a thing, we'd essentially have the ability to write our own high level optimised intrinsics with no overhead, and have blazingly fast class operators. This would be incredible and really put FPC on the map.

I think that's all I have for now. I really hope at least some of this can be implemented. I would be implementing and maintaining exhaustive and highly optimised vector functions (I know some tricks) which I'd be happy to share with this community. I'm not sure if I can otherwise be of help, but I can try.

Thanks!

That is all a bit Intel centered and is not applicable to other CPU types like ARM, which also supports vector math natively. Actually better than Intel....

--- Quote from: Thaddy on November 30, 2023, 11:25:51 am ---That is all a bit Intel centered and is not applicable to other CPU types like ARM, which also supports vector math natively. Actually better than Intel....

--- End quote ---

Personally I can't see myself taking an interest in ARM in the near future and getting involved in that aspect of this. Apple represents the widest adoption and they decided to drop support for Vulkan, so I decided I'm going to completely ignore that their platform exists.

Either way, there is supposedly support for vectorcall in FPC for x86, but in its current state it's not really usable. I would assume the same issues apply to ARM. I don't know what the machine code implementation there would be, but it looks to me as if only a fairly small change is needed to how FPC handles vector registers now.

Nitorami:
Maybe it would be better to ask such questions on the mailing list? You'll probably find more developers there.

WayneSherman:

--- Quote from: Nitorami on November 30, 2023, 08:38:46 pm ---Maybe it would be better to ask such questions on the mailing list? You'll probably find more developers there.

--- End quote ---

fpc-devel -- FPC developers' list