Hello, I'm new here. I hope it's okay for me to make this kind of post on this forum.
One morning I stumbled across the FPC release notes and saw "vectorcall". Within 5 minutes I was informing my colleagues that we were porting everything over to FPC / Lazarus. Yes.
I've been tinkering with FPC for about a week now and everything looks good, but vectorcall doesn't actually work in practice. There are a couple of reasons for this:
1) Support for 3 component vectors
Regardless of everything else, this is the really critical one. Real world applications tend to use 3D vectors, whether it's XYZ or RGB. These are currently not supported by FPC's implementation of vectorcall.
These should be loaded and stored like this:
// Load from memory
movq xmm0, qword [rcx] // Load XY00
movss xmm1, dword [rcx+8] // Load Z000
movlhps xmm0, xmm1 // Combine to XYZ0
// Write to memory
movhlps xmm1, xmm0 // Copy Z0
movq qword [rcx], xmm0 // Store XY
movss dword [rcx+8], xmm1 // Store Z
Any combination of xmm registers can be used of course. The resulting register will contain XYZ0.
Once a 3D vector is loaded into a register, it can be considered a 4D vector until it needs to be written back to "real memory" and faster 4 component moves should be used:
movaps xmm1, xmm0 // Copy from xmm0 to xmm1
movaps dqword [rsp], xmm0 // 16 byte aligned copy to stack (preferable)
movups dqword [rsp], xmm0 // Unaligned copy to stack
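To make the intended semantics explicit, here is a rough plain Pascal model of the same idea. It is only a sketch: TVec4f, Load3 and Store3 are names I made up for illustration, not anything FPC provides, and the 4 component record merely stands in for an xmm register.

```pascal
program Vec3Widen;
{$mode objfpc}

type
  // Hypothetical types, for illustration only
  TVec3f = packed record x, y, z: single; end;
  TVec4f = packed record x, y, z, w: single; end; // stands in for an xmm register

// Models the load sequence: 12 bytes are read, the result is XYZ0
function Load3(const m: TVec3f): TVec4f;
begin
  Result.x := m.x;
  Result.y := m.y;
  Result.z := m.z;
  Result.w := 0.0; // the fourth lane is zeroed, as movq / movss from memory do
end;

// Models the store sequence: only 12 bytes are written back
procedure Store3(const r: TVec4f; out m: TVec3f);
begin
  m.x := r.x;
  m.y := r.y;
  m.z := r.z;
end;

var
  mem: TVec3f;
  reg: TVec4f;
begin
  mem.x := 1; mem.y := 2; mem.z := 3;
  reg := Load3(mem);
  // While "in the register" the value is handled as 4 components
  reg.x := reg.x * 2;
  reg.y := reg.y * 2;
  reg.z := reg.z * 2;
  reg.w := reg.w * 2; // still 0, and never written back
  Store3(reg, mem);
  WriteLn(mem.x:0:1, ' ', mem.y:0:1, ' ', mem.z:0:1); // prints 2.0 4.0 6.0
end.
```

The point is that the fourth lane only exists between the load and the store, so nothing outside the 12 bytes of "real memory" is ever touched.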
3D vectors are not only extremely common, but need the most help from vectorcall as loading and storing them is particularly slow.
Consider a line like this (vectors with class operators):
intersection := a + (b - a) * ((threshold - ta) / (tb - ta));
This generates a whole lot of very slow code. With vectorcall this would become much faster and more efficient.
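For context, this is roughly the kind of code I mean. It is a minimal sketch with a simplified, hypothetical TVec3f (plain x, y, z fields, and a Vec3 helper I added just for the example), written without vectorcall so it compiles as it stands:

```pascal
program IntersectSketch;
{$mode objfpc}
{$modeswitch advancedrecords}

type
  // Simplified, hypothetical 3 component vector with class operators
  TVec3f = record
    x, y, z: single;
    class operator + (const a, b: TVec3f): TVec3f;
    class operator - (const a, b: TVec3f): TVec3f;
    class operator * (const a: TVec3f; s: single): TVec3f;
  end;

class operator TVec3f.+ (const a, b: TVec3f): TVec3f;
begin
  Result.x := a.x + b.x;
  Result.y := a.y + b.y;
  Result.z := a.z + b.z;
end;

class operator TVec3f.- (const a, b: TVec3f): TVec3f;
begin
  Result.x := a.x - b.x;
  Result.y := a.y - b.y;
  Result.z := a.z - b.z;
end;

class operator TVec3f.* (const a: TVec3f; s: single): TVec3f;
begin
  Result.x := a.x * s;
  Result.y := a.y * s;
  Result.z := a.z * s;
end;

// Helper for the example only
function Vec3(ax, ay, az: single): TVec3f;
begin
  Result.x := ax;
  Result.y := ay;
  Result.z := az;
end;

var
  a, b, intersection: TVec3f;
  ta, tb, threshold, t: single;
begin
  a := Vec3(0, 0, 0);
  b := Vec3(2, 4, 6);
  ta := 0; tb := 1; threshold := 0.5;
  // Same expression as above, with the scalar factor pulled out into t
  t := (threshold - ta) / (tb - ta);
  intersection := a + (b - a) * t;
  WriteLn(intersection.x:0:1, ' ', intersection.y:0:1, ' ', intersection.z:0:1); // prints 1.0 2.0 3.0
end.
```

Each operator here is a separate function call that takes and returns a whole vector, which is exactly where keeping the operands in xmm registers would pay off.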
2) Record compatibility
Even if only requirement 1 is met, I'm ready to go (albeit with far clunkier code), but it would obviously be very good if vectorcall could be used with record parameters and operators. Here's an example of such a data layout:
type
  // TVec2f (not shown) is the 2 component equivalent
  TArray3f = packed array [0..2] of single;
  TVec3f = packed record
    case integer of
      0: (v: TArray3f);
      1: (xy: TVec2f);
      2: (x, y, z: single);
  end;
This should be recognised as a vector. As it stands, even a 2 or 4 component equivalent is not, and vectorcall does not work with it: code that attempts to use it produces errors.
I'm not sure what your design philosophy would be for something like this, or if there's something I haven't considered, but presumably vectorcall should just always work based on the size of the data alone (8, 12 or 16 bytes). Alternatively, a compiler directive applied to the record could simplify things.
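To illustrate the sizes I have in mind, here is a small check. The TVec2f and TVec4f definitions are my assumption of what the 2 and 4 component equivalents would look like; only TVec3f is the layout from above.

```pascal
program VecSizes;
{$mode objfpc}

type
  // Assumed 2 component equivalent
  TVec2f = packed record x, y: single; end;
  TArray3f = packed array [0..2] of single;
  TVec3f = packed record
    case integer of
      0: (v: TArray3f);
      1: (xy: TVec2f);
      2: (x, y, z: single);
  end;
  // Assumed 4 component equivalent
  TVec4f = packed record x, y, z, w: single; end;

var
  p: TVec3f;
begin
  // All three variants of TVec3f overlay the same 12 bytes
  p.x := 1; p.y := 2; p.z := 3;
  WriteLn(p.v[2]:0:1, ' ', p.xy.y:0:1);                              // prints 3.0 2.0
  WriteLn(SizeOf(TVec2f), ' ', SizeOf(TVec3f), ' ', SizeOf(TVec4f)); // prints 8 12 16
end.
```

So a rule keyed on the record size alone would cover all three cases, whichever variant the code happens to access the data through.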
That's the important stuff. But if inlining vectorcall functions could also be a thing, we'd essentially have the ability to write our own high level optimised intrinsics with no overhead, and have blazingly fast class operators. This would be incredible and really put FPC on the map.
I think that's all I have for now. I really hope at least some of this can be implemented. I would be implementing and maintaining exhaustive and highly optimised vector functions (I know some tricks) which I'd be happy to share with this community. I'm not sure if I can otherwise be of help, but I can try.
Thanks!