I can explain why the last example is getting passed as a pointer: there's nothing to restrict the type to a 16-byte boundary, so the compiler has to assume that every variable of that type is unaligned, even if one does happen to fall on a 16-byte boundary. Hence it's treated as a complex record type. That's intended behaviour.
I noticed the "movdqa %xmm0,%xmm0" myself during development and wasn't sure what was causing it. When I submit my next batch of improvements to the peephole optimizer, I'll look out for that one (it already removes instructions like "mov %eax,%eax", for example). I'll have to double-check which encoding is being emitted and whether the matching ymm or zmm register is in use, though. The legacy SSE "movdqa %xmm0,%xmm0" leaves the upper bits alone, but the VEX-encoded form, "vmovdqa %xmm0,%xmm0", zeroes the upper 128 bits of %ymm0 and the upper 384 bits of %zmm0, and hence isn't a null operation.
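To make the kind of check I have in mind concrete, here is a toy sketch in Python. It is not the FPC peephole optimizer, and the names remove_null_moves and SAFE_SELF_MOVES are made up for the illustration. It assumes AT&T-syntax lines and relies on the fact that a legacy SSE move of a register onto itself changes no state, whereas the VEX-encoded equivalent clears the upper part of the wider register and so has to stay.

```python
import re

# Matches AT&T-syntax register-to-register moves, e.g. "movdqa %xmm0,%xmm0".
MOVE_RE = re.compile(r"^\s*(v?mov\w*)\s+(%\w+)\s*,\s*(%\w+)\s*$")

# Hypothetical allow-list: legacy SSE moves whose self-move changes no state.
# VEX forms (vmovdqa, vmovaps, ...) are deliberately absent because they
# zero the upper bits of the matching ymm/zmm register.
SAFE_SELF_MOVES = {"movdqa", "movdqu", "movaps", "movups", "movapd", "movupd"}


def remove_null_moves(lines):
    """Return the lines with provably redundant self-moves dropped."""
    kept = []
    for line in lines:
        m = MOVE_RE.match(line)
        if m and m.group(2) == m.group(3) and m.group(1) in SAFE_SELF_MOVES:
            continue  # same source and destination, no side effect: drop it
        kept.append(line)
    return kept


if __name__ == "__main__":
    listing = [
        "  movdqa %xmm0,%xmm0",   # legacy SSE self-move: removed
        "  vmovdqa %xmm0,%xmm0",  # VEX self-move: kept
        "  movdqa %xmm1,%xmm0",   # real move: kept
    ]
    for line in remove_null_moves(listing):
        print(line)
```

A real pass would work on the compiler's internal instruction list rather than on text, and would also need to know whether anything later reads the upper half of the register.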
Passing Self into RCX when it should be RDI is indeed a bug, and I would recommend posting this as an actual bug report. I'm not sure if I can do anything about Self always being passed by reference, though. I think the compiler treats it like an object. What's the generated assembly for a record containing a single integer field? Moving the 2nd parameter into XMM0 and then into XMM1 looks like a compiler inefficiency in how it allocates temporary registers (do the debug messages say anything about the registers being allocated and released?). It can be corrected either in the peephole optimizer (which does similar things already) or with more advanced Data Flow Analysis (something I'm working on, which I named the "Deep Optimizer" before I discovered the official term). Such a feature will also help to correct the mixing of 'movdqa' and 'movaps', since using the wrong one will incur a performance penalty (you should only use 'movdqa' if you're using the relevant registers for integer operations).
To see which parameters are passed in which registers, you'll have to compile a Pascal function that uses vectorcall with a number of vector-like parameters and see how they interact. Vectorcall dictates that XMM0 to XMM5 are used for vector/float inputs, HFAs and HVAs. If there aren't enough free registers to fully contain a homogeneous aggregate (basically, an array of 1 to 4 aligned vectors or floats of the same type), it is wholly passed on the stack, but any vector/float parameters that follow will still go into the registers that are left. Return values are passed back through XMM0 to XMM3, although XMM1 to XMM3 are only used if the return type is a homogeneous aggregate. Integer parameters are passed in the same way as the regular Win64 calling convention dictates (or, on Win32, following the rules of 'fastcall').