This is just an observation on the performance of trivial copy operations.
I noticed in a small throw-away program that operations on small records (in my case complex numbers, using unit complex) perform unexpectedly poorly. The reason is that the unit provides the elementary operations (+, -, * etc.) as operator functions, and the compiler always copies the result using REP MOVSL. The procedural versions avoid this copy and are much faster, by roughly a factor of 3 including the time needed for the floating-point calculations. So,
var C1, C2, C3: complex;
begin
  C3 := C1 + C2;
  // is convenient but MUCH slower than the somewhat more cumbersome
  complex_add(C1, C2, C3);
This may not necessarily be specific to FPC. I have seen such procedural code in other languages as well, presumably for performance reasons.
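To make the comparison concrete, here is a minimal sketch of the two forms. TCplx is a hypothetical stand-in for the complex type, and this complex_add is my own illustration, not the code of any shipped unit. Whether the operator form really produces an extra copy depends on compiler version, target and options, so check the generated assembler (for example with -al).

```pascal
program AddDemo;
{$mode objfpc}

type
  // Hypothetical stand-in for the complex type discussed above.
  TCplx = record
    re, im: Double;
  end;

// Operator form: the sum is returned by value in r.
operator + (const a, b: TCplx) r: TCplx;
begin
  r.re := a.re + b.re;
  r.im := a.im + b.im;
end;

// Procedural form: the sum is written straight into the destination c.
procedure complex_add(const a, b: TCplx; var c: TCplx);
begin
  c.re := a.re + b.re;
  c.im := a.im + b.im;
end;

var
  C1, C2, C3, C4: TCplx;
begin
  C1.re := 1.0; C1.im := 2.0;
  C2.re := 3.0; C2.im := 4.0;
  C3 := C1 + C2;            // operator form
  complex_add(C1, C2, C4);  // procedural form
  WriteLn(C3.re:0:1, ' ', C3.im:0:1);
  WriteLn(C4.re:0:1, ' ', C4.im:0:1);
end.
```

Both forms compute the same sum; the difference is only in how the result reaches its destination.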
The same holds for simple assignment / copy operations: copying a small structure as a whole is much slower than copying its individual elements, both for records and arrays. The difference can reach a factor of 100.
C2 := C1;
//is much slower than
C2.re := C1.re;
C2.im := C1.im;
That triggered my interest so I made a small benchmark program just copying a small static array of qwords or doubles, trying:
- Whole array copy A2 := A1;
- System.move ()
- Copy all elements in a loop
- Copy each element individually (without loop)
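The four variants above can be sketched roughly as follows. This is not the original benchmark program: the array length N, the use of GetTickCount64 for timing and the output format are assumptions. An optimizer may treat the repeated copies differently from what is written, so the timings are only meaningful together with the generated assembler.

```pascal
program CopyBench;
{$mode objfpc}
uses
  SysUtils;

const
  N = 4;               // elements per array (assumed; vary to see the size effect)
  Reps = 200000000;    // copy operations per variant

type
  TArr = array[0..N-1] of QWord;

var
  A1, A2: TArr;
  i, k: Integer;
  t0: QWord;
begin
  for k := 0 to N-1 do
    A1[k] := QWord(k) + 1;

  t0 := GetTickCount64;
  for i := 1 to Reps do
    A2 := A1;                        // whole array copy
  WriteLn('A2 := A1      : ', GetTickCount64 - t0, ' ms');

  t0 := GetTickCount64;
  for i := 1 to Reps do
    Move(A1, A2, SizeOf(TArr));      // System.Move
  WriteLn('Move          : ', GetTickCount64 - t0, ' ms');

  t0 := GetTickCount64;
  for i := 1 to Reps do
    for k := 0 to N-1 do
      A2[k] := A1[k];                // all elements in a loop
  WriteLn('element loop  : ', GetTickCount64 - t0, ' ms');

  t0 := GetTickCount64;
  for i := 1 to Reps do
  begin                              // each element individually, no loop
    A2[0] := A1[0];
    A2[1] := A1[1];
    A2[2] := A1[2];
    A2[3] := A1[3];
  end;
  WriteLn('individually  : ', GetTickCount64 - t0, ' ms');

  // Read A2 at the end so the copied data is actually used.
  WriteLn('check: ', A2[N-1]);
end.
```

The unrolled variant has to be edited by hand whenever N changes, which is part of what makes it awkward in practice.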
Attached is a chart of the results: time elapsed for 200,000,000 array copy operations, FPC 3.2.2, i386-win32, optimisation level 2 (-O2), on a 12th Gen Intel Core i5-1245U CPU. Summarizing:
For a structure containing more than a single qword, A2 := A1 is by far the slowest version. The compiler does the copy using REP MOVSL which appears to be really inefficient on small structures. Surprisingly, it becomes faster with increasing size, even in absolute time, although the assembler code remains the same. If the structure is only one qword, the compiler is smart enough to use VMOVSD instead of REP MOVSL.
System.move is much better but carries some call overhead, so it is not optimal for small structures.
The elementwise copy in a loop is even better for small structures but quickly becomes inefficient with increasing structure size.
The fastest solution for small structures is obviously to copy each element individually, avoiding a loop. For records like type complex containing two doubles (16 bytes), this is the optimal solution. The compiler uses VMOVSD for copying a qword when targeting a COREAVX processor; otherwise it uses MOV, which is a factor of 2 slower but still blazingly fast in comparison to REP MOVSL.
As far as I am concerned, I made my own complex unit, using procedures/methods rather than operators. When copying, I copy the real part and the imaginary part separately to avoid the REP MOVSL penalty. This is a bit awkward, and I would rather overload the := operator, but this is not possible because it already exists. Alternatively, I could write a procedure clone (const a: complex; var b: complex) or similar.
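A minimal sketch of such a clone procedure, again using a hypothetical TCplx record in place of the real complex type. The inline directive is only a request; whether the call is actually inlined depends on the compiler settings.

```pascal
program CloneDemo;
{$mode objfpc}
{$inline on}

type
  // Hypothetical stand-in for the complex type.
  TCplx = record
    re, im: Double;
  end;

// Copies a into b field by field instead of assigning the whole record.
procedure clone(const a: TCplx; var b: TCplx); inline;
begin
  b.re := a.re;
  b.im := a.im;
end;

var
  C1, C2: TCplx;
begin
  C1.re := 1.5; C1.im := -2.5;
  clone(C1, C2);
  WriteLn(C2.re:0:1, ' ', C2.im:0:1);
end.
```

This keeps the field-by-field copy in one place, so callers do not have to repeat the two assignments everywhere.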
I wonder if replacing the REP MOVSL by MOV / VMOVSD for small structures would be a worthwhile and feasible optimisation for the compiler. I guess the problem is that it would have to deal with block sizes other than multiples of 8 bytes, and be compatible with packing, bitpacking, alignment etc.