Recent

Author Topic: Slow copying of small structures  (Read 1945 times)

Nitorami

  • Hero Member
  • *****
  • Posts: 505
Slow copying of small structures
« on: April 17, 2024, 03:30:31 pm »
This is just an observation on performance of trivial copy operations.

I noticed in a small throw-away program that the performance of operations with small records (in my case complex numbers, using unit complex) is unexpectedly poor. The reason is that the unit provides the elementary operations (+, - ,* etc.) as function/operators, and the compiler always makes a copy of the result using REP MOVSL. This is not the case in procedural versions which are much faster, some factor 3 including the time needed for the floating point calculations. So,
Code: Pascal  [Select][+][-]
  1. Var C1,C2,C3 : complex ;
  2. begin
  3.   C3 := C1 + C2;
  4. //is convenient but MUCH slower than the somewhat more cumbersome
  5.   complex_add (C1,C2,C3);
  6.  

This may not necessarily be specific to FPC. I have seen such procedural code in other languages as well, presumably for performance reasons.

Similarly for simple assignment / copy operations, copying a small structure as a whole is much slower than copying its individual elements, both for records and arrays. This can be factor 100.
Code: Pascal  [Select][+][-]
  1. C2 := C1;
  2. //is much slower than
  3. C2.re := C1.re;
  4. C3.im := C1.im;
  5.  
That triggered my interest so I made a small benchmark program just copying a small static array of qwords or doubles, trying:
  • Whole array copy A2 := A1;
  • System.move ()
  • Copy all elements in a loop
  • Copy each element individually (without loop)
Attached a chart of the results, time elapsed for 200.000.000 array copy operations, FPC3.2.2., i386-win32, optimisation Level2, on a CPU 12th Gen Intel Core i5-1245U. Summarizing:

For a structure containing more than a single qword, A2 := A1 is by far the slowest version. The compiler does the copy using REP MOVSL which appears to be really inefficient on small structures. Surprisingly, it becomes faster with increasing size, even in absolute time, although the assembler code remains the same. If the structure is only one qword, the compiler is smart enough to use VMOVSD instead of REP MOVSL.

System.move is much better but obviously needs some overhead, hence not optimal for small structures.

The elementwise copy in a loop is even better at small structures but quickly becomes inefficient with increasing structure size.

The fastest solution for small structures is obviously to copy each element individually, avoiding a loop. For records like type complex containing two doubles (16 bytes), this is the optimal solution. The compiler uses VMOVSD for copying a qword on COREAVX processor; otherwise MOV, which is factor 2 slower while still blazingly fast in comparison to REP MOVSL.

As far as I am concerned, I made my own complex unit, using procedures/methods rather than operators. When copying, I copy real part and imag part separately to avoid the REP MOVSL penalty. This is a bit awkward, and I would rather overload the := operator but this is not possible because it already exists. I could alternatively write a procedure clone (const a: complex; var b: complex) or similar instead.

I wonder if replacing the REP MOVSL by MOV / VMOVSD for small structures would by a worthwhile and feasible optimisation to the compiler. I guess the problem is that it would have to deal with other block sizes than multiples of 8 bytes, and be compatible with packing, bitpacking, alignment etc.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11519
  • FPC developer.
Re: Slow copying of small structures
« Reply #1 on: April 17, 2024, 04:10:03 pm »


rep movsl speed is CPU dependent, search for "fast short rep mov"

But afaik Intel has it since 10th gen. Still it would be good to do this analysis on more CPU types.

Thaddy

  • Hero Member
  • *****
  • Posts: 14595
  • Sensorship about opinions does not belong here.
Re: Slow copying of small structures
« Reply #2 on: April 17, 2024, 04:59:26 pm »
It is not so much the mov, but the rep. And indeed, it depends on the processor. rep has always been relatively slow on intel hardware. What I do not understand is that the compiler actually uses it in its code generation. That rep issue has always been there since the 8088/8086, probably since 4004, family. AMD's have traditionally been much faster. So it is a very old issue.
I am a bit bemused for two reasons:
1. that the compiler uses it
2. that this is a recurring issue that did not occur in the past 35 years or so.

I used to use rep back in the late 80's until a collegue alerted me that that was not such a good idea: Sweitze de Vries at Cyco software. Always about micro optimizations with no real impact.
And yes, "we used all the profilers in the world".
« Last Edit: April 17, 2024, 05:20:32 pm by Thaddy »
bitrate is always calculated like this:sample rate * bitdepth * number of channels.

alpine

  • Hero Member
  • *****
  • Posts: 1110
Re: Slow copying of small structures
« Reply #3 on: April 17, 2024, 06:10:14 pm »
This may not necessarily be specific to FPC. I have seen such procedural code in other languages as well, presumably for performance reasons.

Similarly for simple assignment / copy operations, copying a small structure as a whole is much slower than copying its individual elements, both for records and arrays. This can be factor 100.
Code: Pascal  [Select][+][-]
  1. C2 := C1;
  2. //is much slower than
  3. C2.re := C1.re;
  4. C3.im := C1.im;
  5.  
Is it all about copying small structures or it is for operator overloading? 

Because for the latter, procedural code will be more efficient for the reason of being not dependent on the infix notation in which the operators were used. The infix notation presumes creation of temporaries, thus involving copying for the assignments. The compiler knows nothing about the properties of the user defined type so it cannot afford any optimizations and works in the most straightforward way.

Into the procedural code there is no such considerations.

BTW you can trick the compiler by defining an additional type and a dedicated assignment operator:
Code: Pascal  [Select][+][-]
  1. type
  2.   float_type = single;
  3.  
  4.   complex = record
  5.     re : float_type;
  6.     im : float_type;
  7.   end;
  8.  
  9.   complex_ext = type complex;
  10.  
  11.   operator * (const r : float_type; constref z1 : complex) z : complex_ext; inline;
  12.   begin
  13.     z.re := z1.re * r; z.im := z1.im * r;
  14.   end;
  15.  
  16.   operator * (constref z1, z2 : complex) z : complex_ext; inline;
  17.   begin
  18.     z.re := (z1.re * z2.re) - (z1.im * z2.im);
  19.     z.im := (z1.re * z2.im) + (z1.im * z2.re);
  20.   end;
  21.  
  22.   operator * (constref z1 : complex; const r : float_type) z : complex_ext; inline;
  23.   begin
  24.     z.re := z1.re * r; z.im := z1.im * r;
  25.   end;
  26.  
  27.   operator := (constref z2 : complex_ext) z : complex; inline;
  28.   begin
  29.      z.re := z2.re;
  30.      z.im := z2.im;
  31.   end;
  32.  
  33. var
  34.   R : real;
  35.   C, Z : complex;
  36.  
  37. begin
  38.   R := 1.0;
  39.   C.re := 1.0;
  40.   C.im := 0.0;
  41.   Z.re := 1.0;
  42.   Z.im := 0;
  43.   C := R * (R*Z) * 1.0;
  44.   C := R * R*Z * 1.0;
  45.   C := 1.0 * R*Z;
  46.   C := 1.0 * R * R*Z;
  47. end.
« Last Edit: April 17, 2024, 06:18:46 pm by alpine »
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

Thaddy

  • Hero Member
  • *****
  • Posts: 14595
  • Sensorship about opinions does not belong here.
Re: Slow copying of small structures
« Reply #4 on: April 17, 2024, 06:37:29 pm »
Is it all about copying small structures or it is for operator overloading?
No.
bitrate is always calculated like this:sample rate * bitdepth * number of channels.

jamie

  • Hero Member
  • *****
  • Posts: 6221
Re: Slow copying of small structures
« Reply #5 on: April 18, 2024, 12:37:04 am »
It takes a few steps to setup a REP mov like that. You need to load the xSI, xDI, xCX  and initate the REP  xmov

Those steps alone most likely take more than a simple two steps of moving an 8 byte value via registers.

 SO the slowness you are seeing on small objects is most likely due to the initial setup of that operation.
 
 of course, on larger fields, once setup, your savings of using the REP ... should show.

Maybe the compiler should take into account those steps and use a series of register moves that would equal the same time as the initial loading of a small chunk or less.


The only true wisdom is knowing you know nothing

Nitorami

  • Hero Member
  • *****
  • Posts: 505
Re: Slow copying of small structures
« Reply #6 on: April 18, 2024, 09:43:56 am »
Thank you, that all makes sense.
@Alpine: It is about all operations on small structures where the compiler inserts a REP MOVSL, as in an operator or function, or when copying. Thanks for the hint with the hard types to cheat the compiler, that could be useful. In the matter at hand it won't help because the := operator is still an operator with infix notation, and the compiler will insert its REP MOVSL no matter what (inlining it does not change this).

@marcov: I repeated the test on my ageing budget PC, an AMD A4-5300 APU with Radeon HD Graphics. Here is the result.
The difference between REP MOVSL and VMOVSD is not as grim as on the Intel, but still factor 10. On the old AMD, the REP MOVSL with small chunk sizes is even faster than on my fairly new Intel notebook.
Apart from that, the issues are similar, up to chunk sizes of at least 128 bytes, REP MOVSL is the slowest way to copy structures, and the compiler should avoid it.
« Last Edit: April 18, 2024, 10:48:37 am by Nitorami »

alpine

  • Hero Member
  • *****
  • Posts: 1110
Re: Slow copying of small structures
« Reply #7 on: April 18, 2024, 11:25:10 am »
Thank you, that all makes sense.
@Alpine: It is about all operations on small structures where the compiler inserts a REP MOVSL, as in an operator or function, or when copying. Thanks for the hint with the hard types to cheat the compiler, that could be useful. In the matter at hand it won't help because the := operator is still an operator with infix notation, and the compiler will insert its REP MOVSL no matter what (inlining it does not change this).
By crafting the expression (see my last 4 lines) you can actually remove the REPs, but that's not the point. Generally, the infix operator notation will always require temporaries, hence operators are suboptimal. And fpc lacks copy constructors, static objects, etc. it's a pity.

Optimizing the copying of structures will be fantastic, but in the context of operators I doubt it will change the situation with regard to the procedures.
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

Nitorami

  • Hero Member
  • *****
  • Posts: 505
Re: Slow copying of small structures
« Reply #8 on: April 18, 2024, 03:40:18 pm »
Ok the temporaries are necessary but this would not be of concern if copying them wasn't so slow. The reason why the compiler inserts no REP MOVSL in your code is, I believe, because you use single. When changing to double, the REP MOVSL are back.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11519
  • FPC developer.
Re: Slow copying of small structures
« Reply #9 on: April 18, 2024, 03:53:00 pm »

By crafting the expression (see my last 4 lines) you can actually remove the REPs, but that's not the point. Generally, the infix operator notation will always require temporaries, hence operators are suboptimal. 

For simple types those temporaries could be optimized out when inlined

alpine

  • Hero Member
  • *****
  • Posts: 1110
Re: Slow copying of small structures
« Reply #10 on: April 18, 2024, 04:15:44 pm »
Ok the temporaries are necessary but this would not be of concern if copying them wasn't so slow. The reason why the compiler inserts no REP MOVSL in your code is, I believe, because you use single. When changing to double, the REP MOVSL are back.
True, that's why I put the float_type - to experiment with different types.

My point is - while the small structure copying can be optimized, and that will be great by itself, the operators will stay suboptimal compared to the procedures, since the procedures won't require copying at all. They'll evaluate the calculations in correct order using references, probably without the need for intermediates. 
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

Nitorami

  • Hero Member
  • *****
  • Posts: 505
Re: Slow copying of small structures
« Reply #11 on: April 18, 2024, 04:41:52 pm »
@Alpine: yes, understood.
@Marcov: I do not know if it is theoretically possible to optimise out the temporaries and the REP MOVSL. The compiler currently does not do it, even with inlining. Only for structures up to 8 bytes in size, VMOVSD is used instead of REP MOVSL. Operations with complex number of type single are therefore significantly faster than when using double.

 

TinyPortal © 2005-2018