
Author Topic: Default and speed effect  (Read 3056 times)

Okoba

  • Hero Member
  • *****
  • Posts: 528
Default and speed effect
« on: May 30, 2023, 02:44:27 pm »
Hello,
I ran into an interesting issue, and I cannot understand why it happens. Hopefully someone with a better understanding of the compiler and assembly language can clarify.


In this code, there is an Assign routine that copies one record to another.
The Test procedure calls that routine, and it is simple too.


The interesting part is that if I use A := Default(TTest); beforehand, it speeds up the code by a good margin, nearly 2x.
Using A.V := nil;  does not change anything, so I don't know what magic Default does to the code that results in such a difference.

I know this is a microbenchmark, but I have faced such sudden changes in speed before, and I cannot understand why.

Lazarus: Lazarus 2.3.0 (rev main-2_3-2865-g5895004fde) FPC 3.3.1 x86_64-win64-win32/win64

Code: Pascal
program Project1;

uses
  SysUtils;

type
  TTest = record
    V: array of QWord;
  end;

  // Copies B into A, resizing A.V only when the lengths differ.
  procedure Assign(var A: TTest; const B: TTest); inline;
  begin
    if Length(A.V) <> Length(B.V) then
      SetLength(A.V, Length(B.V));
    if Length(B.V) <> 0 then
      Move(B.V[0], A.V[0], Length(B.V) * SizeOf(QWord));
  end;

  procedure Test;
  var
    T: QWord;
    I: Int64;
    A, B: TTest;
  begin
    // Commenting this out increases the time from 180 ms to 330 ms
    //A := Default(TTest);

    //A.V := nil;  // This one does not change the result

    SetLength(B.V, 1);

    T := GetTickCount64;
    for I := 1 to 100 * 1000 * 1000 do
      Assign(A, B);
    WriteLn(GetTickCount64 - T);
  end;

begin
  Test;
  ReadLn;
end.

« Last Edit: May 30, 2023, 03:35:57 pm by Okoba »

Nitorami

  • Sr. Member
  • ****
  • Posts: 481
Re: Default and speed effect
« Reply #1 on: May 30, 2023, 03:34:08 pm »
As you say, it is a microbenchmark, and I cannot reproduce it here. But I have seen similar effects: a factor of 2 simply from defining an empty string somewhere in the program. I guess not even the processor manufacturers know in detail how all their optimisation mechanisms operate together and which effect they have in the individual case.
BTW, your Assign function is wrong: Move expects a number of bytes, while your array length is in QWords.
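
To illustrate the byte-count point, here is a minimal sketch. The names TQWordArray and CopyQWords are made up for this example and are not from the original code; the only thing it is meant to show is that the third argument of Move is scaled by SizeOf(QWord).

```pascal
program MoveBytes;

{$mode objfpc}

type
  TQWordArray = array of QWord; // hypothetical helper type for this sketch

// Returns a copy of Src. Move takes a byte count, so the element count
// is multiplied by the element size.
function CopyQWords(const Src: TQWordArray): TQWordArray;
begin
  Result := nil;
  SetLength(Result, Length(Src));
  if Length(Src) <> 0 then
    Move(Src[0], Result[0], Length(Src) * SizeOf(QWord));
end;

var
  Src, Dst: TQWordArray;
  I: LongInt;
begin
  SetLength(Src, 4);
  for I := 0 to High(Src) do
    Src[I] := QWord(I + 1) * 10;

  Dst := CopyQWords(Src);

  // Every element should have been copied, including the last one.
  for I := 0 to High(Dst) do
    Write(Dst[I], ' ');
  WriteLn;
end.
```

Passing only Length(Src) as the count would copy 4 bytes here instead of 32, so only part of the first element would arrive.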

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #2 on: May 30, 2023, 03:37:11 pm »
Thank you, I fixed the mistake.
As you said, it happens in your case too. And my real case is not a microbenchmark; I just isolated the code to have a sample to share here.

Blaazen

  • Hero Member
  • *****
  • Posts: 3237
  • POKE 54296,15
    • Eye-Candy Controls
Re: Default and speed effect
« Reply #3 on: May 30, 2023, 05:35:59 pm »
This may be a coincidence caused by code alignment (the generated asm is aligned to some nice value like 32 B or so).
Lazarus 2.3.0 (rev main-2_3-2863...) FPC 3.3.1 x86_64-linux-qt Chakra, Qt 4.8.7/5.13.2, Plasma 5.17.3
Lazarus 1.8.2 r57369 FPC 3.0.4 i386-win32-win32/win64 Wine 3.21


Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #4 on: May 30, 2023, 05:37:21 pm »
This happens a lot for me, so I would like to have more info about it, and maybe a way to nudge FPC in the better direction.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9754
  • Debugger - SynEdit - and more
    • wiki
Re: Default and speed effect
« Reply #5 on: May 30, 2023, 06:07:57 pm »
Quote from: Blaazen
This may be a coincidence caused by code alignment (the generated asm is aligned to some nice value like 32 B or so).

My thought too. Especially because it moves the for loop. And that is a for loop that is rather short.
On Intel there is a cache for translated micro-ops, which works on 32-byte boundaries, and if the loop ends up fitting -> wooooom.

However, that cache isn't everything. Modern CPUs do a lot of different optimizations. If you try to go all out for one of them, you may lose on another one.

https://www.youtube.com/watch?v=r-TLSBdHe1A



Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #6 on: May 30, 2023, 06:13:23 pm »
Thank you Martin. And thanks for the link.
So Default is doing nothing of benefit by itself? Is it only that the result value allocates memory, which makes the alignment better?
Is there a low-level way to find out about this, use it, or learn more about it? As I saw, this alignment issue (adding a var or a function result) makes a difference. It would be useful for very delicate functions (like filling memory or finding something).

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9754
  • Debugger - SynEdit - and more
    • wiki
Re: Default and speed effect
« Reply #7 on: May 30, 2023, 06:22:00 pm »
Quote from: Okoba
Thank you Martin. And thanks for the link.
So Default is doing nothing of benefit by itself? Is it only that the result value allocates memory, which makes the alignment better?
Is there a low-level way to find out about this, use it, or learn more about it? As I saw, this alignment issue (adding a var or a function result) makes a difference. It would be useful for very delicate functions (like filling memory or finding something).

You can check the alignment in the assembler window of the debugger.

But it won't work magically for every loop. It requires very specific conditions (and may even vary depending on the CPU generation on which it runs).
There is a $CODEALIGN directive, which afaik can also be used for loops (its LOOP parameter; $Align itself only controls record alignment).
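
As a rough sketch of what such a directive could look like: FPC documents a {$CODEALIGN} directive with a LOOP parameter. The value 32 below is an assumption chosen to match the 32-byte boundary discussed in this thread, and SumTo is a hypothetical stand-in for a short hot loop. Whether the request is honoured, and whether it helps at all, depends on the compiler version, target and CPU, so check the result in the assembler window.

```pascal
program LoopAlign;

{$mode objfpc}

// Assumption: request 32-byte alignment for loop starts in this unit.
// The compiler may ignore or adjust this depending on version and target.
{$CODEALIGN LOOP=32}

// Hypothetical short hot loop, similar in spirit to the test loop above.
function SumTo(N: Int64): Int64;
var
  I: Int64;
begin
  Result := 0;
  for I := 1 to N do
    Result := Result + I;
end;

begin
  WriteLn(SumTo(100 * 1000 * 1000));
end.
```

Timing the same loop with and without the directive, on the machine you actually care about, is the only reliable way to see if it makes a difference.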

I have once (and once only) seen this on a longer loop, probably because a specific part of the loop landed on the sweet spot.
Otherwise, very short loops with very high iteration counts are the most likely to benefit.

Like your test loop:
- it likely fit into 32 bytes (max 64),
- and it also started at a 32-byte boundary,
- and it ran a massive number of iterations...
- **And** additionally it worked repeatedly on the same data A and B => so the data was in cache already.
==> That is quality food for the CPU's internal optimizer. But that does not really happen exactly like this in real life. (Yet the effect can happen in real life too.)

But I am not an expert, so I may be wrong with my above assumptions.

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #8 on: May 31, 2023, 09:15:35 am »
Thank you for the explanation.
I know nothing about assembly, but maybe I should, if I want to use these optimizations. My first thought was that maybe there is something in my code that I can change.

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #9 on: May 31, 2023, 09:32:31 am »
Just to be clear, what does Default() do that makes the code change?
And does removing it change any behaviour in the sample code? Is it redundant, and am I just seeing a side effect of it in the generated assembly?

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9754
  • Debugger - SynEdit - and more
    • wiki
Re: Default and speed effect
« Reply #10 on: May 31, 2023, 11:20:00 am »
Quote from: Okoba
Just to be clear, what does Default() do that makes the code change?
And does removing it change any behaviour in the sample code? Is it redundant, and am I just seeing a side effect of it in the generated assembly?

Default assigns a value to the (entire) record.
- In this case the value is zero / nil.
- In this case the record and the field in it are the same thing (in general, a record could also contain padding).

So in this case it makes the same change as ":= nil" => but it may use different assembler code. That may mean the code occupies a different number of bytes, and subsequent code gets shifted => changing the alignment (in this case of the loop) => affecting how fast it runs.
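
A small sketch of that equivalence, reusing the TTest record from the first post. The program name and the variables A and B are only for illustration. It shows that both statements leave the record in the same observable state, and it says nothing about which machine code the compiler emits for each.

```pascal
program DefaultVsNil;

{$mode objfpc}

type
  TTest = record
    V: array of QWord;
  end;

var
  A, B: TTest;
begin
  SetLength(A.V, 3);
  SetLength(B.V, 3);

  A := Default(TTest); // resets the whole record
  B.V := nil;          // resets its only field

  // Both records now hold an empty (nil) dynamic array.
  WriteLn('Length(A.V) = ', Length(A.V));
  WriteLn('Length(B.V) = ', Length(B.V));
  WriteLn('both nil: ', (Pointer(A.V) = nil) and (Pointer(B.V) = nil));
end.
```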




Your CPU does a lot more than you expect.

"mov rax, [address]" (move data between mem and register)
The CPU produces a whole program from that. So called micro-ops. And those get executed. That translation can be cached. And that can have an effect on the speed.

But caching at bounds is not all.
Sometimes a list of steps can be changed in order, still meaning the same (e.g. "A:=1; b:=2;" can be swapped). That also happens sometimes with conditions and other specific ops.
Depending on the order, the speed may differ. (And no, don't try to reorder your Pascal code => that won't do any good. This is about the assembler level. You can't affect that.)
Because, if in the correct order, your CPU can do them in parallel. But only if in the correct order.

And there are way more things affecting speed.

What you can affect, is in which order to access data in memory. And this can have an impact on cache usage (data/memory cache).
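
A minimal sketch of the memory access order point, under assumptions: the matrix size N and the names TMatrix, SumRowMajor and SumColMajor are invented for this example. Both functions compute the same sum, but the first reads each row front to back while the second jumps to a different row on every read. On many machines the second is noticeably slower, though the actual numbers depend on the CPU and cache sizes.

```pascal
program AccessOrder;

{$mode objfpc}

uses
  SysUtils;

const
  N = 2048; // hypothetical matrix size, chosen only for illustration

type
  TMatrix = array of array of LongInt;

// Reads each row from start to end, so consecutive reads are adjacent in memory.
function SumRowMajor(const M: TMatrix): Int64;
var
  R, C: LongInt;
begin
  Result := 0;
  for R := 0 to High(M) do
    for C := 0 to High(M[R]) do
      Result := Result + M[R][C];
end;

// Reads column by column, so every read touches a different row's memory.
function SumColMajor(const M: TMatrix): Int64;
var
  R, C: LongInt;
begin
  Result := 0;
  if Length(M) = 0 then
    Exit;
  for C := 0 to High(M[0]) do
    for R := 0 to High(M) do
      Result := Result + M[R][C];
end;

var
  M: TMatrix;
  R, C: LongInt;
  T: QWord;
  S: Int64;
begin
  SetLength(M, N, N);
  for R := 0 to N - 1 do
    for C := 0 to N - 1 do
      M[R][C] := R + C;

  T := GetTickCount64;
  S := SumRowMajor(M);
  WriteLn('row-major:    sum=', S, ' time=', GetTickCount64 - T, ' ms');

  T := GetTickCount64;
  S := SumColMajor(M);
  WriteLn('column-major: sum=', S, ' time=', GetTickCount64 - T, ' ms');
end.
```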

And looking at algorithms used: Keyword "Big O"




Then again => Is that code actually worth the optimization?

How often is it called? How much percentage time of your app does it take? If you speed it up by 5% (in real life, not in an artificial benchmark), how much does your app gain?

Because let's pretend that code makes up 10% of the runtime of your app, and you speed it up by 5% => your app will be only 0.5% faster.
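
The arithmetic behind that estimate can be written down as a tiny helper. OverallGainPercent is a made-up name for this sketch. It assumes "speed up by 5%" means the time spent in that part shrinks by 5%, and that nothing else in the app changes.

```pascal
program OverallGain;

{$mode objfpc}

// Percentage of total runtime saved when a part that takes PartPercent of
// the runtime has its own time reduced by PartGainPercent.
function OverallGainPercent(PartPercent, PartGainPercent: Double): Double;
begin
  Result := PartPercent * PartGainPercent / 100.0;
end;

begin
  // The case from the post: 10% of the runtime, made 5% faster.
  WriteLn(OverallGainPercent(10, 5):0:2, ' %');   // prints 0.50 %
  // Even halving the time of that part saves only 5% overall.
  WriteLn(OverallGainPercent(10, 50):0:2, ' %');  // prints 5.00 %
end.
```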







440bx

  • Hero Member
  • *****
  • Posts: 3921
Re: Default and speed effect
« Reply #11 on: May 31, 2023, 11:43:39 am »
Quote from: Okoba
I know nothing about assembly, but maybe I should, if I want to use these optimizations. My first thought was that maybe there is something in my code that I can change.

I just want to add to, and reinforce, what @Martin_fr has mentioned.

The first thing to understand is that program optimization is a tricky little "critter".  For instance, optimizing a parallel algorithm is usually very different from optimizing a single threaded algorithm for the same purpose.   

After that, the algorithm's Big O is a crucial characteristic. 

After that come CPU-friendly optimizations, e.g., avoiding cache misses, keeping execution units busy, and a very long list of "arcane" details that are CPU architecture dependent.  Knowing assembler _and_ the CPU architecture is required for making these optimizations.

A programmer should first decide if the algorithm is going to be parallel or single threaded, then choose an algorithm based on its Big O (which usually affects ease of implementation), and only then look at the assembly code generated by the compiler.  More often than not, the first two are where a programmer's efforts are concentrated.  The last one is rarely justified in applications programming but common when writing an O/S.

I always recommend that a programmer learn at least some assembler, and it wouldn't hurt for a programmer to read either Intel's or AMD's CPU documentation (particularly the system software manual).

HTH.
 
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #12 on: May 31, 2023, 01:34:18 pm »
Thank you very much Martin and 440bx.
Those were a good read and code alignment is an interesting point.

Okoba

  • Hero Member
  • *****
  • Posts: 528
Re: Default and speed effect
« Reply #13 on: May 31, 2023, 01:37:44 pm »
I was working on the sample and found another problem. I looked at the assembly code and it does not seem to be an alignment problem. The generated assembly has Initialize and FPC_COPY on every loop step, and it is nearly 10x slower.
Why does adding the Default before the loop affect the loop in such a manner?
Code: Pascal
program Project1;

{$mode objfpc} // needed for Result inside the operator when built outside Lazarus

uses
  SysUtils;

type
  TTest = record
    V: array of QWord;
  end;

  // Copies B into A, resizing A.V only when the lengths differ.
  procedure Assign(var A: TTest; const B: TTest); inline;
  begin
    if Length(A.V) <> Length(B.V) then
      SetLength(A.V, Length(B.V));
    if Length(B.V) <> 0 then
      Move(B.V[0], A.V[0], Length(B.V) * SizeOf(QWord));
  end;

  // Stores a single QWord in A, resizing A.V to one element if needed.
  procedure Assign(var A: TTest; const B: QWord); inline;
  begin
    if Length(A.V) <> 1 then
      SetLength(A.V, 1);
    A.V[0] := B;
  end;

  // Implicit conversion QWord -> TTest, used by "A := I" below.
  operator := (const AValue: QWord): TTest; //inline;
  begin
    Assign(Result, AValue);
  end;

  procedure Test;
  var
    T: QWord;
    I: Int64;
    A: TTest;
  begin
    //A := Default(TTest);

    T := GetTickCount64;
    for I := 1 to 1000 * 1000 * 1000 do
      //Assign(A, I); // 1140 ms with and without Default
      A := I; // 1900 ms with Default and 15500 ms without (nearly 10x slower)
    WriteLn(GetTickCount64 - T);
  end;

begin
  Test;
  ReadLn;
end.

ASerge

  • Hero Member
  • *****
  • Posts: 2212
Re: Default and speed effect
« Reply #14 on: May 31, 2023, 04:29:06 pm »
Quote from: Okoba
Why does adding the Default before the loop affect the loop in such a manner?

Default, as well as the automatic initialization at declaration (for managed types), is implemented through the implicit declaration of another variable with the given value, which is then assigned.
In this case the compiler, seeing that the variable is already referenced somewhere, makes a safe assignment: a temporary variable is created, the assignment goes into it, and then it is copied via fpc_copy_proc to our variable.
If there are no references to the variable, the compiler optimizes and does not create a temporary variable, avoiding the call to fpc_copy_proc.

 
