Recent

Author Topic: Useful optimizations for a video game project  (Read 3743 times)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9855
  • Debugger - SynEdit - and more
    • wiki
Re: Useful optimizations for a video game project
« Reply #15 on: June 23, 2022, 06:34:08 pm »
So here is an example for the 32 byte alignment

"foo" has whatever alignment it gets by surrounding code. Also, its loop is offset by the code in front of it.
It takes 4000 ms (on my PC:  I7-8700)

Then the loop at exactly 32 byte aligned: 3640 ms (almost 10% faster)
The loop with an offset of 32+8 also is fast => so relevant code inside the loop must have just hit the right alignment.

The loop that is intentionally 32+16 takes 4000.

So (on modern CPU), just adding the right align can make a noticeable diff.

And since functions are aligned at 16 bytes, it depends on where the previous function ended. And be sometime fast, and sometime not.
Which also means, if you benchmark, and you change code in one place, then code in another place may be re-aligned, and be faster or slower. Your total benchmark then may change more by the accidental align change, than by the change you tried to measure.

See https://lists.freepascal.org/pipermail/fpc-devel/2022-January/044336.html
Includes a very interesting video presentation on the topic


Code: Text  [Select][+][-]
  1. 4000
  2. 4016
  3.  
  4. 3640
  5. 3625
  6.  
  7. 3610
  8. 3625
  9.  
  10. 4015
  11. 4016
  12.  

Code: Pascal  [Select][+][-]
  1. program Project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   {$IFDEF UNIX}
  7.   cthreads,
  8.   {$ENDIF}
  9.   Classes, SysUtils
  10.   { you can add units after this };
  11.  
  12. {$R *.res}
  13.  
  14. const
  15.   N = 150*1024*1024;
  16. var
  17.   a, b, c: array of byte;
  18.  
  19. procedure foo;
  20. var
  21.   i: Integer;
  22. begin
  23.   c[0] := (a[0] + b[0]) div 2;
  24.  
  25.   for i := 1 to N-1 do begin
  26.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  27.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  28.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  29.   end;
  30. end;
  31.  
  32. procedure foo2;
  33. var
  34.   i: Integer;
  35. begin
  36.   c[0] := (a[0] + b[0]) div 2;
  37.  
  38.   asm
  39.   .align 32
  40.   end;
  41.  
  42.   for i := 1 to N-1 do begin
  43.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  44.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  45.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  46.   end;
  47. end;
  48.  
  49. procedure foo3;
  50. var
  51.   i: Integer;
  52. begin
  53.   c[0] := (a[0] + b[0]) div 2;
  54.  
  55.   asm
  56.   .align 32
  57.   nop
  58.   nop
  59.   nop
  60.   nop
  61.   nop
  62.   nop
  63.   nop
  64.   nop
  65.   end;
  66.  
  67.   for i := 1 to N-1 do begin
  68.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  69.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  70.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  71.   end;
  72. end;
  73.  
  74. procedure foo4;
  75. var
  76.   i: Integer;
  77. begin
  78.   c[0] := (a[0] + b[0]) div 2;
  79.  
  80.   asm
  81.   .align 32
  82.   nop
  83.   nop
  84.   nop
  85.   nop
  86.   nop
  87.   nop
  88.   nop
  89.   nop
  90.   nop
  91.   nop
  92.   nop
  93.   nop
  94.   nop
  95.   nop
  96.   nop
  97.   nop
  98.   end;
  99.  
  100.   for i := 1 to N-1 do begin
  101.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  102.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  103.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
  104.   end;
  105. end;
  106.  
  107. var
  108.   t: QWord;
  109.   i: Integer;
  110. begin
  111.   SetLength(a, N);
  112.   SetLength(b, N);
  113.   SetLength(c, N);
  114.   for i := 0 to N-1 do begin
  115.     a[i] := Random(255);
  116.     b[i] := Random(255);
  117.   end;
  118.  
  119.  
  120.   t := GetTickCount64;
  121.   foo;
  122.   t := GetTickCount64 -t;
  123.   writeln(t);
  124.  
  125.   t := GetTickCount64;
  126.   foo;
  127.   t := GetTickCount64 -t;
  128.   writeln(t);
  129.  
  130.  
  131.   t := GetTickCount64;
  132.   foo2;
  133.   t := GetTickCount64 -t;
  134.   writeln(t);
  135.  
  136.   t := GetTickCount64;
  137.   foo2;
  138.   t := GetTickCount64 -t;
  139.   writeln(t);
  140.  
  141.  
  142.   t := GetTickCount64;
  143.   foo3;
  144.   t := GetTickCount64 -t;
  145.   writeln(t);
  146.  
  147.   t := GetTickCount64;
  148.   foo3;
  149.   t := GetTickCount64 -t;
  150.   writeln(t);
  151.  
  152.  
  153.   t := GetTickCount64;
  154.   foo4;
  155.   t := GetTickCount64 -t;
  156.   writeln(t);
  157.  
  158.   t := GetTickCount64;
  159.   foo4;
  160.   t := GetTickCount64 -t;
  161.   writeln(t);
  162.  
  163.  
  164.   readln;
  165. end.
  166.  

furious programming

  • Hero Member
  • *****
  • Posts: 858
Re: Useful optimizations for a video game project
« Reply #16 on: June 23, 2022, 09:52:18 pm »
Thank you very much for the example. I will definitely check this trick in the future.

But I just tested your test program on my Intel® Core™ i7-640LM (which is quite old) and I can't reproduce your results. Aligned code is slightly faster in the debug build mode (generated in the project options window), below are the results:

Code: Pascal  [Select][+][-]
  1. 9203
  2. 9125
  3. 8860
  4. 9031
  5. 8969
  6. 9000
  7. 9312
  8. 9250

but in the release mode (also generated by the Lazarus), there is no gain — aligned code is actually slower than not aligned:

Code: Pascal  [Select][+][-]
  1. 1671
  2. 1657
  3. 1906
  4. 1906
  5. 1891
  6. 1906
  7. 1891
  8. 1890

It looks like the optimizations itself are giving the best performance in this case.
« Last Edit: June 23, 2022, 10:09:33 pm by furious programming »
Lazarus 3.2 with FPC 3.2.2, Windows 10 — all 64-bit

Working solo on an acrade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL. Release planned in 2026.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5462
  • Compiler Developer
Re: Useful optimizations for a video game project
« Reply #17 on: June 24, 2022, 09:04:09 am »
Quote
I would assume that you'll need some thight kernel here programmed in asm which fully uses AVX2 / AVX512 (or comparable) capabilities to get somewhere.

I can always use calculations only on integers (as in the good old days), because high precision of calculations will not be required — after all, the image will be highly pixelated. But there will be time for that.

SIMD instruction sets are not restricted to floating point values, but can be used with integers as well. Thus if you have multiple, equivalent integer operations that can be done in parallel (e.g. adding a vector) you can utilize SIMD.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Useful optimizations for a video game project
« Reply #18 on: June 24, 2022, 09:58:38 am »
Thank you very much for the example. I will definitely check this trick in the future.

But I just tested your test program on my Intel® Core™ i7-640LM (which is quite old) and I can't reproduce your results.
I can't either reproduce the results.

11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz   2.42 GHz (Laptop) :
Code: Pascal  [Select][+][-]
  1. C:\fpc-laz\fpc\3.2.2-git\bin\i386-win32\ppc386.exe
  2. -MObjFPC
  3. -Scghi
  4. -O1
  5. -gw2
  6. -godwarfsets
  7. -gl
  8. -l
  9. -vewnhibq
  10. -Filib\i386-win32
  11. -Fu.
  12. -FUlib\i386-win32
  13. -FE.
  14. -opgmSpeedTest.exe
  15. -OoREGVAR
Compiling -O1 -OoREGVAR gives very satisfactory speed and reasonable debugging.
Times for I386 and x86_64 are very similar.

Timings win10 FPC 3.2.2 i386 :
719
734
1063
1062
1079
1062
1078
1063
Trying to align code seems to be counterproductive for -O1 -OoREGVAR

What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).

furious programming

  • Hero Member
  • *****
  • Posts: 858
Re: Useful optimizations for a video game project
« Reply #19 on: June 24, 2022, 10:46:50 am »
Thus if you have multiple, equivalent integer operations that can be done in parallel (e.g. adding a vector) you can utilize SIMD.

This is the reason why the use of SIMD will not be possible — there will not be many same operations to be performed in parallel. And even if I wanted to process the frame in this way, the whole process would be much more complicated and much more difficult to implement than generating pixel by pixel separately.

The initial idea is to use a thread pool where each thread handles one ray and uses it to generate the target color of only one pixel. When the thread is done, it gets another pixel to generate — all the way to the end of the frame. After all, the buffer is streamed to the SDL texture — this is the only (and in my case very convenient, by the way) solution, as SDL does not support multi-threaded rendering.



What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).

We do not know what optimizations Martin used, although I assume he was the default. Therefore, both your laptop and my (8-year-old Lenovo X201 Tablet) give better performance results. But that's not important — the important thing is that manual code alignment doesn't give us any profit with strong optimizations used (or at least not always). I will have to be more interested in this topic and just check with the right code what the performance will look like with and without code alignment.
Lazarus 3.2 with FPC 3.2.2, Windows 10 — all 64-bit

Working solo on an acrade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL. Release planned in 2026.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9855
  • Debugger - SynEdit - and more
    • wiki
Re: Useful optimizations for a video game project
« Reply #20 on: June 24, 2022, 11:19:16 am »
What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).
It seems, while I did -O3 (which afaik includes -Or), I also left other stuff at defaults. Mainly -Criot - that takes time.

About the speed diff => I think the presence of asm code can affect the optimizer.
So that example did not (fully) show my point.

Actually, in my original example ignoring the first (non-asm) routine, I got 2 diff timings in routines with diff alignment.
Removing -Criot, I no longer get that diff => the code is maybe to simple for the cpu.

But (in the mail thread that I linked), I did have an example. At that time, I also found documentation that mentioned the alignment effect.

Paul_

  • Full Member
  • ***
  • Posts: 143
Re: Useful optimizations for a video game project
« Reply #21 on: August 02, 2022, 05:03:46 pm »
Just wondering what type of game it is?

furious programming

  • Hero Member
  • *****
  • Posts: 858
Re: Useful optimizations for a video game project
« Reply #22 on: August 02, 2022, 11:04:52 pm »
Just wondering what type of game it is?

It will be an action/adventure game, with mechanics and projection similar to The Legend of Zelda: A Link to the Past (1991, SNES), but much more extensive, with much nicer graphics (using low-resolution pixelart and special filters) and with couch co-op mode. PCs are thousands of times more powerful than the SNES, so there are practically no limits and I can extend it as much as I want.

I am currently working on the foundations of the game, i.e. window programming and video modes, and an advanced input mapping. Then I will take care of fonts and create controls for the UI of the game (something like mini-LCL). And then I will take care of the engine, that is, in 2-3 months. I hope to have a working prototype of the engine by the end of this year.
« Last Edit: August 02, 2022, 11:08:04 pm by furious programming »
Lazarus 3.2 with FPC 3.2.2, Windows 10 — all 64-bit

Working solo on an acrade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL. Release planned in 2026.

 

TinyPortal © 2005-2018