### Bookstore

 Computer Math and Games in Pascal (preview) Lazarus Handbook

### Author Topic: Useful optimizations for a video game project  (Read 2101 times)

#### Martin_fr

• Hero Member
• Posts: 8483
• Debugger - SynEdit - and more
##### Re: Useful optimizations for a video game project
« Reply #15 on: June 23, 2022, 06:34:08 pm »
So here is an example for the 32 byte alignment

"foo" has whatever alignment it gets by surrounding code. Also, its loop is offset by the code in front of it.
It takes 4000 ms (on my PC:  I7-8700)

Then the loop at exactly 32 byte aligned: 3640 ms (almost 10% faster)
The loop with an offset of 32+8 also is fast => so relevant code inside the loop must have just hit the right alignment.

The loop that is intentionally 32+16 takes 4000.

So (on modern CPU), just adding the right align can make a noticeable diff.

And since functions are aligned at 16 bytes, it depends on where the previous function ended. And be sometime fast, and sometime not.
Which also means, if you benchmark, and you change code in one place, then code in another place may be re-aligned, and be faster or slower. Your total benchmark then may change more by the accidental align change, than by the change you tried to measure.

See https://lists.freepascal.org/pipermail/fpc-devel/2022-January/044336.html
Includes a very interesting video presentation on the topic

Code: Text  [Select][+][-]
1. 4000
2. 4016
3.
4. 3640
5. 3625
6.
7. 3610
8. 3625
9.
10. 4015
11. 4016
12.

Code: Pascal  [Select][+][-]
1. program Project1;
2.
3. {\$mode objfpc}{\$H+}
4.
5. uses
6.   {\$IFDEF UNIX}
8.   {\$ENDIF}
9.   Classes, SysUtils
10.   { you can add units after this };
11.
12. {\$R *.res}
13.
14. const
15.   N = 150*1024*1024;
16. var
17.   a, b, c: array of byte;
18.
19. procedure foo;
20. var
21.   i: Integer;
22. begin
23.   c[0] := (a[0] + b[0]) div 2;
24.
25.   for i := 1 to N-1 do begin
26.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
27.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
28.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
29.   end;
30. end;
31.
32. procedure foo2;
33. var
34.   i: Integer;
35. begin
36.   c[0] := (a[0] + b[0]) div 2;
37.
38.   asm
39.   .align 32
40.   end;
41.
42.   for i := 1 to N-1 do begin
43.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
44.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
45.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
46.   end;
47. end;
48.
49. procedure foo3;
50. var
51.   i: Integer;
52. begin
53.   c[0] := (a[0] + b[0]) div 2;
54.
55.   asm
56.   .align 32
57.   nop
58.   nop
59.   nop
60.   nop
61.   nop
62.   nop
63.   nop
64.   nop
65.   end;
66.
67.   for i := 1 to N-1 do begin
68.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
69.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
70.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
71.   end;
72. end;
73.
74. procedure foo4;
75. var
76.   i: Integer;
77. begin
78.   c[0] := (a[0] + b[0]) div 2;
79.
80.   asm
81.   .align 32
82.   nop
83.   nop
84.   nop
85.   nop
86.   nop
87.   nop
88.   nop
89.   nop
90.   nop
91.   nop
92.   nop
93.   nop
94.   nop
95.   nop
96.   nop
97.   nop
98.   end;
99.
100.   for i := 1 to N-1 do begin
101.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
102.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
103.     c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
104.   end;
105. end;
106.
107. var
108.   t: QWord;
109.   i: Integer;
110. begin
111.   SetLength(a, N);
112.   SetLength(b, N);
113.   SetLength(c, N);
114.   for i := 0 to N-1 do begin
115.     a[i] := Random(255);
116.     b[i] := Random(255);
117.   end;
118.
119.
120.   t := GetTickCount64;
121.   foo;
122.   t := GetTickCount64 -t;
123.   writeln(t);
124.
125.   t := GetTickCount64;
126.   foo;
127.   t := GetTickCount64 -t;
128.   writeln(t);
129.
130.
131.   t := GetTickCount64;
132.   foo2;
133.   t := GetTickCount64 -t;
134.   writeln(t);
135.
136.   t := GetTickCount64;
137.   foo2;
138.   t := GetTickCount64 -t;
139.   writeln(t);
140.
141.
142.   t := GetTickCount64;
143.   foo3;
144.   t := GetTickCount64 -t;
145.   writeln(t);
146.
147.   t := GetTickCount64;
148.   foo3;
149.   t := GetTickCount64 -t;
150.   writeln(t);
151.
152.
153.   t := GetTickCount64;
154.   foo4;
155.   t := GetTickCount64 -t;
156.   writeln(t);
157.
158.   t := GetTickCount64;
159.   foo4;
160.   t := GetTickCount64 -t;
161.   writeln(t);
162.
163.
165. end.
166.

#### furious programming

• Hero Member
• Posts: 620
##### Re: Useful optimizations for a video game project
« Reply #16 on: June 23, 2022, 09:52:18 pm »
Thank you very much for the example. I will definitely check this trick in the future.

But I just tested your test program on my Intel® Core™ i7-640LM (which is quite old) and I can't reproduce your results. Aligned code is slightly faster in the debug build mode (generated in the project options window), below are the results:

Code: Pascal  [Select][+][-]
1. 9203
2. 9125
3. 8860
4. 9031
5. 8969
6. 9000
7. 9312
8. 9250

but in the release mode (also generated by the Lazarus), there is no gain — aligned code is actually slower than not aligned:

Code: Pascal  [Select][+][-]
1. 1671
2. 1657
3. 1906
4. 1906
5. 1891
6. 1906
7. 1891
8. 1890

It looks like the optimizations itself are giving the best performance in this case.
« Last Edit: June 23, 2022, 10:09:33 pm by furious programming »
Lazarus 2.2.4 with FPC 3.2.2 (2022-05-15), Windows 10 — all 64-bit

Working solo on my own acrade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL2. Release expected in 2025.

#### PascalDragon

• Hero Member
• Posts: 4765
• Compiler Developer
##### Re: Useful optimizations for a video game project
« Reply #17 on: June 24, 2022, 09:04:09 am »
Quote
I would assume that you'll need some thight kernel here programmed in asm which fully uses AVX2 / AVX512 (or comparable) capabilities to get somewhere.

I can always use calculations only on integers (as in the good old days), because high precision of calculations will not be required — after all, the image will be highly pixelated. But there will be time for that.

SIMD instruction sets are not restricted to floating point values, but can be used with integers as well. Thus if you have multiple, equivalent integer operations that can be done in parallel (e.g. adding a vector) you can utilize SIMD.

#### BrunoK

• Sr. Member
• Posts: 364
• Retired programmer
##### Re: Useful optimizations for a video game project
« Reply #18 on: June 24, 2022, 09:58:38 am »
Thank you very much for the example. I will definitely check this trick in the future.

But I just tested your test program on my Intel® Core™ i7-640LM (which is quite old) and I can't reproduce your results.
I can't either reproduce the results.

11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz   2.42 GHz (Laptop) :
Code: Pascal  [Select][+][-]
1. C:\fpc-laz\fpc\3.2.2-git\bin\i386-win32\ppc386.exe
2. -MObjFPC
3. -Scghi
4. -O1
5. -gw2
6. -godwarfsets
7. -gl
8. -l
9. -vewnhibq
10. -Filib\i386-win32
11. -Fu.
12. -FUlib\i386-win32
13. -FE.
14. -opgmSpeedTest.exe
15. -OoREGVAR
Compiling -O1 -OoREGVAR gives very satisfactory speed and reasonable debugging.
Times for I386 and x86_64 are very similar.

Timings win10 FPC 3.2.2 i386 :
719
734
1063
1062
1079
1062
1078
1063
Trying to align code seems to be counterproductive for -O1 -OoREGVAR

What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).

#### furious programming

• Hero Member
• Posts: 620
##### Re: Useful optimizations for a video game project
« Reply #19 on: June 24, 2022, 10:46:50 am »
Thus if you have multiple, equivalent integer operations that can be done in parallel (e.g. adding a vector) you can utilize SIMD.

This is the reason why the use of SIMD will not be possible — there will not be many same operations to be performed in parallel. And even if I wanted to process the frame in this way, the whole process would be much more complicated and much more difficult to implement than generating pixel by pixel separately.

The initial idea is to use a thread pool where each thread handles one ray and uses it to generate the target color of only one pixel. When the thread is done, it gets another pixel to generate — all the way to the end of the frame. After all, the buffer is streamed to the SDL texture — this is the only (and in my case very convenient, by the way) solution, as SDL does not support multi-threaded rendering.

What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).

We do not know what optimizations Martin used, although I assume he was the default. Therefore, both your laptop and my (8-year-old Lenovo X201 Tablet) give better performance results. But that's not important — the important thing is that manual code alignment doesn't give us any profit with strong optimizations used (or at least not always). I will have to be more interested in this topic and just check with the right code what the performance will look like with and without code alignment.
Lazarus 2.2.4 with FPC 3.2.2 (2022-05-15), Windows 10 — all 64-bit

Working solo on my own acrade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL2. Release expected in 2025.

#### Martin_fr

• Hero Member
• Posts: 8483
• Debugger - SynEdit - and more
##### Re: Useful optimizations for a video game project
« Reply #20 on: June 24, 2022, 11:19:16 am »
What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).
It seems, while I did -O3 (which afaik includes -Or), I also left other stuff at defaults. Mainly -Criot - that takes time.

About the speed diff => I think the presence of asm code can affect the optimizer.
So that example did not (fully) show my point.

Actually, in my original example ignoring the first (non-asm) routine, I got 2 diff timings in routines with diff alignment.
Removing -Criot, I no longer get that diff => the code is maybe to simple for the cpu.

But (in the mail thread that I linked), I did have an example. At that time, I also found documentation that mentioned the alignment effect.

#### Paul_

• Full Member
• Posts: 141
##### Re: Useful optimizations for a video game project
« Reply #21 on: August 02, 2022, 05:03:46 pm »
Just wondering what type of game it is?

#### furious programming

• Hero Member
• Posts: 620
##### Re: Useful optimizations for a video game project
« Reply #22 on: August 02, 2022, 11:04:52 pm »
Just wondering what type of game it is?

It will be an action/adventure game, with mechanics and projection similar to The Legend of Zelda: A Link to the Past (1991, SNES), but much more extensive, with much nicer graphics (using low-resolution pixelart and special filters) and with couch co-op mode. PCs are thousands of times more powerful than the SNES, so there are practically no limits and I can extend it as much as I want.

I am currently working on the foundations of the game, i.e. window programming and video modes, and an advanced input mapping. Then I will take care of fonts and create controls for the UI of the game (something like mini-LCL). And then I will take care of the engine, that is, in 2-3 months. I hope to have a working prototype of the engine by the end of this year.
« Last Edit: August 02, 2022, 11:08:04 pm by furious programming »
Lazarus 2.2.4 with FPC 3.2.2 (2022-05-15), Windows 10 — all 64-bit

Working solo on my own acrade, action/adventure game in retro style (pixelart), programming the engine and shell from scratch, using Free Pascal and SDL2. Release expected in 2025.