Lazarus

Title: Useful optimizations for a video game project
Post by: furious programming on June 22, 2022, 05:41:18 pm
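I'm working on an (ultimately) large video game project. I have defined different optimization settings for each build mode, and in release mode I would like to enable whatever makes the generated machine code run as fast as possible (executable size and memory consumption do not matter). Some information about the project:
  • the source code is procedural, using only simple records,
  • data is passed using pointers (same as in SDL API),
  • memory is allocated and deallocated manually (GetMem, AllocMem, ReallocMem and FreeMem are used everywhere),
  • data is encapsulated in records, and access to them is only possible through getters and setters (as global functions), so I often use inline.

I am currently using strong optimizations for the release mode (level 3):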

Code: Pascal
  {$IFDEF GAME_BUILD_DEBUG}
    // debug mode settings (not important)
  {$ELSE}
    {$INLINE       ON}
    {$SMARTLINK    ON}
    {$OPTIMIZATION LEVEL3}

    {$S-}
    {$IOCHECKS       OFF}
    {$RANGECHECKS    OFF}
    {$OVERFLOWCHECKS OFF}
    {$ASSERTIONS     OFF}
    {$OBJECTCHECKS   OFF}
  {$ENDIF}

I know that in addition to the above, there are many other optimizations that can be enabled in the code, and some of them may be useful in my case, given the specifics of the project. Does anyone have any idea what else can be enabled to make the resulting machine code faster?

Note that I am asking generally (ahead of time), not because I currently have slow code and with compiler optimizations I want to speed it up. If anyone has additional questions, feel free to ask.
Title: Re: Useful optimizations for a video game project
Post by: Martin_fr on June 22, 2022, 07:00:25 pm
First of all, the most optimization potential lies in clever design of your code....

But, if you know there are no exceptions, and if you do use managed types (AnsiString / dyn array)
Code: Text
  {$ImplicitExceptions off}

But, if an exception occurs this will leak memory. And as memory gets eaten up, side effects will get noticeable.

Avoid managed types, if you don't need them.



Choose a modern cpu type (CoreAvx or CoreI) if you don't need to support older cpu. (project options / target)



If you have a tight loop, in the middle of a long(er) routine, move the loop into a sub-routine (inlined) and call it.

At least in the past, this has sometimes helped the optimizer to do a better job with register allocations.




The classic: Move calculation out of the loop. Pre-calculate partial expressions, and only keep parts in the loop that depend on the loop counter.

For "SomeFoo[LoopCounter]" => use a pointer, and increase the pointer to the next item.
(though that one is in some cases redundant on modern cpu)
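
A minimal sketch of the two loop tips above, assuming a plain static array of LongInt. SumIndexed, SumPointer, Data and the 288 and 240 arguments are made-up names and values for illustration only. Whether the pointer version is actually faster depends on the compiler version and the CPU, so measure before adopting it.

```pascal
program LoopSketch;

{$mode objfpc}

const
  Count = 1000;

var
  Data: array[0..Count - 1] of LongInt;

// Indexed version: the Width * Height product sits inside the loop body.
function SumIndexed(Width, Height: LongInt): Int64;
var
  i: LongInt;
begin
  Result := 0;
  for i := 0 to Count - 1 do
    Result := Result + Int64(Data[i]) * (Width * Height);
end;

// Hoisted version: the loop-invariant product is computed once, and the
// array is walked with a pointer instead of an index.
function SumPointer(Width, Height: LongInt): Int64;
var
  Scale: Int64;
  Item, Last: PLongInt;
begin
  Result := 0;
  Scale := Int64(Width) * Height;
  Item := @Data[0];
  Last := @Data[Count - 1];
  while Item <= Last do
  begin
    Result := Result + Item^ * Scale;
    Inc(Item);  // advances by SizeOf(LongInt)
  end;
end;

var
  i: LongInt;
begin
  for i := 0 to Count - 1 do
    Data[i] := i;

  // Both versions must compute the same value.
  if SumIndexed(288, 240) <> SumPointer(288, 240) then
  begin
    WriteLn('mismatch');
    Halt(1);
  end;
  WriteLn('both versions agree: ', SumPointer(288, 240));
end.
```

An optimizing build may already hoist the product on its own, which is one reason the manual version does not always win.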




The very tricky bit: if you can align the start of small, but high-iteration-count loops to a 32 byte boundary => that can gain/lose a 2 digit percentage in speed.

Unfortunately, even functions are only aligned to 16 bytes.

I have myself benchmarked code in the past. And just by changing the order in which procedures were declared (no other change), the speed varied by almost 20% to 30%.

At least on Intel, because Intel has some caches (IIRC for decoded micro-ops) that rely on 32 byte bounds.
So if you iterate some 1000 times over a loop, and if that loop has 32 or 64 bytes of code, then it runs fastest if it starts exactly on a 32 byte bound.

Unfortunately there is no option to enforce this. Maybe it can be done with asm blocks.
Title: Re: Useful optimizations for a video game project
Post by: 440bx on June 22, 2022, 09:20:14 pm
As Martin_fr stated, most really significant speed gains come from the design of the code.

You mentioned using GetMem, FreeMem and other heap functions to manage memory.  Very significant gains can be had by doing your own memory management. 

This means allocating your own blocks based on the usage of the data and carving the block yourself.  With the proper design, it is often possible to remove the need for critical sections (or other synch method) to allocate and deallocate memory blocks. 

Depending on how often the program needs to allocate and free memory blocks, doing your own management can make a very noticeable difference but, that requires memory allocation/deallocation design upfront designed specifically to accommodate the application's needs throughout its execution.

Depending on how you implement it, there can be another very significant advantage.  If every block to be carved is requested directly from the O/S (instead of a heap) then, an external memory viewer can be used to inspect the blocks.  When debugging, this makes memory inspection independent of the current instruction, i.e, the pointers to blocks don't have to be in the current scope, once you know the address, they can always be inspected using an external memory viewer.
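
As a rough illustration of "allocating your own blocks and carving the block yourself", here is a minimal bump-allocator sketch. TArena, ArenaInit, ArenaAlloc, ArenaReset and ArenaDone are hypothetical names, not part of any library, and the big block comes from GetMem here rather than directly from the O/S, which is a simplification of the idea described above.

```pascal
program ArenaSketch;

{$mode objfpc}

type
  // Hypothetical arena record; all names are illustrative only.
  TArena = record
    Base: PByte;    // start of the single large block
    Size: PtrUInt;  // total bytes in the block
    Used: PtrUInt;  // bytes handed out so far
  end;

procedure ArenaInit(out Arena: TArena; Size: PtrUInt);
begin
  Arena.Base := GetMem(Size);
  Arena.Size := Size;
  Arena.Used := 0;
end;

function ArenaAlloc(var Arena: TArena; Bytes: PtrUInt): Pointer;
begin
  // Round the request up so every offset stays a multiple of 16.
  Bytes := (Bytes + 15) and not PtrUInt(15);
  if Bytes > Arena.Size - Arena.Used then
    Exit(nil);  // out of space: reported like an error code, no exception
  Result := Arena.Base + Arena.Used;
  Inc(Arena.Used, Bytes);
end;

procedure ArenaReset(var Arena: TArena);
begin
  Arena.Used := 0;  // releases every carved block at once
end;

procedure ArenaDone(var Arena: TArena);
begin
  FreeMem(Arena.Base);
  Arena.Base := nil;
  Arena.Size := 0;
  Arena.Used := 0;
end;

var
  Arena: TArena;
  A, B: Pointer;
begin
  ArenaInit(Arena, 1024);
  A := ArenaAlloc(Arena, 24);   // rounded up to 32 bytes
  B := ArenaAlloc(Arena, 100);  // rounded up to 112 bytes
  if (A = nil) or (B = nil) or (PtrUInt(B) - PtrUInt(A) <> 32) then
    Halt(1);
  if ArenaAlloc(Arena, 2000) <> nil then  // more than what is left
    Halt(1);
  ArenaReset(Arena);
  WriteLn('bytes in use after reset: ', Arena.Used);
  ArenaDone(Arena);
end.
```

If each thread owns its own arena, no locking is needed around ArenaAlloc, which is the point about avoiding critical sections.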

HTH.
Title: Re: Useful optimizations for a video game project
Post by: furious programming on June 22, 2022, 10:24:07 pm
Thanks for the answers and advice.

Quote
But, if you know there are no exceptions, and if you do use managed types (AnsiString / dyn array)

I do not need exceptions at all — everything will be handled by error codes (like in SDL or pure WinAPI), so it would be good to remove everything related to exceptions from the code. There are appropriate tools (such as debug mode or HeapTrc unit) to check the correct operation of the code, and the release should be as effective as possible, without unnecessary instructions and additional checks.

Quote
But, if an exception occurs this will leak memory. And as memory gets eaten up, side effects will get noticeable.

This is interesting. Ideally, an exception should not be created at all. I wonder what happens if a program tries to perform an illegal operation, for example accessing memory via a nil pointer or dividing by 0. Can this be set up so that the operation does not cause any error and does not change the flow of control?

Quote
Avoid managed types, if you don't need them.

Theoretically, I might not use them, but I have a problem with strings. While SDL uses C-style strings (PAnsiChar only), it becomes a bit of a problem to use them — especially when it comes to concatenating and converting them. There are few built-in functions to support them, and virtually none to convert.

Quote
Choose a modern cpu type (CoreAvx or CoreI) if you don't need to support older cpu. (project options / target)

I don't have these CPU types in the project settings window. Initially, I only care about modern, 64-bit processors. Older processors (including 32-bit) will certainly not be supported.

Quote
For "SomeFoo[LoopCounter]" => use a pointer, and increase the pointer to the next item.
(though that one is in some cases redundant on modern cpu)

I will be running a lot of tests and choosing the best solutions. Iterated pointer access is actually faster than indexed access — I tested it some time ago. I learned a lot of interesting things after watching the video "How I program C" (https://www.youtube.com/watch?v=443UNeGrFoM) by Eskil Steenberg. I highly recommend listening — it doesn't matter it's C, because in Pascal we have the same.



Quote
You mentioned using GetMem, FreeMem and other heap functions to manage memory.  Very significant gains can be had by doing your own memory management.

I believe, but I'd rather focus on writing the right code for the project and not go that low. I don't know if something like this will be needed in my case.
Title: Re: Useful optimizations for a video game project
Post by: Martin_fr on June 22, 2022, 11:54:22 pm
Quote
This is interesting. Ideally, an exception should not be created at all. I wonder what happens if a program tries to perform an illegal operation, for example accessing memory via a nil pointer or dividing by 0. Can this be set up so that the operation does not cause any error and does not change the flow of control?

"div 0" can be caught as exception. And I think even some access violations can, but not sure.
But in any case, I usually take care not to have those, rather than worry about whether my code would still work if I had them.


If you don't use "raise" and don't have any "try except" blocks (and neither does any code that you use from frameworks etc.), then "{$ImplicitExceptions off}" should be ok.

Code: Pascal
  procedure foo;
  var s: ansistring;
  begin
    s:= getVal;  // getVal: any function returning a string
    // do some stuff
  end;

Fpc will insert code at the end of that procedure to do "s := ''" => i.e. decrease the ref-count, and free the mem of the string, if it is not held by other variables.

That will always happen.

But Fpc also encapsulates the entire procedure into an "try finally" block, to make sure "s" is freed, even if an exception occurred.
And with the "{$ImplicitExceptions off}" the "try finally" is not inserted.
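
To make that concrete, here is a conceptual sketch of the difference. FooExpanded spells out by hand roughly what the compiler arranges for Foo with default settings. It is not the real generated code, and GetVal, Foo and FooExpanded are placeholder names.

```pascal
program ImplicitFinallySketch;

{$mode objfpc}{$H+}

function GetVal: AnsiString;
begin
  Result := 'value';
end;

// What you write.
procedure Foo;
var
  s: AnsiString;
begin
  s := GetVal;
  // do some stuff
end;

// Roughly what happens behind the scenes by default: the body is wrapped
// so that s is released even when an exception passes through.
// With implicit exceptions switched off, only the final release remains
// and the try/finally wrapper is not generated.
procedure FooExpanded;
var
  s: AnsiString;
begin
  s := '';
  try
    s := GetVal;
    // do some stuff
  finally
    s := '';  // drop the reference; frees the string if nobody else holds it
  end;
end;

begin
  Foo;
  FooExpanded;
  WriteLn('done');
end.
```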



Title: Re: Useful optimizations for a video game project
Post by: Martin_fr on June 23, 2022, 12:15:17 am
Quote
I don't have these CPU types in the project settings window. Initially, I only care about modern, 64-bit processors. Older processors (including 32-bit) will certainly not be supported.

My IDE shows them... But anyway. From
https://www.freepascal.org/docs-html/user/userap1.html
Quote
-Cp<x>     Select instruction set; see fpc -i or fpc -ic for possible values

Fpc 64 bit 3.2.0 to 3.3.1 on Windows all show
Code: Text
  ATHLON64
  COREI
  COREAVX
  COREAVX2

Afaik you can also enable different avx, and with that maybe get some of the extra registers used (though I am not sure....)



Do you follow the fpc mail list? Just in case, there have been various optimizer improvements in fpc 3.3.1

Title: Re: Useful optimizations for a video game project
Post by: 440bx on June 23, 2022, 02:24:19 am
Quote
I learned a lot of interesting things after watching the video "How I program C" (https://www.youtube.com/watch?v=443UNeGrFoM) by Eskil Steenberg. I highly recommend listening — it doesn't matter it's C, because in Pascal we have the same.
I watched a little over an hour of it and will watch the rest later (I don't have the time right now) but, I agree with you.  He really gives a lot of good advice (advice I've been giving for a very long time!... and nobody listens...  :D)

Quite a few times during the video, I thought, this guy should program in Pascal, he'd realize what a sh*tty language C is but, aside from his very poor choice of a programming language, he cares about making his programs as consistent, easy to understand and maintainable as possible (he is the unicorn of C programmers!)

Thank you for the link.


Title: Re: Useful optimizations for a video game project
Post by: Thaddy on June 23, 2022, 08:01:16 am
Using a couple of WPO cycles can speed up code too. And that is also true for procedural programming, although less so.
Title: Re: Useful optimizations for a video game project
Post by: PascalDragon on June 23, 2022, 08:55:19 am
Code: Pascal
  {$OPTIMIZATION LEVEL3}

Please note that this is different from passing -O3 on the command line. It's essentially equivalent to -Oolevel3. It's currently not possible to enable all optimizations that are part of a specific level in code. You need to enable each optimization you want by hand.
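
A hedged sketch of what "enabling each optimization by hand" could look like in the release branch. The switch names below (REGVAR, PEEPHOLE, CSE, DFA) are taken from the $OPTIMIZATION documentation as far as I know it, but which of them exist depends on the compiler version and target CPU, and this list is not claimed to match what -O3 turns on. Check it against the output of fpc -i first.

```pascal
{$IFNDEF GAME_BUILD_DEBUG}
  // Assumption: these names are accepted by the compiler in use;
  // verify them against the optimization list printed by "fpc -i".
  {$OPTIMIZATION LEVEL3}
  {$OPTIMIZATION REGVAR,PEEPHOLE,CSE,DFA}
{$ENDIF}
```

Passing -O3 on the command line (or in the project options) remains the simpler way to get the full level.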

Quote
But, if an exception occurs this will leak memory. And as memory gets eaten up, side effects will get noticeable.

This is interesting. Ideally, an exception should not be created at all. I wonder what happens if a program tries to perform an illegal operation, for example accessing memory via a nil pointer or dividing by 0. Can this be set up so that the operation does not cause any error and does not change the flow of control?

Not trivially, no. You'd need to hook yourself into the RTL's exception handling. Search for ErrorProc and ExceptProc if you want to go down this rabbit hole (but note that there'll always be an exception triggered by the processor and handled by the OS, the only part you can influence is how it's handled inside your program).
Title: Re: Useful optimizations for a video game project
Post by: MathMan on June 23, 2022, 11:43:26 am
Quote
I'm working on a (ultimately) large video game project. ...
  • the source code is procedural, using only simple records,
  • data is passed using pointers (same as in SDL API),
  • memory is allocated and deallocated manually (GetMem, AllocMem, ReallocMem and FreeMem are used everywhere),
  • data is encapsulated in records, and access to them is only possible through getters and setters (as global functions), so I often use inline.

I personally would also spend some/substantial time on looking at point [iii] from your list above. If possible try to isolate cases where you can get away with allocating one large block of mem initially and then organize this internally by passing pointers (plus the required pointer math) when calling sub-functions etc.

I know that this is tedious / tricky work, but from my own experience it is worthwhile.

Cheers,
MathMan
Title: Re: Useful optimizations for a video game project
Post by: furious programming on June 23, 2022, 12:56:06 pm
Quote
Fpc 64 bit 3.2.0 to 3.3.1 on Windows all show
Code: Text
  ATHLON64
  COREI
  COREAVX
  COREAVX2

I'm using stable version of Lazarus and FPC (2.2.0 and 3.2.2 respectively) and I have only ATHLON64 in this combobox.

Quote
Do you follow the fpc mail list? Just in case, there have been various optimizer improvements in fpc 3.3.1

No, I'm not following. But there is also no need to rush, because my project will be developed for a few more years (3-4), so I will update Lazarus and FPC more than once. For now, I also don't have much code to optimize — I just ask in advance.



Quote
Using a couple of WPO cycles can speed up code too. And that is also true for procedural programming, although less so.

Can you write something more about it?



Quote
Please note that this is different from passing -O3 on the command line. It's essentially equivalent to -Oolevel3. It's currently not possible to enable all optimizations that are part of a specific level in code. You need to enable each optimization you want by hand.

Interesting. From what I can see on the $OPTIMIZATION (https://www.freepascal.org/docs-html/prog/progsu58.html) document, there is no information on this. What exactly do I have to do, which optimizations to declare additionally to get optimizations compatible with -O3? And besides, which ones to be interested in?

Quote
(but note that there'll always be an exception triggered by the processor and handled by the OS, the only part you can influence is how it's handled inside your program).

This is what I wanted to know. Thanks for the clarification.

I will test the source code thoroughly anyway, so that there are no unexpected exceptions and memory leaks caused by incorrectly written code. However, if in any case that I did not catch, exceptions were to occur, it would be better in the release version to use incorrect data (e.g. causing glitches) than for control flow to become unpredictable or for the process to be killed.



Quote
I personally would also spend some/substantial time on looking at point [iii] from your list above. If possible try to isolate cases where you can get away with allocating one large block of mem initially and then organize this internally by passing pointers (plus the required pointer math) when calling sub-functions etc.

If necessary, I will definitely try to limit the dynamic allocation and deallocation of memory as much as possible.

However, at the moment I don't think memory operations are going to be a bottleneck. Especially that the game engine will ultimately preload as much data as possible into the memory so that everything is available during the game's operation. Mainly I mean map and object data, fully represented by Octree, which shouldn't require more than 1GB of memory. Operations on Octree shouldn't be too problematic either.

A performance problem will definitely arise in the case of rendering, because I want to use a multi-threaded purely software raytracing (for a very low resolution frame), where any saving of cycles will have a significant impact on performance. But rendering programming is still a long way off.
Title: Re: Useful optimizations for a video game project
Post by: PascalDragon on June 23, 2022, 01:57:05 pm
Quote
Please note that this is different from passing -O3 on the command line. It's essentially equivalent to -Oolevel3. It's currently not possible to enable all optimizations that are part of a specific level in code. You need to enable each optimization you want by hand.

Quote
Interesting. From what I can see on the $OPTIMIZATION (https://www.freepascal.org/docs-html/prog/progsu58.html) document, there is no information on this. What exactly do I have to do, which optimizations to declare additionally to get optimizations compatible with -O3? And besides, which ones to be interested in?

You simply list the desired optimizations as mentioned in the documentation you linked to. A list of supported optimizations is available when you do fpc -i. Please note that this list is specific to each CPU architecture.

Quote
I will test the source code thoroughly anyway, so that there are no unexpected exceptions and memory leaks caused by incorrectly written code. However, if in any case that I did not catch, exceptions were to occur, it would be better in the release version to use incorrect data (e.g. causing glitches) than for control flow to become unpredictable or for the process to be killed.

I personally prefer to kill the program than have it continue with incorrect data (that's why, if no SysUtils unit is used, the default is to simply terminate the application if an error occurred).
Title: Re: Useful optimizations for a video game project
Post by: MathMan on June 23, 2022, 02:58:56 pm
Quote
I personally would also spend some/substantial time on looking at point [iii] from your list above. If possible try to isolate cases where you can get away with allocating one large block of mem initially and then organize this internally by passing pointers (plus the required pointer math) when calling sub-functions etc.

Quote
If necessary, I will definitely try to limit the dynamic allocation and deallocation of memory as much as possible.

However, at the moment I don't think memory operations are going to be a bottleneck. Especially that the game engine will ultimately preload as much data as possible into the memory so that everything is available during the game's operation. Mainly I mean map and object data, fully represented by Octree, which shouldn't require more than 1GB of memory. Operations on Octree shouldn't be too problematic either.

A performance problem will definitely arise in the case of rendering, because I want to use a multi-threaded purely software raytracing (for a very low resolution frame), where any saving of cycles will have a significant impact on performance. But rendering programming is still a long way off.

I can only state that I had some nasty surprises wrt dynamic memory allocation in some recursive procedures. Of course this was special as the allocations became smaller & smaller with each recursion level, but I was only able to get this to decent speeds after I completely removed allocs from the recursive procedure.

Regarding ray-tracing - that really depends on what you define as "very low resolution". I would assume that you'll need some tight kernel here programmed in asm which fully uses AVX2 / AVX512 (or comparable) capabilities to get somewhere.

Cheers,
MathMan
Title: Re: Useful optimizations for a video game project
Post by: SymbolicFrank on June 23, 2022, 04:46:54 pm
Optimize your time budget and don't try to optimize the 90% of your code that is not time-critical.
Title: Re: Useful optimizations for a video game project
Post by: furious programming on June 23, 2022, 05:32:52 pm
Quote
You simply list the desired optimizations as mentioned in the documentation you linked to. A list of supported optimizations is available when you do fpc -i. Please note that this list is specific to each CPU architecture.

I checked this option and there are many features available, thanks. It is interesting that there are more instruction sets available:

Code: Pascal
  Supported CPU instruction sets:
    ATHLON64,COREI,COREAVX,COREAVX2

but in the project settings window I only have ATHLON64. Weird.

Quote
I personally prefer to kill the program than have it continue with incorrect data (that's why, if no SysUtils unit is used, the default is to simply terminate the application if an error occurred).

For now, I do not anticipate any unexpected errors, I will try to write the code so that it does not cause exceptions. If so, the most sensible solution will be selected.



Quote
Regarding ray-tracing - that really depends on what you define as "very low resolution".

The internal back buffer will have a resolution of 288×240 pixels, which will require a color calculation for 69,120 pixels in each game frame (using as many threads as there are logical processors). The game will ultimately use pixelart graphics, i.e. it will be in a retro style. For now, it's hard to say if I can achieve enough performance to take advantage of software ray-tracing on low-end PCs (which I care about), so standard rasterization will be the default rendering method (faster but poorer).

Quote
I would assume that you'll need some tight kernel here programmed in asm which fully uses AVX2 / AVX512 (or comparable) capabilities to get somewhere.

I can always use calculations only on integers (as in the good old days), because high precision of calculations will not be required — after all, the image will be highly pixelated. But there will be time for that.



Quote
Optimize your time budget and don't try to optimize the 90% of your code that is not time-critical.

Good point. I don't have a code like this yet, but the sooner I find out about the possibilities, the more time I will save in the future. Thanks for the answers.
Title: Re: Useful optimizations for a video game project
Post by: Martin_fr on June 23, 2022, 06:34:08 pm
So here is an example for the 32 byte alignment

"foo" has whatever alignment it gets by surrounding code. Also, its loop is offset by the code in front of it.
It takes 4000 ms (on my PC:  I7-8700)

Then the loop at exactly 32 byte aligned: 3640 ms (almost 10% faster)
The loop with an offset of 32+8 also is fast => so relevant code inside the loop must have just hit the right alignment.

The loop that is intentionally 32+16 takes 4000.

So (on modern CPU), just adding the right align can make a noticeable diff.

And since functions are aligned at 16 bytes, it depends on where the previous function ended, so a loop may sometimes be fast and sometimes not.
Which also means, if you benchmark, and you change code in one place, then code in another place may be re-aligned, and be faster or slower. Your total benchmark then may change more by the accidental align change, than by the change you tried to measure.

See https://lists.freepascal.org/pipermail/fpc-devel/2022-January/044336.html
Includes a very interesting video presentation on the topic


Code: Text
  4000
  4016

  3640
  3625

  3610
  3625

  4015
  4016

Code: Pascal
  program Project1;

  {$mode objfpc}{$H+}

  uses
    {$IFDEF UNIX}
    cthreads,
    {$ENDIF}
    Classes, SysUtils
    { you can add units after this };

  {$R *.res}

  const
    N = 150*1024*1024;
  var
    a, b, c: array of byte;

  // Loop alignment is whatever the surrounding code happens to give.
  procedure foo;
  var
    i: Integer;
  begin
    c[0] := (a[0] + b[0]) div 2;

    for i := 1 to N-1 do begin
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
    end;
  end;

  // Loop setup starts exactly on a 32 byte boundary.
  procedure foo2;
  var
    i: Integer;
  begin
    c[0] := (a[0] + b[0]) div 2;

    asm
    .align 32
    end;

    for i := 1 to N-1 do begin
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
    end;
  end;

  // 32 byte boundary plus 8 one-byte nops (offset 32+8).
  procedure foo3;
  var
    i: Integer;
  begin
    c[0] := (a[0] + b[0]) div 2;

    asm
    .align 32
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    end;

    for i := 1 to N-1 do begin
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
    end;
  end;

  // 32 byte boundary plus 16 one-byte nops (offset 32+16).
  procedure foo4;
  var
    i: Integer;
  begin
    c[0] := (a[0] + b[0]) div 2;

    asm
    .align 32
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    end;

    for i := 1 to N-1 do begin
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
      c[i] := ( (a[i] + b[i]) div 2) xor c[i-1];
    end;
  end;

  var
    t: QWord;
    i: Integer;
  begin
    SetLength(a, N);
    SetLength(b, N);
    SetLength(c, N);
    for i := 0 to N-1 do begin
      a[i] := Random(255);
      b[i] := Random(255);
    end;

    // Each variant is timed twice; the results are in milliseconds.
    t := GetTickCount64;
    foo;
    t := GetTickCount64 -t;
    writeln(t);

    t := GetTickCount64;
    foo;
    t := GetTickCount64 -t;
    writeln(t);


    t := GetTickCount64;
    foo2;
    t := GetTickCount64 -t;
    writeln(t);

    t := GetTickCount64;
    foo2;
    t := GetTickCount64 -t;
    writeln(t);


    t := GetTickCount64;
    foo3;
    t := GetTickCount64 -t;
    writeln(t);

    t := GetTickCount64;
    foo3;
    t := GetTickCount64 -t;
    writeln(t);


    t := GetTickCount64;
    foo4;
    t := GetTickCount64 -t;
    writeln(t);

    t := GetTickCount64;
    foo4;
    t := GetTickCount64 -t;
    writeln(t);


    readln;
  end.
Title: Re: Useful optimizations for a video game project
Post by: furious programming on June 23, 2022, 09:52:18 pm
Thank you very much for the example. I will definitely check this trick in the future.

But I just tested your test program on my Intel® Core™ i7-640LM (https://ark.intel.com/content/www/us/en/ark/products/43563/intel-core-i7640lm-processor-4m-cache-2-13-ghz.html) (which is quite old) and I can't reproduce your results. Aligned code is slightly faster in the debug build mode (generated in the project options window), below are the results:

Code: Pascal
  9203
  9125
  8860
  9031
  8969
  9000
  9312
  9250

but in the release mode (also generated by the Lazarus), there is no gain — aligned code is actually slower than not aligned:

Code: Pascal
  1671
  1657
  1906
  1906
  1891
  1906
  1891
  1890

It looks like the optimizations themselves are giving the best performance in this case.
Title: Re: Useful optimizations for a video game project
Post by: PascalDragon on June 24, 2022, 09:04:09 am
Quote
I would assume that you'll need some tight kernel here programmed in asm which fully uses AVX2 / AVX512 (or comparable) capabilities to get somewhere.

I can always use calculations only on integers (as in the good old days), because high precision of calculations will not be required — after all, the image will be highly pixelated. But there will be time for that.

SIMD instruction sets are not restricted to floating point values, but can be used with integers as well. Thus if you have multiple, equivalent integer operations that can be done in parallel (e.g. adding a vector) you can utilize SIMD.
Title: Re: Useful optimizations for a video game project
Post by: BrunoK on June 24, 2022, 09:58:38 am
Quote
Thank you very much for the example. I will definitely check this trick in the future.

But I just tested your test program on my Intel® Core™ i7-640LM (https://ark.intel.com/content/www/us/en/ark/products/43563/intel-core-i7640lm-processor-4m-cache-2-13-ghz.html) (which is quite old) and I can't reproduce your results.
I can't reproduce the results either.

11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz   2.42 GHz (Laptop) :
Code: Pascal
  C:\fpc-laz\fpc\3.2.2-git\bin\i386-win32\ppc386.exe
  -MObjFPC
  -Scghi
  -O1
  -gw2
  -godwarfsets
  -gl
  -l
  -vewnhibq
  -Filib\i386-win32
  -Fu.
  -FUlib\i386-win32
  -FE.
  -opgmSpeedTest.exe
  -OoREGVAR
Compiling -O1 -OoREGVAR gives very satisfactory speed and reasonable debugging.
Times for I386 and x86_64 are very similar.

Timings win10 FPC 3.2.2 i386 :
719
734
1063
1062
1079
1062
1078
1063
Trying to align code seems to be counterproductive for -O1 -OoREGVAR

What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).
Title: Re: Useful optimizations for a video game project
Post by: furious programming on June 24, 2022, 10:46:50 am
Quote
Thus if you have multiple, equivalent integer operations that can be done in parallel (e.g. adding a vector) you can utilize SIMD.

This is the reason why the use of SIMD will not be possible — there will not be many same operations to be performed in parallel. And even if I wanted to process the frame in this way, the whole process would be much more complicated and much more difficult to implement than generating pixel by pixel separately.

The initial idea is to use a thread pool where each thread handles one ray and uses it to generate the target color of only one pixel. When the thread is done, it gets another pixel to generate — all the way to the end of the frame. After all, the buffer is streamed to the SDL texture — this is the only (and in my case very convenient, by the way) solution, as SDL does not support multi-threaded rendering.
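
A minimal sketch of that work distribution, assuming a shared atomic counter hands out pixel indices to a fixed set of worker threads. Shade is only a placeholder for tracing one ray, and WorkerCount, NextPixel, Frame and Worker are made-up names. Only BeginThread, WaitForThreadTerminate and InterlockedIncrement come from the RTL.

```pascal
program PixelWorkersSketch;

{$mode objfpc}

{$IFDEF UNIX}
uses
  cthreads;
{$ENDIF}

const
  FrameWidth  = 288;
  FrameHeight = 240;
  PixelCount  = FrameWidth * FrameHeight;
  WorkerCount = 4;  // assumption; the real engine would use the CPU count

var
  Frame: array[0..PixelCount - 1] of LongWord;
  NextPixel: LongInt = -1;  // the first increment yields index 0

// Placeholder for "trace one ray and return the pixel color".
function Shade(Index: LongInt): LongWord;
begin
  Result := LongWord(Index mod FrameWidth) xor LongWord(Index div FrameWidth);
end;

// Each worker keeps fetching the next free pixel until the frame is done.
function Worker(Parameter: Pointer): PtrInt;
var
  Index: LongInt;
begin
  repeat
    Index := InterlockedIncrement(NextPixel);
    if Index >= PixelCount then
      Break;
    Frame[Index] := Shade(Index);
  until False;
  Result := 0;
end;

var
  Threads: array[0..WorkerCount - 1] of TThreadID;
  i: LongInt;
begin
  for i := 0 to WorkerCount - 1 do
    Threads[i] := BeginThread(@Worker, nil);
  for i := 0 to WorkerCount - 1 do
    WaitForThreadTerminate(Threads[i], 0);

  // Every pixel must have been written exactly as Shade computes it.
  for i := 0 to PixelCount - 1 do
    if Frame[i] <> Shade(i) then
    begin
      WriteLn('mismatch at pixel ', i);
      Halt(1);
    end;
  WriteLn('frame complete: ', PixelCount, ' pixels');
end.
```

Fetching one pixel per atomic operation keeps the sketch close to the description, but handing out whole rows or tiles per fetch would likely reduce contention on the counter.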



Quote
What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).

We do not know what optimizations Martin used, although I assume he used the defaults. Therefore, both your laptop and mine (an 8-year-old Lenovo X201 Tablet) give better results. But that's not important. The important thing is that manual code alignment doesn't give us any gain when strong optimizations are used (or at least not always). I will have to look into this topic more and just check with real code what the performance looks like with and without code alignment.
Title: Re: Useful optimizations for a video game project
Post by: Martin_fr on June 24, 2022, 11:19:16 am
Quote
What is strange is that my times are lower than those of Martin on a fairly low range laptop (and also my desktop).
It seems, while I did -O3 (which afaik includes -Or), I also left other stuff at defaults. Mainly -Criot - that takes time.

About the speed diff => I think the presence of asm code can affect the optimizer.
So that example did not (fully) show my point.

Actually, in my original example ignoring the first (non-asm) routine, I got 2 diff timings in routines with diff alignment.
Removing -Criot, I no longer get that diff => the code is maybe too simple for the cpu.

But (in the mail thread that I linked), I did have an example. At that time, I also found documentation that mentioned the alignment effect.
Title: Re: Useful optimizations for a video game project
Post by: Paul_ on August 02, 2022, 05:03:46 pm
Just wondering what type of game it is?
Title: Re: Useful optimizations for a video game project
Post by: furious programming on August 02, 2022, 11:04:52 pm
Quote
Just wondering what type of game it is?

It will be an action/adventure game, with mechanics and projection similar to The Legend of Zelda: A Link to the Past (https://en.wikipedia.org/wiki/The_Legend_of_Zelda:_A_Link_to_the_Past) (1991, SNES), but much more extensive, with much nicer graphics (using low-resolution pixelart and special filters) and with couch co-op mode. PCs are thousands of times more powerful than the SNES, so there are practically no limits and I can extend it as much as I want.

I am currently working on the foundations of the game, i.e. window programming and video modes, and an advanced input mapping. Then I will take care of fonts and create controls for the UI of the game (something like mini-LCL). And then I will take care of the engine, that is, in 2-3 months. I hope to have a working prototype of the engine by the end of this year.