### Author Topic: need some optimization  (Read 9216 times)

#### airpas

• Full Member
• Posts: 179
##### need some optimization
« on: March 20, 2013, 03:28:54 pm »
Hi everyone,
recently I started porting SOL's SDL tutorial to FPC. The ported examples build successfully with FPC, but the problem is speed: the original C examples are 3x faster than the FPC ones, and some of them 4x.

So I attached one of them; I hope someone can find which part of the code slows down the rendering.

thanks

#### lainz

• Guest
##### Re: need some optimization
« Reply #1 on: March 20, 2013, 07:08:27 pm »
I don't know, but it's beautiful.

#### Martin_fr

• Hero Member
• Posts: 6606
• Debugger - SynEdit - and more
##### Re: need some optimization
« Reply #2 on: March 20, 2013, 07:45:15 pm »
I am not sure, but there may be graphics libraries (like bgra) that are already well optimized.

Anyway, if you want to do your version:

FPC does not do all optimizations as well as gcc, so you need to do some of them by hand. (I assume you compile with -O3.)

One thing to look at is loops.
```pascal
for j := 0 to 15 do
begin
    if (sprite[c] <> 0) then
        p32Array(screen^.pixels)[yofs + j] := color;
```
This has to calculate pointer + yofs + j 16 times, once per iteration.

```pascal
var
    pix: ^longword;

// before the loop
pix := @(p32Array(screen^.pixels)[yofs]);

// in the loop
pix^ := color;
inc(pix);
```
Do a similar optimization for the outer loop.
Same for sprite[c]: use a pointer, inc the pointer, and drop c.

Replace
yofs += screen^.pitch div 4;
by advancing the pointer from above:
pix := pix + NextLineDiff;

where NextLineDiff is calculated only once:
NextLineDiff := (screen^.pitch div 4) - 16; // pix was already increased by 16 in the inner loop
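Put together, the pointer-walking idea might look like the sketch below. It is not the code from the attachment: to stay self-contained it draws into a plain array instead of an SDL surface, it assumes the pitch equals the width, and the names `DrawSprite`, `pixels`, `ScreenW` and `nextLineDiff` are made up for illustration.

```pascal
program PointerWalkSketch;
{$mode objfpc}

const
  ScreenW = 640;     // assumption: pitch in pixels equals the width
  ScreenH = 480;
  SpriteSize = 16;

var
  // Stand-in for screen^.pixels (hypothetical, no SDL involved).
  pixels: array[0..ScreenW * ScreenH - 1] of LongWord;
  sprite: array[0..SpriteSize * SpriteSize - 1] of Byte;

procedure DrawSprite(x, y: Integer; color: LongWord);
var
  pix: PLongWord;
  spr: PByte;
  i, j, nextLineDiff: Integer;
begin
  pix := @pixels[y * ScreenW + x];       // computed once, before the loops
  spr := @sprite[0];
  nextLineDiff := ScreenW - SpriteSize;  // the inner loop already advanced by 16
  for i := 0 to SpriteSize - 1 do
  begin
    for j := 0 to SpriteSize - 1 do
    begin
      if spr^ <> 0 then
        pix^ := color;
      Inc(pix);                          // next pixel in the row
      Inc(spr);                          // next sprite cell, replaces the index c
    end;
    Inc(pix, nextLineDiff);              // jump to the start of the next sprite row
  end;
end;

var
  k: Integer;
begin
  // Checkerboard-like test sprite: odd cells are set.
  for k := 0 to High(sprite) do
    sprite[k] := Ord(Odd(k));

  DrawSprite(10, 20, $ff8800);

  // Spot checks against the plain index formula.
  if pixels[20 * ScreenW + 10] <> 0 then Halt(1);        // sprite[0] = 0, untouched
  if pixels[20 * ScreenW + 11] <> $ff8800 then Halt(1);  // sprite[1] = 1
  if pixels[35 * ScreenW + 25] <> $ff8800 then Halt(1);  // sprite[255] = 1
  WriteLn('ok');
end.
```

With a real SDL surface the row step would come from `screen^.pitch div 4` instead of `ScreenW`. Whether this is measurably faster depends on the compiler version and target.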

-------------------
targetr := trunc((sourcer + targetr) / 2);

Use the div operator instead.

Also, why shift everything right and then shift it left again?

targetG := (target) and $ff00;
sourceg := (source) and $ff00;
AvgG := ((targetG + sourceg) div 2) and $ff00;

Well, I am not sure that is faster...
But maybe even this will work:

Once you have taken out green (the middle one), you can do the other two in one operation:

(((Target and $ff00ff) + (source and $ff00ff)) div 2) and $ff00ff
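The masking trick can be checked against a plain per-channel average. The sketch below assumes 24-bit RGB packed into a LongWord with no alpha. `BlendAvgChannels` and `BlendAvgMasked` are hypothetical names, not the `blend_avg` from the attachment.

```pascal
program BlendAvgSketch;
{$mode objfpc}

const
  MaskRB = LongWord($ff00ff);  // red and blue together
  MaskG  = LongWord($00ff00);  // green alone

// Reference: unpack each channel, average with div, repack.
function BlendAvgChannels(source, target: LongWord): LongWord;
var
  r, g, b: LongWord;
begin
  r := (((source shr 16) and $ff) + ((target shr 16) and $ff)) div 2;
  g := (((source shr 8) and $ff) + ((target shr 8) and $ff)) div 2;
  b := ((source and $ff) + (target and $ff)) div 2;
  Result := (r shl 16) or (g shl 8) or b;
end;

// Masked variant: no shifts. The bit that drops out of a channel when
// halving lands in the neighbouring field and is removed by the mask.
function BlendAvgMasked(source, target: LongWord): LongWord;
var
  rb, g: LongWord;
begin
  rb := (((source and MaskRB) + (target and MaskRB)) div 2) and MaskRB;
  g := (((source and MaskG) + (target and MaskG)) div 2) and MaskG;
  Result := rb or g;
end;

const
  Samples: array[0..6] of LongWord =
    ($000000, $ffffff, $123456, $ff00ff, $00ff00, $010101, $fefefe);

var
  a, b: Integer;
begin
  for a := 0 to High(Samples) do
    for b := 0 to High(Samples) do
      if BlendAvgMasked(Samples[a], Samples[b]) <>
         BlendAvgChannels(Samples[a], Samples[b]) then
      begin
        WriteLn('mismatch');
        Halt(1);
      end;
  WriteLn('both variants agree on all sample pairs');
end.
```

The largest intermediate sum is $1fe01fe, so nothing overflows 32 bits.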

#### Laksen

• Hero Member
• Posts: 656
##### Re: need some optimization
« Reply #3 on: March 20, 2013, 07:52:44 pm »
Dividing an integer by an integer in C is an integer division, the same as div in Pascal.

The following correction doubled the performance of blend_avg:
```pascal
targetr := ((sourcer + targetr) div 2);
targetg := ((sourceg + targetg) div 2);
targetb := ((sourceb + targetb) div 2);
```

#### DelphiFreak

• Sr. Member
• Posts: 251
##### Re: need some optimization
« Reply #4 on: March 20, 2013, 08:20:08 pm »
Try to avoid "trunc" calculations.

This will give you 10 more frames per second:

```pascal
procedure scaleblit;
var
    i, j, k, yofs, c: integer;
begin
    yofs := 0;
    for i := 0 to 480 - 1 do begin
        k := trunc((i * 0.95) + 12) * 640;
        for j := 0 to 640 - 1 do begin
            c := k + trunc((j * 0.95) + 16);
            p32Array(screen^.pixels)[yofs + j] := blend_avg(p32Array(screen^.pixels)[yofs + j], p32Array(tempbuf)[c]);
        end;
        yofs += screen^.pitch shr 2;
    end;
end;
```

Edit: With the change mentioned in the previous post, i.e.

targetr := ((sourcer + targetr) div 2);
targetg := ((sourceg + targetg) div 2);
targetb := ((sourceb + targetb) div 2);

in blend_avg the FPC version is now faster than the C version.
« Last Edit: March 20, 2013, 08:23:33 pm by DelphiFreak »
Linux Mint 19.1, Lazarus 2.0, Windows 7&10, Delphi 7, Delphi 10.3 Rio

#### airpas

• Full Member
• Posts: 179
##### Re: need some optimization
« Reply #5 on: March 20, 2013, 09:35:31 pm »
Yes, div increases the FPS by 12 on my machine, but it's less accurate: there is no FPU calculation, just integers.
AFAIK there is some issue with array indexing in Free Pascal; it's not as well optimized as in Delphi.

@Martin_fr: thanks for the suggestions. I tried them all; only div increased the FPS, but as I said it's less accurate.

The only way I see to beat gcc is dropping down to asm level, which is more pain.

#### Martin_fr

• Hero Member
• Posts: 6606
• Debugger - SynEdit - and more
##### Re: need some optimization
« Reply #6 on: March 20, 2013, 10:39:25 pm »

Quote
the only way i see to beat gcc  is scrolling down to asm level which is more pain

Yes, "div 2" may be 0.5 off. But if you use trunc() you lose exactly the same accuracy. Your result has to be an integer, so you cannot store the .5 anyway.
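The point about accuracy can be checked directly. The minimal sketch below assumes 8-bit channels, so the sum of two channel values lies in 0..510.

```pascal
program DivVsTruncSketch;
{$mode objfpc}

var
  sum: Integer;
begin
  // Two 8-bit channel values added together give 0..510.
  for sum := 0 to 510 do
    if Trunc(sum / 2) <> sum div 2 then
    begin
      WriteLn('differs at ', sum);
      Halt(1);
    end;
  WriteLn('Trunc(sum / 2) = sum div 2 for every sum in 0..510');
end.
```

Both forms discard the fractional half, so the integer version gives the same pixel values without the floating-point round trip.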

#### Leledumbo

• Hero Member
• Posts: 8266
• Programming + Glam Metal + Tae Kwon Do = Me
##### Re: need some optimization
« Reply #7 on: March 20, 2013, 11:32:24 pm »
This C part:
Quote
void render()
{
...
for (i = 0; i < 480; i++)
memcpy(tempbuf + i * 640,
((unsigned long*)screen->pixels) +
i * PITCH, 640 * 4);
...
}
gets translated to:
Quote
for i := 0 to 480*640-1 do
p32Array(tempbuf)[i] := p32Array(screen^.pixels)[i];
Which is far from optimal. The C version uses memcpy, which is very likely optimized to use the processor's specific data copy instructions. Use FPC's System.Move to get the same effect.
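A row-by-row copy with Move, mirroring the memcpy loop, could look like the sketch below. It uses plain arrays instead of the SDL surface, and `PitchPixels` is a hypothetical row stride, chosen wider than the visible width to show why the copy goes one row at a time.

```pascal
program RowCopySketch;
{$mode objfpc}

const
  W = 640;
  H = 480;
  PitchPixels = 704;  // hypothetical row stride in pixels, wider than W

var
  // Stand-ins for screen^.pixels and tempbuf (no SDL involved).
  screenPixels: array[0..PitchPixels * H - 1] of LongWord;
  tempbuf: array[0..W * H - 1] of LongWord;
  i: Integer;
begin
  screenPixels[3 * PitchPixels + 5] := $abcdef;  // marker at row 3, column 5

  // One Move per row: the source advances by the pitch,
  // the destination by the visible width.
  for i := 0 to H - 1 do
    Move(screenPixels[i * PitchPixels], tempbuf[i * W], W * SizeOf(LongWord));

  if tempbuf[3 * W + 5] <> $abcdef then Halt(1);
  WriteLn('ok');
end.
```

A single Move over the whole buffer is equivalent only when the pitch is exactly 640 pixels.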

#### airpas

• Full Member
• Posts: 179
##### Re: need some optimization
« Reply #8 on: March 21, 2013, 06:17:38 am »
Quote
use FPC's System.Move to get the same effect.
OK, I replaced it with this: move(p32Array(screen^.pixels)[0], p32Array(tempbuf)[0], 640*480*4);

and nothing changed.

The slowness comes from the scaleblit() procedure. I don't know how to optimize it further.
« Last Edit: March 21, 2013, 06:21:06 am by airpas »

#### airpas

• Full Member
• Posts: 179
##### Re: need some optimization
« Reply #9 on: March 21, 2013, 06:36:06 am »
When I turn on SSE2 math in both compilers, look what I get:

- FPC: still the same FPS
- GCC: 2x faster than before, which means gcc is now 6x faster than our compiler
« Last Edit: March 21, 2013, 06:52:55 am by airpas »

#### Leledumbo

• Hero Member
• Posts: 8266
• Programming + Glam Metal + Tae Kwon Do = Me
##### Re: need some optimization
« Reply #10 on: March 21, 2013, 04:48:24 pm »
I wondered why the difference is so big, so I profiled my optimized version (attached) using gprof (output also attached). The program is compiled with:
Quote
-O4 -CpPENTIUMM -OpPENTIUMM -CfSSE3
using FPC 2.7.1 from a few days ago. Surprisingly, the bottleneck was 64-bit calculations. Looking at the generated ASM (again attached), that was indeed the problem, especially on line 189:
```pascal
c :=  trunc((i * 0.95) + 12) * 640 +
```
Please analyze the profiling result. We need an answer from the compiler guys on what makes FPC do the calculation on line 189 in 64-bit (and something somewhere else generates fpc_mul_qword, which I can't find in the source).

#### marcov

• Global Moderator
• Hero Member
• Posts: 8725
• FPC developer.
##### Re: need some optimization
« Reply #11 on: March 21, 2013, 05:08:33 pm »

Well, I'm no compiler guy, but this has happened before:
typically it is caused by mixing signed and unsigned integers. The range of unsigned + signed is "32 and a half" bits, which is more than 32-bit, so the calculation is done in 64-bit. A quick look at your source shows longwords, and a signed type (integer, or the return value of some function like trunc()) is easily found.

Remedy: cast the integers (and routines like trunc()) to longwords if they are unsigned anyway, or cast every unsigned to integer first.
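One related detail: in FPC, Trunc returns Int64, so multiplying its result directly is a 64-bit multiplication. The sketch below narrows the result to 32 bits before multiplying. The variable names are made up, and whether this actually removes the 64-bit helper calls depends on the target and compiler version; it has not been checked against the generated assembly.

```pascal
program NarrowTruncSketch;
{$mode objfpc}

var
  i: Integer;
  wide, narrow: LongInt;
begin
  for i := 0 to 479 do
  begin
    // Trunc returns Int64, so this multiplication is an Int64 one.
    wide := Trunc(i * 0.95 + 12) * 640;

    // Narrow the Trunc result to 32 bits first, then multiply.
    narrow := LongInt(Trunc(i * 0.95 + 12)) * 640;

    if wide <> narrow then
    begin
      WriteLn('mismatch at row ', i);
      Halt(1);
    end;
  end;
  WriteLn('same offsets either way');
end.
```

The values stay far below the 32-bit limit here (at most 479 * 0.95 + 12, times 640), so narrowing loses nothing.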

#### Leledumbo

• Hero Member
• Posts: 8266
• Programming + Glam Metal + Tae Kwon Do = Me
##### Re: need some optimization
« Reply #12 on: March 21, 2013, 05:44:11 pm »
Quote
Remedy: cast the integers (and routines like trunc()) to longwords if they are unsigned anyway, or cast every unsigned to integer first.
Done, no more fpc_mul_xxx is generated, but the FPS is still around 30. Also, I dropped the optimization to level 3 (level 4 gives an AV). Attached are the changes.

#### airpas

• Full Member
• Posts: 179
##### Re: need some optimization
« Reply #13 on: March 22, 2013, 12:08:21 pm »
This is my optimization of scaleblit(). I get around 50 FPS now, without even looking at the output asm.
As you can see, I split the previous formula into small pieces of operations.
```pascal
procedure scaleblit();
var
    i, j, yofs, c: integer;
begin
    yofs := 0;
    for i := 0 to 480 - 1 do
    begin
        for j := 0 to 640 - 1 do
        begin
            c := trunc(i * 0.95);
            c += 12;
            c *= 640;
            c += trunc(j * 0.95);
            c += 16;
            p32Array(screen^.pixels)[yofs + j] :=
                blend_avg(p32Array(screen^.pixels)[yofs + j], p32Array(tempbuf)[c]);
        end;
        yofs += screen^.pitch shr 2;
    end;
end;
```
« Last Edit: March 22, 2013, 12:10:40 pm by airpas »

#### airpas

• Full Member
• Posts: 179
##### Re: need some optimization
« Reply #14 on: March 22, 2013, 05:29:45 pm »
I reached gcc speed (80 FPS) by replacing trunc with round.

```pascal
procedure scaleblit();
var
    i, j, yofs, c: integer;
    pl, pb: p32Array;
begin
    yofs := 0;
    pl := screen^.pixels;
    pb := pointer(tempbuf);
    for i := 0 to 480 - 1 do
    begin
        for j := 0 to 640 - 1 do
        begin
            c := round(i * 0.95);
            c := (c + 12) * 640;
            inc(c, round(j * 0.95));
            inc(c, 16);
            pl[yofs + j] := blend_avg(pl[yofs + j], pb[c]);
        end;
        inc(yofs, screen^.pitch shr 2);
    end;
end;
```
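Note that Round and Trunc do not always pick the same source pixel: Round goes to the nearest integer, Trunc cuts toward zero. The rendered output can therefore differ slightly from the trunc version. The small sketch below counts how many of the 480 row offsets are affected.

```pascal
program RoundVsTruncSketch;
{$mode objfpc}

var
  i, differing: Integer;
begin
  differing := 0;
  for i := 0 to 479 do
    if Round(i * 0.95) <> Trunc(i * 0.95) then
      Inc(differing);
  WriteLn('rows where Round and Trunc disagree: ', differing, ' of 480');
end.
```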