Lazarus

Topic started by: airpas on March 20, 2013, 03:28:54 pm

Title: need some optimization
Post by: airpas on March 20, 2013, 03:28:54 pm
Hi everyone,
I recently started porting Sol's SDL tutorial to FPC. The ported examples build successfully with FPC, but the problem is speed: the original C examples run 3x faster than the FPC ones, and some examples 4x.

So I attached one of them; I hope someone can find which part of the code slows down the rendering.

Thanks
Title: Re: need some optimization
Post by: lainz on March 20, 2013, 07:08:27 pm
I don't know, but it's beautiful.
Title: Re: need some optimization
Post by: Martin_fr on March 20, 2013, 07:45:15 pm
I am not sure, there may be graphics libraries (like bgra) that are already well optimized.

Anyway, if you want to do your own version:

FPC does not do all optimizations as well as gcc, so you need to do them by hand. (I assume you compile with -O3.)

One thing to look at is loops.
Code:
        for j := 0 to 15 do
        begin
            if (sprite[c] <> 0) then
              p32Array(screen^.pixels)[yofs + j] := color;

This has to calculate pointer + yofs + j on each of the 16 iterations.

Instead, use a pointer:
Code:
var pix: ^longword;
// before the loop
  pix := @(p32Array(screen^.pixels)[yofs]);
// in the loop
  if (sprite[c] <> 0) then
    pix^ := color;
  inc(pix);

Do a similar optimization for the outer loop...
Same for sprite[c]: use a pointer, inc the pointer, drop c.

Replace
        yofs += screen^.pitch div 4;
by advancing the pointer from above
   pix := pix + NextLineDiff;

where NextLineDiff is calculated only once:
   NextLineDiff := (screen^.pitch div 4) - 16; // pix was already advanced by 16 in the inner loop
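
The pointer-walking idea above can be sketched as a small standalone program. This is only an illustration under assumptions: `frame` is a plain array standing in for screen^.pixels, `PitchWords` stands in for screen^.pitch div 4, and `DrawSprite` is a hypothetical name, not a routine from the attached example.

```pascal
program SpritePointerDemo;
{$mode objfpc}

const
  PitchWords = 640;   // assumed row stride in pixels (screen^.pitch div 4)
  SpriteSize = 16;

var
  frame: array[0..PitchWords * 480 - 1] of longword;  // stand-in for screen^.pixels
  sprite: array[0..SpriteSize * SpriteSize - 1] of byte;
  k: integer;

// Hypothetical helper: draws the sprite by walking two pointers
// instead of recomputing array indices for every pixel.
procedure DrawSprite(x, y: integer; color: longword);
var
  pix: PLongWord;
  src: PByte;
  i, j, nextLineDiff: integer;
begin
  pix := @frame[y * PitchWords + x];
  src := @sprite[0];
  nextLineDiff := PitchWords - SpriteSize;  // calculated once only
  for i := 0 to SpriteSize - 1 do
  begin
    for j := 0 to SpriteSize - 1 do
    begin
      if src^ <> 0 then
        pix^ := color;
      inc(pix);   // next pixel in the row
      inc(src);   // next sprite byte
    end;
    inc(pix, nextLineDiff);  // jump to the start of the next sprite row
  end;
end;

begin
  // Simple test pattern: every odd sprite byte is set.
  for k := 0 to High(sprite) do
    sprite[k] := k mod 2;
  DrawSprite(10, 20, $FF0000);
  writeln(frame[20 * PitchWords + 10] = 0);        // sprite byte 0 is empty
  writeln(frame[20 * PitchWords + 11] = $FF0000);  // sprite byte 1 is set
end.
```

Whether this beats the indexed version depends on the compiler version and flags, so it is worth measuring both.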


-------------------
    targetr := trunc((sourcer + targetr) / 2);

Use the div operator instead.

Also, why shift everything right and then shift it left again?

targetG := target and $ff00;
sourceg := source and $ff00;
AvgG := ((targetG + sourceg) div 2) and $ff00;

Well, I am not sure that is faster...
But maybe even this will work:

Once you have taken out green (the middle one), you can do the other two in one operation:

(((target and $ff00ff) + (source and $ff00ff)) div 2) and $ff00ff
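
To check that the masked version really gives the same result as averaging each channel separately, here is a minimal sketch. `BlendAvgChannels` and `BlendAvgMasked` are names made up for this illustration, and both ignore the top (alpha) byte, as the blend_avg in this thread does.

```pascal
program BlendAvgDemo;
{$mode objfpc}

const
  Samples: array[0..3] of longword = ($000000, $ffffff, $123456, $80ff01);

// Reference: average each 8-bit channel on its own.
function BlendAvgChannels(source, target: longword): longword;
var
  r, g, b: longword;
begin
  r := ((source and $ff) + (target and $ff)) div 2;
  g := (((source shr 8) and $ff) + ((target shr 8) and $ff)) div 2;
  b := (((source shr 16) and $ff) + ((target shr 16) and $ff)) div 2;
  Result := r or (g shl 8) or (b shl 16);
end;

// Masked variant: green alone, red and blue together in one operation.
// Each channel sum needs 9 bits; the spare bits between the masked
// channels absorb the carry, and the final mask drops the bit that
// div 2 shifts down into the neighbouring channel.
function BlendAvgMasked(source, target: longword): longword;
begin
  Result :=
    ((((source and $ff00ff) + (target and $ff00ff)) div 2) and $ff00ff) or
    ((((source and $00ff00) + (target and $00ff00)) div 2) and $00ff00);
end;

var
  i, j: integer;
  ok: boolean;
begin
  ok := true;
  for i := 0 to High(Samples) do
    for j := 0 to High(Samples) do
      if BlendAvgChannels(Samples[i], Samples[j]) <>
         BlendAvgMasked(Samples[i], Samples[j]) then
        ok := false;
  writeln('variants agree: ', ok);
end.
```

The sample list is tiny; a loop over random values would be a more thorough check.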
Title: Re: need some optimization
Post by: Laksen on March 20, 2013, 07:52:44 pm
Integer division in C (int / int) is the same as div in Pascal.

The following correction doubled performance in blend_avg:
Code:
    targetr := ((sourcer + targetr) div 2);
    targetg := ((sourceg + targetg) div 2);
    targetb := ((sourceb + targetb) div 2);
Title: Re: need some optimization
Post by: DelphiFreak on March 20, 2013, 08:20:08 pm
Try to avoid "trunc" calculations.

This will give you 10 more frames per second:

  procedure scaleblit;
  var
    i, j, k, yofs, c: integer;
  begin
    yofs := 0;
    for i := 0 to 480 - 1 do begin
      k:=trunc((i * 0.95) + 12) * 640;
      for j := 0 to 640 - 1 do begin
        c := k + trunc((j * 0.95) + 16);
        p32Array(screen^.pixels)[yofs + j] := blend_avg(p32Array(screen^.pixels)[yofs + j], p32Array(tempbuf)[c]);
      end;
      yofs += screen^.pitch shr 2;
    end;
  end; 

Edit: With the change from the previous post applied in blend_avg,

    targetr := ((sourcer + targetr) div 2);
    targetg := ((sourceg + targetg) div 2);
    targetb := ((sourceb + targetb) div 2);

the FPC version is now faster than the C version.
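
Taking the "avoid trunc" idea one step further, the row and column offsets could be computed once into lookup tables, so the inner loop does no floating-point work at all. This is a swapped-in technique, not something from the attached code; `rowOfs` and `colOfs` are hypothetical names, and the 640x480 size and the 0.95 / 12 / 16 constants are taken from scaleblit above.

```pascal
program ScaleTableDemo;
{$mode objfpc}

const
  W = 640;
  H = 480;

var
  colOfs: array[0..W - 1] of longint;  // trunc(j * 0.95) + 16 for each column
  rowOfs: array[0..H - 1] of longint;  // (trunc(i * 0.95) + 12) * W for each row
  i, j: integer;
begin
  // Fill both tables once, e.g. at program start.
  for j := 0 to W - 1 do
    colOfs[j] := trunc(j * 0.95) + 16;
  for i := 0 to H - 1 do
    rowOfs[i] := (trunc(i * 0.95) + 12) * W;

  // Inside the blit loops the source index is then two table reads:
  //   c := rowOfs[i] + colOfs[j];
  writeln(rowOfs[0] + colOfs[0]);
  writeln(rowOfs[H - 1] + colOfs[W - 1]);
end.
```

The tables only need rebuilding if the scale factor or the offsets change.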
Title: Re: need some optimization
Post by: airpas on March 20, 2013, 09:35:31 pm
Yes, div increases the FPS by 12 on my machine, but it's less accurate: there is no FPU calculation, just integers.
AFAIK there is some issue with array indexing in Free Pascal; it's not as well optimized as in Delphi.

@Martin_fr: thanks for the suggestions. I tried them all; only div increased the FPS, but as I said, it's less accurate.

The only way I see to beat gcc is going down to asm level, which is more pain.
Title: Re: need some optimization
Post by: Martin_fr on March 20, 2013, 10:39:25 pm

Quote
The only way I see to beat gcc is going down to asm level, which is more pain.

Yes, "div 2" may be 0.5 off. But if you use trunc() then you lose exactly the same accuracy. Your result has to be an integer, so you cannot store the .5.
Title: Re: need some optimization
Post by: Leledumbo on March 20, 2013, 11:32:24 pm
This C part:
Quote
void render()
{
...
    for (i = 0; i < 480; i++)
        memcpy(tempbuf + i * 640,
               ((unsigned long*)screen->pixels) +
               i * PITCH, 640 * 4);
...
}
gets translated to:
Quote
for i := 0 to 480*640-1 do
     p32Array(tempbuf)[i] := p32Array(screen^.pixels)[i];
which is far from optimal. The C version uses memcpy, which is very likely optimized to use the processor's specific data-copy instructions. Use FPC's System.Move to get the same effect.
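
A minimal sketch of the row-by-row copy with System.Move, mirroring the memcpy loop from the C code. The arrays here are stand-ins for screen^.pixels and tempbuf, and `PitchWords = 672` is a made-up stride, chosen only to show that the copy still works when a row is wider than 640 pixels.

```pascal
program RowCopyDemo;
{$mode objfpc}

const
  W = 640;
  H = 480;
  PitchWords = 672;  // hypothetical row stride in pixels, may exceed W

var
  screenPixels: array[0..PitchWords * H - 1] of longword;  // stand-in for screen^.pixels
  tempbuf: array[0..W * H - 1] of longword;
  i: integer;
begin
  screenPixels[5 * PitchWords + 7] := $abcdef;  // marker pixel at row 5, column 7

  // One Move per row: source row starts at i * PitchWords,
  // destination row at i * W, and W * 4 bytes are copied.
  for i := 0 to H - 1 do
    Move(screenPixels[i * PitchWords], tempbuf[i * W], W * SizeOf(longword));

  writeln(tempbuf[5 * W + 7] = $abcdef);
end.
```

If the stride is exactly 640 pixels, a single Move of 640*480*4 bytes does the same job.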
Title: Re: need some optimization
Post by: airpas on March 21, 2013, 06:17:38 am
Quote
use FPC's System.Move to get the same effect.
OK, I replaced it with this: move(p32Array(screen^.pixels)[0], p32Array(tempbuf)[0],640*480*4);

and nothing changed.

The slowness comes from the scaleblit() procedure. I don't know how to optimize it further.
Title: Re: need some optimization
Post by: airpas on March 21, 2013, 06:36:06 am
When I turn on SSE2 math in both compilers, look what I get:

- FPC: still the same FPS
- GCC: 2x faster than before, which means gcc is now 6x faster than our compiler
Title: Re: need some optimization
Post by: Leledumbo on March 21, 2013, 04:48:24 pm
I wondered why the difference is so big, so I profiled my optimized version (attached) using gprof (output also attached). The program is compiled with:
Quote
-O4 -CpPENTIUMM -OpPENTIUMM -CfSSE3
using FPC 2.7.1 from a few days ago. Surprisingly, the bottleneck was 64-bit calculations. Looking at the generated ASM (again attached), that was indeed the problem, esp. on line 189:
Code:
c :=  trunc((i * 0.95) + 12) * 640 +
Please analyze the profiling result. We need an answer from the compiler guys on what makes FPC do the calculation on line 189 in 64-bit (and somewhere else that generates fpc_mul_qword, which I can't find in the source).
Title: Re: need some optimization
Post by: marcov on March 21, 2013, 05:08:33 pm
Well, I'm no compiler guy, but this has happened before:
Typically it comes from mixing signed and unsigned integers. The range of unsigned + signed is "32 and a half" bits, which is > 32-bit, so the compiler goes to 64-bit. A quick look at your source shows longwords there, and a signed type (an integer, or the return value of some function like trunc()) is easily found.

Remedy: cast the integers (and routines like trunc()) to longwords if they are unsigned anyway, or cast every unsigned to integer first.
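
A small sketch of the remedy, assuming (as described above) that FPC widens a longword * longint expression to int64 on 32-bit targets. The variable names are made up for illustration; the values match the 640-pixel row and the +12 offset from scaleblit.

```pascal
program MixedSignDemo;
{$mode objfpc}

var
  pitchWords: longword;  // unsigned, like a value derived from screen^.pitch
  row: longint;          // signed, like a loop counter or a trunc() result
  idxMixed: int64;
  idxCast: longint;
begin
  pitchWords := 640;
  row := 12;

  // Mixed signedness: the multiplication may be carried out in int64.
  idxMixed := pitchWords * row;

  // Casting the unsigned operand first keeps both operands longint,
  // so the multiplication can stay 32-bit.
  idxCast := longint(pitchWords) * row;

  writeln(idxMixed, ' ', idxCast);
end.
```

Both results are the same here; the difference only shows up in the generated code, so checking the asm output for fpc_mul_qword is the way to confirm it.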
Title: Re: need some optimization
Post by: Leledumbo on March 21, 2013, 05:44:11 pm
Quote
Remedy: cast the integers (and routines like trunc()) to longwords if they are unsigned anyway, or cast every unsigned to integer first.
Done, no more fpc_mul_xxx generated, but the FPS is still around 30. Plus, I dropped the optimization to level 3 (AV with level 4). Attached are the changes.
Title: Re: need some optimization
Post by: airpas on March 22, 2013, 12:08:21 pm
This is my optimization of scaleblit(); I now get around 50 FPS, without looking at the output asm.
As you can see, I split the previous formula into small operations.
Code:
procedure scaleblit();
var
    i, j, yofs,c :integer;
begin
    yofs := 0;
    for i := 0  to 480-1 do
    begin
        for j := 0 to 640 - 1 do
        begin
             c := trunc(i * 0.95);
             c += 12;
             c *= 640;
             c += trunc(j * 0.95);
             c += 16;
             p32Array(screen^.pixels)[yofs + j] :=
                   blend_avg(p32Array(screen^.pixels)[yofs + j], p32Array(tempbuf)[c]);
        end;
        yofs += screen^.pitch shr 2;
    end;

end;     
Title: Re: need some optimization
Post by: airpas on March 22, 2013, 05:29:45 pm
:D I reached gcc speed (80 FPS); all I did was replace trunc with round.

Code:
procedure scaleblit();
var
    i, j, yofs,c :integer;
    pl,pb : p32Array;
begin
    yofs := 0;
    pl := screen^.pixels;
    pb := pointer(tempbuf);
    for i := 0  to 480-1 do
    begin
        for j := 0 to 640 - 1 do
        begin
             c := round(i * 0.95);
             c := (c + 12)*640;
             inc(c,round(j * 0.95));
             inc(c,16);
             pl[yofs + j] := blend_avg(pl[yofs + j], pb[c]);
        end;
        inc(yofs, screen^.pitch shr 2);
    end;

end;

Title: Re: need some optimization
Post by: Leledumbo on March 22, 2013, 10:00:47 pm
Surprisingly, changing the blend_avg function to:
Code:
function blend_avg(const source,target:longword):longword;
var
   sourcer,sourceg,sourceb,targetr,targetg,targetb : longword;
begin
    sourcer := (source shr  0) and $ff;
    sourceg := (source shr  8) and $ff;
    sourceb := (source shr 16) and $ff;
    targetr := (target shr  0) and $ff;
    targetg := (target shr  8) and $ff;
    targetb := (target shr 16) and $ff;

    // average: halve each channel sum
    targetr := (sourcer + targetr) shr 1;
    targetg := (sourceg + targetg) shr 1;
    targetb := (sourceb + targetb) shr 1;

    result := (targetr shl  0) or (targetg shl  8) or (targetb shl 16);
end;
boosts the code 3-4 times! Now I get 120 fps :o
Title: Re: need some optimization
Post by: airpas on July 02, 2013, 03:55:08 pm
Hi again,
OK, this is an old post. We were talking about how to speed up float calculations with FPC. I thought the only way to get real optimization was to use asm directly, but I was wrong: we can use gcc instead.

What I did was compile 2 functions (blend_avg, scaleblit) using gcc with the -O3 -ffast-math -msse2 -mfpmath=sse switches, then link the generated object file with FPC.

What amazed me is that the FPC version is now 15% faster than the gcc one.

I added the attachments.
Title: Re: need some optimization
Post by: User137 on July 02, 2013, 05:20:37 pm
I noticed there is still a line like this:
Code:
trunc(320 + sin(d * 0.0034) * sin(d * 0.0134) * 300)
It's actually a 320.0 floating-point sum; Pascal just does the conversion automatically. It might optimize a tiny bit if you sum integers instead of floats:
Code:
320 + trunc(sin(d * 0.0034) * sin(d * 0.0134) * 300.0)
Another small thing you can do is calculate (tick * 0.2 + i) into a variable first, because you do that 3 times.
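
One thing to keep in mind with the trunc rewrite: the two forms are not always identical. trunc rounds toward zero, so when the sine product is negative with a fractional part, moving 320 outside the trunc can shift the result by one pixel. A minimal sketch, with x standing in for the sin(...) * sin(...) * 300 term:

```pascal
program TruncOrderDemo;
{$mode objfpc}

var
  x: double;
begin
  x := -0.5;  // the sine product can go negative
  writeln(trunc(320 + x));  // trunc(319.5) = 319
  writeln(320 + trunc(x));  // 320 + 0 = 320

  x := 12.75;  // for non-negative values both forms agree
  writeln(trunc(320 + x), ' ', 320 + trunc(x));  // 332 332
end.
```

For a visual effect like this one an off-by-one pixel probably does not matter, but it explains small differences if the two outputs are compared directly.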

And lastly, I wonder if you actually have to draw single pixels with SDL. Aren't there functions which can optimally draw whole images with 1 command?
Title: Re: need some optimization
Post by: airpas on July 02, 2013, 06:40:49 pm
Quote
It's actually a 320.0 floating-point sum; Pascal just does the conversion automatically. It might optimize a tiny bit if you sum integers instead of floats

Yes, you're right, but since this loop runs just 128 times it doesn't matter.

Quote
And lastly, I wonder if you actually have to draw single pixels with SDL. Aren't there functions which can optimally draw whole images with 1 command?
There are a lot of situations where you need to draw single pixels (raytracing, for example).