Author Topic: inlining slows execution  (Read 408 times)

MountainQ

  • New Member
  • *
  • Posts: 37
inlining slows execution
« on: August 16, 2019, 10:58:04 am »
Hello everyone,

The topic of execution speed might be overrated, yet I (and probably others as well) am somewhat attracted to it.
My question concerns the inline modifier: when I use a nested 'inlined' procedure, execution slows down significantly compared to the situation where the code of the nested procedure is copied into the hosting procedure.
The code using the nested procedure looks as follows:
Code: Pascal
{$INLINE on}

procedure Execute();
var
  ...
  procedure update; inline;
  begin
    ...
  end;
begin
  while somecondition do
  begin
    ...
    update();
    ...
  end;
end;

I am using FPC 3.0.4/Lazarus 2.0.4; thanks for the new release!
I do not get a warning that the procedure cannot be inlined.
Is this an accepted/known issue? Can nested procedures not be inlined?
Best to everyone



marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7312
Re: inlining slows execution
« Reply #1 on: August 16, 2019, 12:59:25 pm »
If the code in update() generates too much assembler, that is possible, yes.

Also, did you enable some optimization? That is sometimes needed to clear away some of the glue code around the inline.

Is the code executed every iteration?

julkas

  • Sr. Member
  • ****
  • Posts: 312
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: inlining slows execution
« Reply #2 on: August 16, 2019, 01:11:34 pm »
Can you try the following code with inline (it should be 2 times faster) and without?
Code: Pascal
program test;
{$mode objfpc}
{$optimization on}
{$inline on}
procedure l0;
var
  i: longint;
  procedure l1(v: longint); //inline;
  begin
    v := (v*v*v) div 10007;
  end;
begin
  for i:=0 to 100000000 do
    l1(i);
end;
begin
  l0;
  writeln('ok');
end.

procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;
(* Pointer game *) Inc(ptr, 1); (* vs *) ptr := ptr + 1;

Thaddy

  • Hero Member
  • *****
  • Posts: 8511
Re: inlining slows execution
« Reply #3 on: August 16, 2019, 01:46:46 pm »
{$optimization on} equals -O2; try again with -O4.
By the way, on ARM, without timing but judging by the generated assembler, it is definitely faster with FPC -CX -XXs  -O4 -Sv  -a

Note, again, that aligning is essential. I used {$IOCHECKS OFF}{$CODEALIGN LOOP=4}{$OPTIMIZATION LEVEL4}

Note that loop unrolling is not part of -O2, I think (and the loop is too big).

And this is near optimal:
Code: ASM
# Peephole Add/Sub to Preindexed done
        str     r0,[r13, #-8]!
        mul     r0,r1,r1
# tcgarm.a_mul_reg_reg_pair called
        ldr     r3,.Lj11
        mul     r0,r1,r0
        smull   r2,r1,r0,r3
# Peephole FoldShiftProcess done
        mov     r3,r0,lsr #31
        add     r1,r3,r1,asr #12
        add     r13,r13,#8
        bx      r14
.Lj11:
        .long   1757988013
.Le1:

You can experiment further with the codealign; depending on your processor, that has an impact.
« Last Edit: August 16, 2019, 02:10:36 pm by Thaddy »
Read the manuals and if you are a professional get a proper education in computer science. Makes the forum a lot cleaner.

MountainQ

  • New Member
  • *
  • Posts: 37
Re: inlining slows execution
« Reply #4 on: August 16, 2019, 03:00:41 pm »
Thank you all very much for the rapid replies; luckily I am more excited than in a real rush.
The inlined procedure updates a list of recent values and its variance:
Code: Pascal
procedure InlineExecute();
var
  i, count: integer;
  delta, sumx, sumx2, noise, invc: double;
  recent: array of double;
  pi, pf: PDouble;

  procedure update_noise; inline;
  begin
    sumx += delta-pi^;
    sumx2 += sqr(delta)-sqr(pi^);
    pi^ := delta;
    noise := invc*abs(sumx2*count-sqr(sumx));
    if pi < pf then
      inc(pi)
    else
      pi := @recent[0];
  end;

begin
  count := 10000;
  invc := 1/sqr(count);
  SetLength(recent, count);
  pi := @recent[0];
  pf := @recent[Length(recent)-1];

  for i := 0 to round(1E8) do
  begin
    delta := i;
    update_noise;
    {sumx += delta-pi^;
    sumx2 += sqr(delta)-sqr(pi^);
    pi^ := delta;
    noise := invc*abs(sumx2*count-sqr(sumx))/sqr(count);
    if pi < pf then
      inc(pi)
    else
      pi := @recent[0];}
  end;
end;

@marcov: is that too long?
@julkas: thanks for the snippet; I also get a boost of (almost) a factor of 2; still, it becomes a bit faster when you copy the code into the enclosing procedure.
@thaddy: I played a little bit with the optimization level (-O2..-O4), but it had little effect. I will look at the options you provided.

The above code is still (~10%) faster if no nested procedure is used. Of course there is more interesting stuff going on in the 'outer' procedure.
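
One way to put numbers on a comparison like this is a small timing harness. The sketch below is not the code from this thread: the program name, the procedures WithNested and ManualCopy, and the trivial loop body are made up for illustration, and it assumes that GetTickCount64 from SysUtils is precise enough for a loop of this size.

```pascal
program timing_sketch;
{$mode objfpc}
{$inline on}
uses
  SysUtils;

const
  Iterations = 100000000;

var
  sink: double;  // global result, so the loops cannot be dropped as dead code

// Variant 1: the loop body lives in a nested procedure marked inline.
procedure WithNested;
var
  i: longint;
  x, acc: double;

  procedure step; inline;
  begin
    acc := acc + x * 0.5;  // reads and writes locals of the enclosing procedure
  end;

begin
  acc := 0;
  for i := 0 to Iterations do
  begin
    x := i;
    step;
  end;
  sink := acc;
end;

// Variant 2: the same body copied by hand into the loop.
procedure ManualCopy;
var
  i: longint;
  x, acc: double;
begin
  acc := 0;
  for i := 0 to Iterations do
  begin
    x := i;
    acc := acc + x * 0.5;
  end;
  sink := acc;
end;

var
  t0: QWord;
begin
  t0 := GetTickCount64;
  WithNested;
  writeln('nested inline: ', GetTickCount64 - t0, ' ms');

  t0 := GetTickCount64;
  ManualCopy;
  writeln('manual copy:   ', GetTickCount64 - t0, ' ms');
end.
```

Building it with the same options each time and running it several times should give a rough idea of whether a difference of about 10% is reproducible or within run-to-run noise.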


k1ng

  • New Member
  • *
  • Posts: 31
Re: inlining slows execution
« Reply #5 on: August 16, 2019, 04:18:56 pm »
Can anyone test this:

Code: Pascal
program test;
{$mode objfpc}
{$optimization on}
{$inline on}

procedure directfunction;
var
  i, v: longint;
begin
  for i:=0 to 100000000 do
    v := (i*i*i) div 10007;
end;

procedure l1(v: longint); //inline; <--comment/uncomment
begin
  v := (v*v*v) div 10007;
end;

procedure funcinline;
var
  i: longint;
begin
  for i:=0 to 100000000 do
    l1(i);
end;

begin
  directfunction;
  funcinline;
  writeln('ok');
end.

Maybe it's a problem with functions in functions?
Maybe one should create a bug report...

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 5524
    • wiki
Re: inlining slows execution
« Reply #6 on: August 16, 2019, 04:45:06 pm »
Compile your code (with AND without inline) using -al as compiler option.

Keep the generated assembler files, and compare them. Then you can see if anything got inlined. (And you can also copy the code into place, and see if that is any different)
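
A minimal sketch of such a comparison follows. The program name inline_cmp, the procedure names, and the USE_INLINE symbol are hypothetical; the idea is only that the inline modifier can be switched from the command line with -d, so the same source is compiled twice and the two assembler listings can be compared.

```pascal
program inline_cmp;
{$mode objfpc}
{$inline on}

// Possible workflow (file names are placeholders):
//   fpc -al -O2 inline_cmp.pas              then keep inline_cmp.s as plain.s
//   fpc -al -O2 -dUSE_INLINE inline_cmp.pas then keep inline_cmp.s as inlined.s
// Compare the two files and look for a call to Step inside Outer.

var
  sink: longint;  // global result, so the loop is not removed

procedure Outer;
var
  i, acc: longint;

  // The inline modifier is only present when USE_INLINE is defined.
  procedure Step(v: longint); {$ifdef USE_INLINE} inline; {$endif}
  begin
    acc := acc xor v;  // accesses a local of the enclosing procedure
  end;

begin
  acc := 0;
  for i := 0 to 1000 do
    Step(i);
  sink := acc;
end;

begin
  Outer;
  writeln(sink);
end.
```

If the listing built with -dUSE_INLINE still contains a call to the nested procedure, the compiler did not inline it, whatever the source says.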

Thaddy

  • Hero Member
  • *****
  • Posts: 8511
Re: inlining slows execution
« Reply #7 on: August 16, 2019, 05:52:09 pm »
<warning: feature is for freaks>
It is a known fact that a nested procedure that accesses local variables declared above it is not optimal. Make sure you align the stack ({$codealign localmin=<usually native pointer size>}).
If you declare a procedure outside and declare it isnested, it is more optimized. If it does not use any local parent variables at all, performance is maximal, as it is for a local procedure that takes parameters and is declared above the local variables. Note that isnested needs {$modeswitch nestedprocvars}; see the user manual.

FPC can at least do something about it. In Delphi, for example, there are the same speed problems with local procedures declared after local variables, and they can't be fixed!

The cause is that local procedures (entry points at least) are also allocated on the stack, and the stack is byte-aligned by default. That means you can also pad the stack yourself, but localmin will do that for you.
Thus you can make sure the procedure is called on a natural boundary. Try it! This has a huge effect. IsNested works similarly, afaik, but takes some getting used to, and access to local vars is difficult.

This may not be totally correct, but it is close, and aligning the stack works magic.

For beginners on the subject or as a guideline:
32 bit platform: {$codealign localmin=4}
64 bit platform: {$codealign localmin=8}

Codealign is a local switch, so can be used with {$push}/{$pop} on a very finely grained basis.
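
As a rough illustration of that fine-grained use, here is a sketch that wraps a single procedure in {$push}/{$pop}. The program and procedure names are invented, localmin=8 assumes a 64-bit target, and whether it changes the timing at all depends on the compiler version and the processor, so measure before relying on it.

```pascal
program align_sketch;
{$mode objfpc}

var
  sink: double;  // global result, so the loop is not removed

{$push}
{$codealign localmin=8}  // assumption: 64-bit target; use localmin=4 on 32-bit
procedure Hot;
var
  i: longint;
  x, acc: double;

  procedure Step;
  begin
    acc := acc + x;  // touches locals of the enclosing procedure
  end;

begin
  acc := 0;
  for i := 0 to 1000 do
  begin
    x := i;
    Step;
  end;
  sink := acc;
end;
{$pop}  // settings saved by the push apply again from here on

begin
  Hot;
  writeln(sink:0:1);
end.
```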
« Last Edit: August 16, 2019, 06:09:30 pm by Thaddy »

BOSHU

  • Newbie
  • Posts: 1
Re: inlining slows execution
« Reply #8 on: August 16, 2019, 06:10:23 pm »
Probably it's an alignment problem.
See https://bugs.freepascal.org/view.php?id=26089

Thaddy

  • Hero Member
  • *****
  • Posts: 8511
Re: inlining slows execution
« Reply #9 on: August 16, 2019, 07:06:47 pm »
Yes, but it can be mitigated as I explained. AFAIK you cannot do that in Delphi.
(Wasn't {$codealign localmin} etc. introduced because of that?)
That report discusses 2.6.2.
« Last Edit: August 16, 2019, 07:34:10 pm by Thaddy »