
Author Topic: Loop unrolling  (Read 5838 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #15 on: July 26, 2020, 01:28:01 pm »
I assume it unrolls with some complexity heuristic that is inspired by the uop cache size. Unrolling loops to larger than the uop cache size can be detrimental.

ASerge

  • Hero Member
  • *****
  • Posts: 2242
Re: Loop unrolling
« Reply #16 on: July 26, 2020, 08:07:51 pm »
My question has not been answered yet: is that limit of 60 configurable?  :)
And 60 is not the limit. In this program (Lazarus 2.0.10 x64, Windows) the loop is unrolled 180 times!
Code: Pascal
{$APPTYPE CONSOLE}
{$MODE OBJFPC}
procedure Print;
const
  NumberOfCalls: Integer = 1;
begin
  Writeln(NumberOfCalls);
  Inc(NumberOfCalls);
end;

var
  Index: SizeInt;
begin
  {$OPTIMIZATION LOOPUNROLL}
  for Index := 1 to 180 do
    Print;
  Readln;
end.
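
If you want to check the unroll count yourself, one way (a sketch; the flags used in the comment are the usual FPC 3.x command-line options, assumed here rather than taken from this thread) is to keep the generated assembler file and count the emitted calls:
Code: Pascal
{ Sketch: compile with
    fpc -O3 -OoLOOPUNROLL -a -al unrolltest.pas
  (-a keeps the generated .s file, -al interleaves the source lines),
  then open unrolltest.s and count the call instructions to Print
  (its mangled name contains PRINT). }
{$MODE OBJFPC}

procedure Print;
const
  NumberOfCalls: Integer = 1;
begin
  Writeln(NumberOfCalls);
  Inc(NumberOfCalls);
end;

var
  Index: SizeInt;
begin
  {$OPTIMIZATION LOOPUNROLL}
  for Index := 1 to 180 do
    Print;
end.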

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9870
  • Debugger - SynEdit - and more
    • wiki
Re: Loop unrolling
« Reply #17 on: July 26, 2020, 09:40:53 pm »
And 60 is not the limit. In this program (Lazarus 2.0.10 x64, Windows) the loop is unrolled 180 times!
Does it perform better when unrolled?

I mean, 180 copies of the call must require some additional cache lines to be loaded?

I know the call goes to other code that is in a different cache line anyway, but I would expect that, if the callee is small enough, the caller's cache line would still be there on return, or not?

PascalDragon

  • Hero Member
  • *****
  • Posts: 5481
  • Compiler Developer
Re: Loop unrolling
« Reply #18 on: July 26, 2020, 10:47:32 pm »
The number of unrolls is determined by the complexity of the loop's body. See optloop.number_unrolls.

Bi0T1N

  • Jr. Member
  • **
  • Posts: 85
Re: Loop unrolling
« Reply #19 on: July 27, 2020, 11:48:43 am »
The number of unrolls is determined by the complexity of the loop's body. See optloop.number_unrolls.
Wouldn't it make sense to update the code to support modern CPUs (x64) as well?
Code: Pascal
{$ifdef i386}
        { multiply by 2 for CPUs with a long pipeline }
        if current_settings.optimizecputype in [cpu_Pentium4] then
          number_unrolls:=trunc(round((60+(60*ord(node_count_weighted(node)<15)))/max(node_count_weighted(node),1)))
        else
{$endif i386}
          number_unrolls:=trunc(round((30+(60*ord(node_count_weighted(node)<15)))/max(node_count_weighted(node),1)));
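
For illustration, here is a small standalone sketch of that formula; ABodyWeight is a hypothetical stand-in for node_count_weighted(node), which is only available inside the compiler:
Code: Pascal
program UnrollFormulaDemo;
{$MODE OBJFPC}
uses
  Math;

{ Same shape as the non-i386 branch above: bodies with a weighted node
  count below 15 get an extra 60 added, and the result shrinks as the
  body grows. }
function NumberUnrolls(ABodyWeight: Integer): Integer;
begin
  Result := Trunc(Round((30 + (60 * Ord(ABodyWeight < 15))) / Max(ABodyWeight, 1)));
end;

const
  Weights: array[0..5] of Integer = (1, 5, 14, 15, 30, 60);
var
  i: Integer;
begin
  for i := Low(Weights) to High(Weights) do
    Writeln('body weight ', Weights[i]:3, ' -> ', NumberUnrolls(Weights[i]), ' unrolls');
end.
With a trivial body (weight 1) the non-i386 branch alone already allows up to 90 copies, so the limit is clearly not a fixed 60.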
« Last Edit: July 27, 2020, 11:50:33 am by Bi0T1N »

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Censorship about opinions does not belong here.
Re: Loop unrolling
« Reply #20 on: July 27, 2020, 12:49:07 pm »
No, because loop unrolling is algorithmic and not CPU dependent.
IOW it should be beneficial on any CPU, up to a certain point which is deterministic.
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5481
  • Compiler Developer
Re: Loop unrolling
« Reply #21 on: July 28, 2020, 10:06:30 am »
@Thaddy: as you can see in the code quoted above, the amount of unrolling depends on the pipeline length of the CPU (together with the complexity of the code), so specifying different CPU types might improve things.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Loop unrolling
« Reply #22 on: July 28, 2020, 03:32:31 pm »
This is my measuring program for i386 (you need to extract uRdtsc.pas from the .zip for timings):
Code: Pascal
program pgmLoopUnroll;  { Build -O3 }

{ https://forum.lazarus.freepascal.org/index.php/topic,50747.0.html }

{$APPTYPE CONSOLE}
{$MODE OBJFPC}

uses SysUtils, uRdtsc;

procedure Print; { inline; }
const
  NumberOfCalls: Integer = 1;
begin
  Writeln(NumberOfCalls);
  Inc(NumberOfCalls);
end;

const
  cIter : integer = 100;
var
  Index: SizeInt;
  vIter: integer;
  vStart, vStop, vRunningTotal: qWord; { RDTSC catches }
  vName : String;
{$DEFINE DoUnroll}
label
  _loop;
begin
  CheckCpuSpeed;
  {$IFDEF DoUnroll}            // Timings in cpu ticks for 180 loops
    {$OPTIMIZATION LOOPUNROLL}
    vName :=  'LOOPUNROLL';    // 110739721 104781950 102358372  98791659
  {$ELSE}
    vName :=  'NO LOOPUNROLL'; // 106443809  99132142  98925764 114610215
  {$ENDIF}
  vRunningTotal := 0;
  vIter := 0;
_loop:
    Sleep(0);
    vStart := CPUTickStamp;
    for Index := 1 to 180 do
      Print;
    vStop := CPUTickStamp;
    vRunningTotal := vRunningTotal + vStop - vStart - CPUTickStampCost;
    inc(vIter);
{while vIter<cIter}
    if vIter<cIter then {loop}
      goto _loop;
  WriteLn(vName, ' Average "for index := 1 to 180 :', Round(vRunningTotal / cIter));
  Readln;
end.
Conclusions: combinations of inline / loopunroll make no noticeable difference in execution speed. loopunroll fattens the .exe, and inlining Print even more so. All the time cost is really in the WriteLn, so this is not a valid benchmark for loopunroll.
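
For what it's worth, a minimal sketch of an I/O-free variant (using SysUtils.GetTickCount64 instead of the uRdtsc helpers, and repeating the outer loop so the coarse millisecond timer still shows something) in which the loop overhead itself would dominate:
Code: Pascal
program pgmLoopUnrollNoIO; { a sketch only, not a validated benchmark }
{$MODE OBJFPC}

uses
  SysUtils;

var
  Accumulator: QWord = 0;

{ Pure arithmetic body: cheap enough that the per-iteration loop
  overhead is a visible part of the total cost. }
procedure Work; inline;
begin
  Accumulator := Accumulator + (Accumulator shr 3) + 1;
end;

var
  Index, Rep: SizeInt;
  StartMs: QWord;
begin
  {$OPTIMIZATION LOOPUNROLL}
  StartMs := GetTickCount64;
  for Rep := 1 to 1000000 do
    for Index := 1 to 180 do
      Work;
  Writeln('Elapsed ms: ', GetTickCount64 - StartMs,
          ' (accumulator = ', Accumulator, ')');
end.
Even then, whether the unrolled version wins depends on the points discussed below (uop cache pressure, and whether the body contains calls or branches).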

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #23 on: July 28, 2020, 04:30:27 pm »
@Thaddy: as you can see in the code quoted above, the amount of unrolling depends on the pipeline length of the CPU (together with the complexity of the code), so specifying different CPU types might improve things.

Unrolling unnecessarily might frustrate the uop cache (present since, IIRC, Sandy Bridge), which means that modern CPUs can then only dispatch 4 instead of 6 instructions per cycle.



Bi0T1N

  • Jr. Member
  • **
  • Posts: 85
Re: Loop unrolling
« Reply #24 on: July 28, 2020, 09:01:54 pm »
No, because loop unrolling is algorithmic and not CPU dependent.
You should also read my post when replying. It depends on i386 (-> 32-bit) and additionally checks if it's compiled for a Pentium 4 CPU (cpu_Pentium4).

And every architecture has its own pipeline length (see below):
Microarchitecture        Pipeline stages
Intel
  P5 (Pentium)           5
  P6 (Pentium 3)         10
  P6 (Pentium Pro)       14
  NetBurst (Willamette)  20
  NetBurst (Northwood)   20
  NetBurst (Prescott)    31
  NetBurst (Cedar Mill)  31
  Core                   14
  Bonnell                16
  Sandy Bridge           14
  Silvermont             14 to 17
  Haswell                14
  Skylake                14
  Kaby Lake              14
AMD
  Zen                    19
  Zen 2                  19
ARM
  ARM up to 7            3
  ARM 8-9                5
  ARM 11                 8
  Cortex A7              8-10
  Cortex A8              13
  Cortex A15             15-25
Source: How long is a typical modern microprocessor pipeline? @ stackexchange
Seems like there is a bigger variance for older architectures but the trend is to have something between 14 and 25 stages in modern architectures.

However, the microarchitecture document from Agner Fog says that for the Sandy Bridge and Ivy Bridge pipelines unnecessary loop unrolling should be avoided. In the chapter "Bottlenecks in Skylake and other Lakes" he explicitly states:
Quote from: Agner Fog
The μop cache is efficient for loops of up to approximately a thousand instructions. It is important to economize the use of the μop cache in CPU-intensive code. The difference in performance between loops that fit into the μop cache and loops that do not can be quite significant if the average instruction length is more than four bytes. Avoid unnecessary loop unrolling. The μop cache has the same weaknesses as earlier processors.
Therefore it's not really deterministic nor easy to improve, as it seems to depend highly on the loop code.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #25 on: July 28, 2020, 09:57:03 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Loop unrolling
« Reply #26 on: July 29, 2020, 04:48:49 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

It's been some time since I looked into loop unrolling, but I seem to remember that getting this right is really involved due to several influencing parameters.

1. usually it does not help if the unrolled loops generate more uops than the architecture can keep "in-flight" <= now we are talking about 200-300 (on the latest Intel & AMD generations IIRC)
2. if there is a taken "call" inside the rolled loop then it usually also does not help to unroll <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled
3. there is also the number of branches inside a loop that influences loop efficiency <= if a rolled-loop contains a branch then the unrolling should not extend beyond certain limits of branches in the unrolled loop.

Those are the ones that immediately sprang to mind, but there were more, as usual, and exceptions to the above, unavoidably it seems.

Kind regards,
Jens

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #27 on: July 29, 2020, 06:00:12 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

It's been some time since I looked into loop unrolling, but I seem to remember that getting this right is really involved due to several influencing parameters.

1. usually it does not help if the unrolled loops generate more uops than the architecture can keep "in-flight" <= now we are talking about 200-300 (on the latest Intel & AMD generations IIRC)

What is in flight? Worked on simultaneously in the pipeline? Sure, that is probably the bigger factor, but not the only one. As said, the uop caches are much larger (1500 for Skylake, more for Ice Lake and Ryzen/Zen+ and Ryzen/Zen 2).

These are also good for
- longer instructions (most instruction decoders are limited to 16 bytes worth of instructions per cycle, and e.g. SSE instructions are quite long)
- very short, parallelizable instructions (the uop cache can emit 6 uops to the backend instead of 4 from the decoders)

Quote
2. if there is a taken "call" inside the rolled loop then it usually also does not help to unroll <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled
3. there is also the number of branches inside a loop that influences loop efficiency <= if a rolled-loop contains a branch then the unrolling should not extend beyond certain limits of branches in the unrolled loop.

Here my experience nosedives. I got the bit of knowledge that I have from crafting (SSE/AVX) image operations and analysing them with IACA, mostly basing myself on sources that implement kernel operations in SSE, where these usually don't happen.


MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Loop unrolling
« Reply #28 on: July 30, 2020, 02:55:15 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

It's been some time since I looked into loop unrolling, but I seem to remember that getting this right is really involved due to several influencing parameters.

1. usually it does not help if the unrolled loops generate more uops than the architecture can keep "in-flight" <= now we are talking about 200-300 (on the latest Intel & AMD generations IIRC)

What is in flight? Worked on simultaneously in the pipeline? Sure, that is probably the bigger factor, but not the only one. As said, the uop caches are much larger (1500 for Skylake, more for Ice Lake and Ryzen/Zen+ and Ryzen/Zen 2).

These are also good for
- longer instructions (most instruction decoders are limited to 16 bytes worth of instructions per cycle, and e.g. SSE instructions are quite long)
- very short, parallelizable instructions (the uop cache can emit 6 uops to the backend instead of 4 from the decoders)

Yes - "in-flight" are instructions that have been forwarded to the schedulers but have not yet been retired by the retirement unit. And yes, the uop cache can forward 6 (or even more) uops to the scheduler, but in the majority of cases it is the retirement unit that determines overall throughput. The retirement unit (at least up to and including the Skylake architecture) can only retire 4 uops per cycle (with very few exceptions like reg-reg moves that can bypass the execution units completely and be handled via the register renaming unit <= but only as long as one hasn't exhausted the register renaming buffers - 168 on Skylake). Zen 2 has widened the retirement unit and is capable of retiring 5 uops per cycle (but again not generally across the board).

Quote
2. if there is a taken "call" inside the rolled loop then it usually also does not help to unroll <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled
3. there is also the number of branches inside a loop that influences loop efficiency <= if a rolled-loop contains a branch then the unrolling should not extend beyond certain limits of branches in the unrolled loop.

Here my experience nosedives. I got the bit of knowledge that I have from crafting (SSE/AVX) image operations and analysing them with IACA, mostly basing myself on sources that implement kernel operations in SSE, where these usually don't happen.

You got exactly those types of core-loops that are essentially good to unroll with high gain :-)

I'd love to discuss this further but I am not sure if I would stay in the limits of this specific sub-board (or the Lazarus/FPC bulletin board in general)?

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #29 on: July 31, 2020, 05:22:35 pm »
The retirement unit (at least up to and including the Skylake architecture) can only retire 4 uops per cycle (with very few exceptions like reg-reg moves that can bypass the execution units completely and be handled via the register renaming unit <= but only as long as one hasn't exhausted the register renaming buffers - 168 on Skylake). Zen 2 has widened the retirement unit and is capable of retiring 5 uops per cycle (but again not generally across the board).

Yup, but e.g. many SIMD instructions are in the 4-7 byte range, and Skylake can only fetch 16 bytes worth of instructions per cycle (and that's for both SMT threads). (Worse, I saw FPC emit alignment bytes for odd-numbered instructions; I wonder why, if they are not branch targets.)

Quote
Quote
Here my experience nosedives. I got the bit of knowledge that I have from crafting (SSE/AVX) image operations and analysing them with IACA, mostly basing myself on sources that implement kernel operations in SSE, where these usually don't happen.

You got exactly those types of core-loops that are essentially good to unroll with high gain :-)

I'd love to discuss this further but I am not sure if I would stay in the limits of this specific sub-board (or the Lazarus/FPC bulletin board in general)?

Crafting asm is as much general programming as anything else, and assembler is an integral part of FPC, so I see no problem.


 
