
Author Topic: Loop unrolling  (Read 5838 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #15 on: July 26, 2020, 01:28:01 pm »
I assume it unrolls with some complexity heuristic that is inspired by the uop cache size. Unrolling loops to larger than the uop cache size can be detrimental.

ASerge

  • Hero Member
  • *****
  • Posts: 2242
Re: Loop unrolling
« Reply #16 on: July 26, 2020, 08:07:51 pm »
My question has not been answered yet: is that limit of 60 configurable?  :)
And 60 is not the limit. In this program (Lazarus 2.0.10 x64, Windows) the loop is unrolled 180 times!
Code: Pascal
{$APPTYPE CONSOLE}
{$MODE OBJFPC}
procedure Print;
const
  NumberOfCalls: Integer = 1;
begin
  Writeln(NumberOfCalls);
  Inc(NumberOfCalls);
end;

var
  Index: SizeInt;
begin
  {$OPTIMIZATION LOOPUNROLL}
  for Index := 1 to 180 do
    Print;
  Readln;
end.
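
If you want to check the unroll count yourself, one way (a sketch; the flags used in the comment are the usual FPC 3.x command-line options, assumed here rather than taken from this thread) is to keep the generated assembler file and count the emitted calls:
Code: Pascal
{ Sketch: compile with
    fpc -O3 -OoLOOPUNROLL -a -al unrolltest.pas
  (-a keeps the generated .s file, -al interleaves the source lines),
  then open unrolltest.s and count the call instructions to Print
  (its mangled name contains PRINT). }
{$MODE OBJFPC}

procedure Print;
const
  NumberOfCalls: Integer = 1;
begin
  Writeln(NumberOfCalls);
  Inc(NumberOfCalls);
end;

var
  Index: SizeInt;
begin
  {$OPTIMIZATION LOOPUNROLL}
  for Index := 1 to 180 do
    Print;
end.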

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 9870
  • Debugger - SynEdit - and more
    • wiki
Re: Loop unrolling
« Reply #17 on: July 26, 2020, 09:40:53 pm »
And 60 is not the limit. In this program (Lazarus 2.0.10 x64, Windows) the loop is unrolled 180 times!
Does it perform better when unrolled?

I mean, 180 copies of the call must require some additional cache lines to be loaded?

I know the call goes to other code that is in a different cache line anyway, but I would expect that, if the callee is small enough, the caller's cache line would still be there on return, or not?

PascalDragon

  • Hero Member
  • *****
  • Posts: 5481
  • Compiler Developer
Re: Loop unrolling
« Reply #18 on: July 26, 2020, 10:47:32 pm »
The number of unrolls is determined by the complexity of the loop's body. See optloop.number_unrolls.

Bi0T1N

  • Jr. Member
  • **
  • Posts: 85
Re: Loop unrolling
« Reply #19 on: July 27, 2020, 11:48:43 am »
The number of unrolls is determined by the complexity of the loop's body. See optloop.number_unrolls.
Wouldn't it make sense to update the code to support modern CPUs (x64) as well?
Code: Pascal
{$ifdef i386}
        { multiply by 2 for CPUs with a long pipeline }
        if current_settings.optimizecputype in [cpu_Pentium4] then
          number_unrolls:=trunc(round((60+(60*ord(node_count_weighted(node)<15)))/max(node_count_weighted(node),1)))
        else
{$endif i386}
          number_unrolls:=trunc(round((30+(60*ord(node_count_weighted(node)<15)))/max(node_count_weighted(node),1)));
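
For illustration, here is a small standalone sketch of that formula; ABodyWeight is a hypothetical stand-in for node_count_weighted(node), which is only available inside the compiler:
Code: Pascal
program UnrollFormulaDemo;
{$MODE OBJFPC}
uses
  Math;

{ Same shape as the non-i386 branch above: bodies with a weighted node
  count below 15 get an extra 60 added, and the result shrinks as the
  body grows. }
function NumberUnrolls(ABodyWeight: Integer): Integer;
begin
  Result := Trunc(Round((30 + (60 * Ord(ABodyWeight < 15))) / Max(ABodyWeight, 1)));
end;

const
  Weights: array[0..5] of Integer = (1, 5, 14, 15, 30, 60);
var
  i: Integer;
begin
  for i := Low(Weights) to High(Weights) do
    Writeln('body weight ', Weights[i]:3, ' -> ', NumberUnrolls(Weights[i]), ' unrolls');
end.
With a trivial body (weight 1) the non-i386 branch alone already allows up to 90 copies, so the limit is clearly not a fixed 60.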
« Last Edit: July 27, 2020, 11:50:33 am by Bi0T1N »

Thaddy

  • Hero Member
  • *****
  • Posts: 14373
  • Censorship about opinions does not belong here.
Re: Loop unrolling
« Reply #20 on: July 27, 2020, 12:49:07 pm »
No, because loop unrolling is algorithmic and not CPU dependent.
IOW it should be beneficial on any CPU, up to a certain point which is deterministic.
Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5481
  • Compiler Developer
Re: Loop unrolling
« Reply #21 on: July 28, 2020, 10:06:30 am »
@Thaddy: as you can see in the code quoted above, the amount of unrolling depends on the pipeline length of the CPU (together with the complexity of the code), so specifying different CPU types might improve things.

BrunoK

  • Sr. Member
  • ****
  • Posts: 452
  • Retired programmer
Re: Loop unrolling
« Reply #22 on: July 28, 2020, 03:32:31 pm »
This is my measuring program for i386 (you need to extract uRdtsc.pas from the .zip for timings):
Code: Pascal
program pgmLoopUnroll;  { Build -O3 }

{ https://forum.lazarus.freepascal.org/index.php/topic,50747.0.html }

{$APPTYPE CONSOLE}
{$MODE OBJFPC}

uses SysUtils, uRdtsc;

procedure Print; { inline; }
const
  NumberOfCalls: Integer = 1;
begin
  Writeln(NumberOfCalls);
  Inc(NumberOfCalls);
end;

const
  cIter : integer = 100;
var
  Index: SizeInt;
  vIter: integer;
  vStart, vStop, vRunningTotal: qWord; { RDTSC catches }
  vName : String;
{$DEFINE DoUnroll}
label
  _loop;
begin
  CheckCpuSpeed;
  {$IFDEF DoUnroll}            // Timings in cpu ticks for 180 loops
    {$OPTIMIZATION LOOPUNROLL}
    vName :=  'LOOPUNROLL';    // 110739721 104781950 102358372  98791659
  {$ELSE}
    vName :=  'NO LOOPUNROLL'; // 106443809  99132142  98925764 114610215
  {$ENDIF}
  vRunningTotal := 0;
  vIter := 0;
_loop:
    Sleep(0);
    vStart := CPUTickStamp;
    for Index := 1 to 180 do
      Print;
    vStop := CPUTickStamp;
    vRunningTotal := vRunningTotal + vStop - vStart - CPUTickStampCost;
    inc(vIter);
{while vIter<cIter}
    if vIter<cIter then {loop}
      goto _loop;
  WriteLn(vName, ' Average "for index := 1 to 180 :', Round(vRunningTotal / cIter));
  Readln;
end.
Conclusions: combinations of inline / loopunroll make no noticeable difference in execution speed. loopunroll fattens the .exe, and inlining Print even more so. All the time cost is really in the WriteLn, so this is not a valid benchmark for loopunroll.
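
For what it's worth, a minimal sketch of an I/O-free variant (using SysUtils.GetTickCount64 instead of the uRdtsc helpers, and repeating the outer loop so the coarse millisecond timer still shows something) in which the loop overhead itself would dominate:
Code: Pascal
program pgmLoopUnrollNoIO; { a sketch only, not a validated benchmark }
{$MODE OBJFPC}

uses
  SysUtils;

var
  Accumulator: QWord = 0;

{ Pure arithmetic body: cheap enough that the per-iteration loop
  overhead is a visible part of the total cost. }
procedure Work; inline;
begin
  Accumulator := Accumulator + (Accumulator shr 3) + 1;
end;

var
  Index, Rep: SizeInt;
  StartMs: QWord;
begin
  {$OPTIMIZATION LOOPUNROLL}
  StartMs := GetTickCount64;
  for Rep := 1 to 1000000 do
    for Index := 1 to 180 do
      Work;
  Writeln('Elapsed ms: ', GetTickCount64 - StartMs,
          ' (accumulator = ', Accumulator, ')');
end.
Even then, whether the unrolled version wins depends on the points discussed below (uop cache pressure, and whether the body contains calls or branches).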

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #23 on: July 28, 2020, 04:30:27 pm »
@Thaddy: as you can see in the code quoted above, the amount of unrolling depends on the pipeline length of the CPU (together with the complexity of the code), so specifying different CPU types might improve things.

Unrolling unnecessarily might frustrate the uop cache (present since, IIRC, Sandy Bridge), which means that modern CPUs can then only dispatch 4 instead of 6 instructions per cycle.



Bi0T1N

  • Jr. Member
  • **
  • Posts: 85
Re: Loop unrolling
« Reply #24 on: July 28, 2020, 09:01:54 pm »
No, because loop unrolling is algorithmic and not CPU dependent.
You should also read my post when replying. It depends on i386 (-> 32-bit) and additionally checks if it's compiled for a Pentium 4 CPU (cpu_Pentium4).

And every architecture has its own pipeline length (see below):
Microarchitecture        Pipeline stages
Intel
  P5 (Pentium)           5
  P6 (Pentium 3)         10
  P6 (Pentium Pro)       14
  NetBurst (Willamette)  20
  NetBurst (Northwood)   20
  NetBurst (Prescott)    31
  NetBurst (Cedar Mill)  31
  Core                   14
  Bonnell                16
  Sandy Bridge           14
  Silvermont             14 to 17
  Haswell                14
  Skylake                14
  Kaby Lake              14
AMD
  Zen                    19
  Zen 2                  19
ARM
  ARM up to 7            3
  ARM 8-9                5
  ARM 11                 8
  Cortex A7              8-10
  Cortex A8              13
  Cortex A15             15-25
Source: How long is a typical modern microprocessor pipeline? @ stackexchange
Seems like there is a bigger variance for older architectures but the trend is to have something between 14 and 25 stages in modern architectures.

However, the microarchitecture document from Agner Fog says that for the Sandy Bridge and Ivy Bridge pipelines unnecessary loop unrolling should be avoided. In the chapter "Bottlenecks in Skylake and other Lakes" he explicitly states:
Quote from: Agner Fog
The μop cache is efficient for loops of up to approximately a thousand instructions. It is important to economize the use of the μop cache in CPU-intensive code. The difference in performance between loops that fit into the μop cache and loops that do not can be quite significant if the average instruction length is more than four bytes. Avoid unnecessary loop unrolling. The μop cache has the same weaknesses as earlier processors.
Therefore it's not really deterministic nor easy to improve, as it seems to depend highly on the loop code.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #25 on: July 28, 2020, 09:57:03 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Loop unrolling
« Reply #26 on: July 29, 2020, 04:48:49 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

It's been some time since I looked into loop unrolling, but I seem to remember that getting this right is really involved due to several influencing parameters.

1. usually it does not help if the unrolled loops generate more uops than the architecture can keep "in-flight" <= now we are talking about 200-300 (on the latest Intel & AMD generations IIRC)
2. if there is a taken "call" inside the rolled loop then it usually also does not help to unroll <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled
3. there is also the number of branches inside a loop that influences loop efficiency <= if a rolled-loop contains a branch then the unrolling should not extend beyond certain limits of branches in the unrolled loop.

Those are the ones that immediately sprang to mind, but there were more, as usual, and exceptions to the above, unavoidably it seems.

Kind regards,
Jens

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #27 on: July 29, 2020, 06:00:12 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

It's been some time since I looked into loop unrolling, but I seem to remember that getting this right is really involved due to several influencing parameters.

1. usually it does not help if the unrolled loops generate more uops than the architecture can keep "in-flight" <= now we are talking about 200-300 (on the latest Intel & AMD generations IIRC)

What is in flight? Worked on simultaneously in the pipeline? Sure, that is probably the bigger factor, but not the only one. As said, the uop caches are much larger (1500 for Skylake, more for Ice Lake and Ryzen/Zen+ and Ryzen/Zen 2).

These are also good for
- longer instructions (most instruction decoders are limited to 16 bytes worth of instructions per cycle, and e.g. SSE instructions are quite long)
- very short, parallelizable instructions (the uop cache can emit 6 uops to the backend instead of 4 from the decoders)

Quote
2. if there is a taken "call" inside the rolled loop then it usually also does not help to unroll <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled
3. there is also the number of branches inside a loop that influences loop efficiency <= if a rolled-loop contains a branch then the unrolling should not extend beyond certain limits of branches in the unrolled loop.

Here my experience nosedives. I got the bit of knowledge that I have from crafting (SSE/AVX) image operations and analysing them with IACA, mostly basing myself on sources that implement kernel operations in SSE, where these usually don't happen.


MathMan

  • Sr. Member
  • ****
  • Posts: 325
Re: Loop unrolling
« Reply #28 on: July 30, 2020, 02:55:15 pm »
IIRC the uop caches in recent CPUs are larger (Zen+ around 2000 entries, Zen 2 around 4000, Ice Lake around 2500). As said, the consequence of invalidating the uop cache is reduced issuing of instructions from the frontend to the backend.

It's been some time since I looked into loop unrolling, but I seem to remember that getting this right is really involved due to several influencing parameters.

1. usually it does not help if the unrolled loops generate more uops than the architecture can keep "in-flight" <= now we are talking about 200-300 (on the latest Intel & AMD generations IIRC)

What is in flight? Worked on simultaneously in the pipeline? Sure, that is probably the bigger factor, but not the only one. As said, the uop caches are much larger (1500 for Skylake, more for Ice Lake and Ryzen/Zen+ and Ryzen/Zen 2).

These are also good for
- longer instructions (most instruction decoders are limited to 16 bytes worth of instructions per cycle, and e.g. SSE instructions are quite long)
- very short, parallelizable instructions (the uop cache can emit 6 uops to the backend instead of 4 from the decoders)

Yes - "in-flight" are instructions that have been forwarded to the schedulers but have not yet been retired by the retirement unit. And yes, the uop cache can forward 6 (or even more) uops to the scheduler, but in the majority of cases it is the retirement unit that determines overall throughput. The retirement unit (at least up to and including the Skylake architecture) can only retire 4 uops per cycle (with very few exceptions like reg-reg moves that can bypass the execution units completely and be handled via the register renaming unit <= but only as long as one hasn't exhausted the register renaming buffers - 168 on Skylake). Zen 2 has widened the retirement unit and is capable of retiring 5 uops per cycle (but again not generally across the board).

Quote
2. if there is a taken "call" inside the rolled loop then it usually also does not help to unroll <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled
3. there is also the number of branches inside a loop that influences loop efficiency <= if a rolled-loop contains a branch then the unrolling should not extend beyond certain limits of branches in the unrolled loop.

Here my experience nosedives. I got the bit of knowledge that I have from crafting (SSE/AVX) image operations and analysing them with IACA, mostly basing myself on sources that implement kernel operations in SSE, where these usually don't happen.

You got exactly those types of core-loops that are essentially good to unroll with high gain :-)

I'd love to discuss this further but I am not sure if I would stay in the limits of this specific sub-board (or the Lazarus/FPC bulletin board in general)?

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11453
  • FPC developer.
Re: Loop unrolling
« Reply #29 on: July 31, 2020, 05:22:35 pm »
The retirement unit (at least up to and including the Skylake architecture) can only retire 4 uops per cycle (with very few exceptions like reg-reg moves that can bypass the execution units completely and be handled via the register renaming unit <= but only as long as one hasn't exhausted the register renaming buffers - 168 on Skylake). Zen 2 has widened the retirement unit and is capable of retiring 5 uops per cycle (but again not generally across the board).

Yup, but e.g. many SIMD instructions are in the 4-7 byte range, and Skylake can only fetch 16 bytes worth of instructions per cycle (and that's for both SMT threads). (Worse, I saw FPC emit alignment bytes for odd-numbered instructions; I wonder why, if they are not branch targets.)

Quote
Quote
Here my experience nosedives. I got the bit of knowledge that I have from crafting (SSE/AVX) image operations and analysing them with IACA, mostly basing myself on sources that implement kernel operations in SSE, where these usually don't happen.

You got exactly those types of core-loops that are essentially good to unroll with high gain :-)

I'd love to discuss this further but I am not sure if I would stay in the limits of this specific sub-board (or the Lazarus/FPC bulletin board in general)?

Crafting asm is as much general programming as anything else, and assembler is an integral part of FPC, so I see no problem.


 
