
Author Topic: Loop unrolling  (Read 1920 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 10293
Re: Loop unrolling
« Reply #30 on: July 31, 2020, 09:14:57 pm »
@Thaddy: as you can see in the code quoted above the amount of unrolling depends on the pipeline length of the CPU (together with the complexity of the code), specifying different CPU types might improve things.
That's why I wrote deterministic. Once a certain CPU is given, you can determine which factor is best.
I am more like donkey than shrek

PascalDragon

  • Hero Member
  • *****
  • Posts: 1958
  • Compiler Developer
Re: Loop unrolling
« Reply #31 on: August 01, 2020, 10:45:26 am »
You wrote:

No, because loop unrolling is algorithmic and not cpu dependent.
IOW it should be benificial on any cpu, up to a certain point which is determenistic.

These two sentences contain both "not cpu dependent" and "benificial on any cpu", which leads one to believe that you assume there is a single upper limit for all CPUs.

MathMan

  • Full Member
  • ***
  • Posts: 205
Re: Loop unrolling
« Reply #32 on: August 02, 2020, 11:50:45 am »
Crafting asm is as much general programming as anything else, and assembler is an integral part of FPC, so I see no problem.

Fine by me. Before I start a deeper dive I should admit that my hands-on experience stops at Intel's x86-64 Skylake architecture. So there might be changes in later architectures from Intel or AMD (or in the realm of ARM, MIPS, SPARC, etc., which I don't know) that modify the following slightly, but I do not expect general deviations. Some statements relate to "hand-crafted" loop unrolling, but I try to keep the compiler perspective in mind - especially the issue of single-pass compilation as used in FPC.

From an outside view one has two targets that can be addressed with loop unrolling:

A - reduction of loop overhead cost
B - increase utilization of CPU resources

From the Pascal language perspective there are two different types of loops:

1 - loops with iteration counts known at compile time
2 - loops with variable iteration counts, known at run time only

Let's start with A. The obvious overhead reduction is the elimination of branches when unrolling. Clearly the improvement in execution speed depends on the ratio of work done in the loop body to the time required for the condition check and branch. Silly example:

Code: [Select]
for Count1 := 1 to 100 do begin
  Inc( Count2, 5 );
end;
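To make the trade-off concrete, here is a minimal sketch (in C rather than Pascal, purely for illustration - the function names are mine, not FPC output) of the rolled loop next to a by-4 unrolled version: the same arithmetic work, but a quarter of the compare-and-branch overhead.

```c
/* Rolled form: 100 increments and 100 compare-and-branch pairs. */
int rolled(void) {
    int count2 = 0;
    for (int count1 = 1; count1 <= 100; count1++)
        count2 += 5;
    return count2;
}

/* Unrolled by 4: the same 100 increments, but only 25
 * compare-and-branch pairs - the loop overhead shrinks to a quarter. */
int unrolled(void) {
    int count2 = 0;
    for (int count1 = 1; count1 <= 100; count1 += 4) {
        count2 += 5;
        count2 += 5;
        count2 += 5;
        count2 += 5;
    }
    return count2;
}
```

Both return 500; note that the unroll factor 4 divides the trip count 100 evenly here, which is what makes compile-time-known counts the easy case.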

The above will obviously benefit heavily from loop unrolling - though the unroll factor will vary with the optimization level used for compilation. If one hand-crafts unrolled loops, another gain is the elimination of condition-code handling like

Code: [Select]
Loop:
  shr al, 1
  adc rbx, [mem]
  setc al

  add mem, 8
  dec rcx
  jne loop

which could be changed to

Code: [Select]
Loop:
  shr al, 1
  adc rbx, [mem]
  adc rbx, [mem+8]
  adc rbx, [mem+16]
  adc rbx, [mem+24]
  setc al

  add mem, 32
  sub rcx, 4
  jnc loop

However, I currently fail to see how a single-pass compiler could get from the first to the second based on the Pascal source alone. So this might not be too relevant for the discussion about loop unrolling in FPC.

A third aspect is that the overhead cost of the branch can only be reduced if the branch is predicted correctly. Here the Branch Target Buffer (BTB) comes into play, which can only hold a limited number of targets for fast retrieval. A branch in this sense is either a Jcc or an indirect CALL <= the latter is, I think, quite common in object-oriented code. The BTB may be thrashed if there are too many branches inside the unrolled loop, leading to mispredictions of the main loop branch. If you read the optimization manuals from Intel or the documents from Agner Fog carefully you'll find concrete limits for the different x86-64 architectures.

I think I am going to split this here and address point B in a separate response, as this is already becoming lengthy.

MathMan

  • Full Member
  • ***
  • Posts: 205
Re: Loop unrolling
« Reply #33 on: August 02, 2020, 12:48:16 pm »
Crafting asm is as much general programming as anything else, and assembler is an integral part of FPC, so I see no problem.

From an outside view one has two targets that can be addressed with loop unrolling:

A - reduction of loop overhead cost
B - increase utilization of CPU resources

B is where things become interesting (some might argue "involved" and "complex"). At a high level a CPU consists of a front-end (instruction fetch and decode), the execution units and a back-end (retirement unit) - plus some cache, internal buffers, ... The retirement unit determines the maximum throughput of instructions through a CPU. For the architectures I am aware of it is still fewer than 8 instructions per clock cycle that can be retired. This immediately implies diminishing returns when unrolling by a factor >8! I know there are cases where this has been exceeded (e.g. in Alexander Yee's Y-Cruncher you'll find unroll factors of several thousand), but these cases are very, very rare and came about only after exhaustive analysis.

In addition, every functional unit in a CPU has its own limitations that cannot be bypassed no matter how much you unroll. To get an idea of how many such limitations there are I would urge you to take a look at Travis Downs' really nice blog @ https://travisdowns.github.io/. There is an entry titled "Performance Speed Limits" which summarizes a lot of them and how to work out which one caps a given loop.

Some takeaways for loop-unrolling in a nutshell

1: it is not necessary to unroll by factors >8, and in the majority of cases 4 already suffices <= exceptions only if your core loop contains no CALL and no Jcc, is arithmetic/logic only, and makes up a substantial part of your overall execution time. In addition, excessive unrolling can lead to instruction-cache thrashing and even reduce your throughput (IIRC this was already mentioned in this thread). To expand a bit on this: for Intel Core generations up to Sandy/Ivy Bridge an unroll factor of 4 is sufficient.

2: if you have a loop that contains a fair share of memory operations (especially memory writes) you may not gain anything from unrolling, as these dominate the execution time. Again a silly example

Code: [Select]
for Count := 1 to 100 do begin
  ByteArrA[ Count ] := ByteArrB[ Count ];
end;

You will not get below 1 cycle per byte in the above, regardless of how much you unroll - the store throughput of one byte per cycle dominates everything else. The only way to speed this up is to reduce the number of stores, which is no longer loop unrolling but vectorization.
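A hypothetical C sketch of that point (names mine, not FPC output): the byte loop issues one store per byte, while the vectorized variant issues one 8-byte store for every eight bytes - fewer stores, rather than more copies of the loop body.

```c
#include <string.h>
#include <stdint.h>

#define N 100

/* One store per byte: store throughput caps this at ~1 byte/cycle,
 * no matter the unroll factor. */
void copy_bytes(uint8_t *dst, const uint8_t *src) {
    for (int i = 0; i < N; i++)
        dst[i] = src[i];
}

/* Vectorized: one 8-byte load and one 8-byte store move 8 bytes,
 * with a byte-wise tail for the remainder. */
void copy_wide(uint8_t *dst, const uint8_t *src) {
    int i = 0;
    for (; i + 8 <= N; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, 8);  /* safe unaligned 8-byte load  */
        memcpy(dst + i, &w, 8);  /* one store instead of eight  */
    }
    for (; i < N; i++)           /* tail for N not divisible by 8 */
        dst[i] = src[i];
}
```

Both produce an identical copy; only the number of store operations differs.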

3: if you have a strong dependency chain within the loop body (or even worse across the loop iterations) then unrolling will probably also not gain you anything.
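To illustrate point 3 (a C sketch with hypothetical names): a running sum is a loop-carried dependency chain, so plain unrolling leaves every add waiting on the previous one; the unroll only pays off when combined with independent accumulators that break the chain.

```c
/* Single accumulator: every add depends on the previous sum, so the
 * add latency - not the loop overhead - bounds the throughput. */
long sum_chained(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with four independent accumulators: the chain is split
 * into four shorter ones the out-of-order core can run in parallel. */
long sum_split(const int *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++)   /* tail for n not divisible by 4 */
        s += a[i];
    return s;
}
```

Unrolling by 4 while keeping a single accumulator would change almost nothing here - which is why a compiler has to look at the dependency structure, not just the loop size.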

4: loops containing function calls should also not be unrolled, unless the function is "small" and can be inlined. Otherwise unrolling again only generates code bloat with a negative effect on the I-cache.

5: "branchy" code should not be unrolled with higher unroll factors (e.g. >=4).

Taking all of the above into account, the positive effects of loop unrolling are limited to a fairly small subset of loops: small loops (less than one, at most two dozen assembler instructions) containing no CALL and no Jcc (or indirect CALL), and doing arithmetic/logic only.

MathMan

  • Full Member
  • ***
  • Posts: 205
Re: Loop unrolling
« Reply #34 on: August 02, 2020, 01:14:55 pm »
Crafting asm is as much general programming as anything else, and assembler is an integral part of FPC, so I see no problem.

From the Pascal language perspective there are two different types of loops:

1 - loops with iteration counts known at compile time
2 - loops with variable iteration counts, known at run time only


Regarding 1 - this is relatively straightforward as long as the unrolling stays within the limits I described in the other posts. However, it requires generating either prologue or epilogue code for the unrolled loop to mop up the remaining loop iterations. E.g.

Code: [Select]
for count := 1 to 13 do begin
  ... do something ...
end;

could be translated to

Code: [Select]
for count := 1 to 3 do begin
  ... do something[ 1 ] ...
  ... do something[ 2 ] ...
  ... do something[ 3 ] ...
  ... do something[ 4 ] ...
end;

  ... do something ...

With respect to the above, 2 is the trickier case. Because the compiler does not know the number of iterations at compile time, it might choose to unroll, e.g., by a factor of 4 (by whatever heuristic). But if at run time the loop only ever executes with an iteration count of 3, execution actually slows down! This can be addressed by feedback-directed optimization (and IIRC FPC supports this to a degree), but that is a major deviation from the single-pass compile approach.
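For completeness, a C sketch (hypothetical, names mine) of what a compiler typically emits for case 2: a by-4 main loop plus an epilogue for the 0..3 leftover iterations. For a run-time count of 3 the main loop runs zero times and only the epilogue executes - exactly the scenario where the "unrolled" version buys nothing.

```c
/* Unroll-by-4 with run-time trip count n: the main loop handles the
 * multiples of 4, the epilogue mops up the remainder. */
long add_fives(long acc, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += 5;
        acc += 5;
        acc += 5;
        acc += 5;
    }
    for (; i < n; i++)   /* epilogue: 0..3 iterations */
        acc += 5;
    return acc;
}
```

The result is correct for any n (including 0), but the win over the rolled loop only materializes when n is large relative to the unroll factor.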

If I summarize A, B, 1 and 2 my personal take on loop unrolling is

* I do not favour a generic loop-unrolling directive used across a unit (or even a complete program)

  * there simply are too many cases where things can go wrong with loop unrolling <= mind, for a single-pass compiler approach

* I favour a directive that is only applied to the next loop, ideally it should be placed directly above the loop

  * the directive should allow definition of the unroll factor <= none given, loop won't get unrolled
  * a programmer should always first examine the Pascal source and the generated asm of a loop <= wishful thinking I know ;-)

Cheers,
MathMan

 
