IIRC the uop caches in recent CPUs are larger (around 2000 entries on Zen+, 4000 on Zen 2, 2500 on Ice Lake). As said, the consequence of invalidating the uop cache is that the frontend issues instructions to the backend at a reduced rate.
It's been some time since I looked into loop unrolling, but I seem to remember that getting it right is really involved, because several parameters influence the result.
1. Usually it does not help if the unrolled loop generates more uops than the architecture can keep "in flight" <= nowadays we are talking about 200-300 (on the latest Intel & AMD generations, IIRC).
2. If there is a taken "call" inside the rolled loop, then unrolling usually does not help either <= so the example with a "Write(Ln)" in the loop shouldn't be unrolled.
3. The number of branches inside a loop also influences loop efficiency <= if the rolled loop contains a branch, then unrolling should stop before the unrolled loop exceeds a certain number of branches.
Those are the ones that immediately sprang to mind, but there were more, as usual, and exceptions to the above, unavoidably it seems.
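
To make the three rules of thumb above a bit more concrete, here is a minimal sketch in C of how a compiler might pick an unroll factor from them. Everything in it is an assumption for illustration: the LoopInfo struct, the choose_unroll_factor function and both limits are hypothetical, the uop limit is just the low end of the 200-300 range mentioned above, and the branch limit is an arbitrary placeholder. Real limits would have to be measured per microarchitecture.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one loop body; the fields are illustrative only. */
typedef struct {
    size_t uops_per_iteration;     /* estimated uops for one rolled iteration */
    size_t branches_per_iteration; /* branches inside the body, back-edge excluded */
    bool   contains_call;          /* body contains a call that is taken */
} LoopInfo;

/* Assumed limits: placeholders, not measured values. */
enum {
    MAX_INFLIGHT_UOPS     = 200, /* low end of the 200-300 range quoted above */
    MAX_UNROLLED_BRANCHES = 4    /* arbitrary assumption */
};

/*
 * Returns the largest power-of-two unroll factor up to max_factor that
 * respects the three rules of thumb. A result of 1 means "do not unroll".
 */
size_t choose_unroll_factor(const LoopInfo *loop, size_t max_factor)
{
    /* Rule 2: a taken call inside the loop usually makes unrolling pointless. */
    if (loop->contains_call)
        return 1;

    size_t factor = 1;
    while (factor * 2 <= max_factor) {
        size_t next = factor * 2;

        /* Rule 1: the unrolled body should still fit in the in-flight window. */
        if (next * loop->uops_per_iteration > MAX_INFLIGHT_UOPS)
            break;

        /* Rule 3: do not exceed the branch budget of the unrolled body. */
        if (next * loop->branches_per_iteration > MAX_UNROLLED_BRANCHES)
            break;

        factor = next;
    }
    return factor;
}
```

With these assumed limits, a call-free body of 10 uops and no inner branch would be unrolled up to the requested maximum of 8, while the same body with one inner branch would stop at 4, and a body containing something like "Write(Ln)" would stay rolled.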
Kind regards,
Jens