I posted a message in General lamenting that the generated ARM assembler source does not appear to get some pretty obvious optimizations. Pascal Dragon suggested I post this as a "feature request" and accused @FPK or @Gareth of possibly being interested in such.
The post is:
https://forum.lazarus.freepascal.org/index.php/topic,60720.msg455185.html#msg455185
Going forward I'm focused on 32-bit, but eventually 64-bit, ARM (Rasp Pi to be specific).
Do I need optimized code? Probably not.
Do I want optimized code? Duh! Better, bigger, faster, more. Always!
In all seriousness, there are some little itsy-bitsy teeny-weeny cases where it would be nice either to sample two things close together or to reduce the senescence between them: a 25 Hz measurement that already has 20 - 40 ms of delay is being merged with a 1000 Hz measurement whose delay is not yet determined.
(In this manner the fast noisy source being corrected with a faster, noisier source will be coherenter. That's a word, believe me. We take coherentism seriously here.)
Anyway, the code (as I show in the link above) irked me by not taking advantage of some easy wins. It shows:
- constants being re-loaded when not needed
- the new address being loaded into the indexing register each time, where perhaps the register could just be bumped (not that I'm sure this is actually faster).
etc.
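To make those two points concrete, here is a made-up loop (NOT the code from the linked post; `Scale`, `TSamples`, `SumIndexed` and `SumBumped` are names I invented for illustration). The first function is the plain indexed form, where a naive code generator may recompute the element address and re-load the constant on every pass. The second is a hand-written sketch of what "bumping" looks like at the source level: one pointer, incremented each iteration. Whether either form is actually faster on a given ARM core is an assumption I have not measured.

```pascal
program BumpDemo;
{$mode objfpc}

const
  N = 8;
  Scale = 3;  // hypothetical constant; ideally loaded into a register once

type
  TSamples = array[0..N - 1] of LongInt;

// Indexed form: each iteration addresses A[i] from the base plus the index.
function SumIndexed(const A: TSamples): LongInt;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to N - 1 do
    Result := Result + A[i] * Scale;
end;

// Pointer-bump form: take the address once, then step the pointer.
function SumBumped(const A: TSamples): LongInt;
var
  p: PLongInt;
  i: Integer;
begin
  Result := 0;
  p := @A[0];
  for i := 0 to N - 1 do
  begin
    Result := Result + p^ * Scale;
    Inc(p);  // advances by SizeOf(LongInt)
  end;
end;

var
  S: TSamples;
  i: Integer;
begin
  for i := 0 to N - 1 do
    S[i] := i + 1;
  // Both forms should print the same sum.
  WriteLn(SumIndexed(S), ' ', SumBumped(S));
end.
```

The wish, of course, is for the optimizer to get from the first form to something like the second on its own, so nobody has to write pointer code by hand.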
Of course my assembler days were all CISC and easy...