First of all, most of the optimization potential lies in the clever design of your code.
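That said, if you know that no exceptions will occur, and you do use managed types (AnsiString / dynamic arrays), you can switch off the implicit try/finally frames that the compiler generates around them:
{$ImplicitExceptions off}
However, if an exception does occur with this setting, memory will leak. And as memory gets eaten up, the side effects will become noticeable.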
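As a rough sketch of where the directive goes (the program and the JoinNames routine are made up for illustration, they are not from any real code base):

```pascal
program ImplicitExceptionsDemo;
{$mode objfpc}{$H+}

// From here on the compiler no longer wraps routines that use managed
// types in a hidden try/finally frame.
{$ImplicitExceptions off}

// Hypothetical routine using managed types (AnsiString).
function JoinNames(const First, Last: AnsiString): AnsiString;
var
  Tmp: AnsiString; // managed local, normally finalized in the implicit finally
begin
  Tmp := First + ' ';
  // If an exception were raised here, Tmp would not be released.
  Result := Tmp + Last;
end;

begin
  WriteLn(JoinNames('Free', 'Pascal'));
end.
```

Whether this gains anything measurable depends on how often such routines are called, so it is worth benchmarking before and after.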
Avoid managed types, if you don't need them.
Choose a modern CPU type (CoreAvx or CoreI) if you don't need to support older CPUs (project options / target).
If you have a tight loop in the middle of a long(er) routine, move the loop into an (inlined) sub-routine and call it.
At least in the past, this has sometimes helped the optimizer to do a better job with register allocations.
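A minimal sketch of that idea, assuming a long routine called Process with a hot summing loop inside it (both names, and SumItems, are invented for this example):

```pascal
program InlineLoopDemo;
{$mode objfpc}
{$inline on}

type
  TIntArray = array of Integer;

// Hypothetical helper: the hot loop lives in its own small routine, so the
// register allocator only has to deal with Result, i and Data.
function SumItems(const Data: TIntArray): Int64; inline;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(Data) do
    Result := Result + Data[i];
end;

// Stand-in for the long(er) routine that used to contain the loop.
procedure Process;
var
  Data: TIntArray;
  i: Integer;
begin
  SetLength(Data, 1000);
  for i := 0 to High(Data) do
    Data[i] := i;
  // ... lots of other work would go here ...
  WriteLn(SumItems(Data));
end;

begin
  Process;
end.
```

The compiler is free to ignore the inline hint, so check the generated assembler if the result matters.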
The classic: move calculations out of the loop. Pre-calculate partial expressions, and only keep the parts in the loop that depend on the loop counter.
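For illustration only, with made-up variable names (Scale, Offset, Factor), pre-calculating the loop-invariant part could look like this:

```pascal
program HoistDemo;
{$mode objfpc}

var
  Values: array[0..9] of Double;
  Scale, Offset, Factor: Double;
  i: Integer;
begin
  Scale := 2.5;
  Offset := 4.0;

  // Unoptimized form would be:
  //   Values[i] := i * (Scale * Offset + 1.0);
  // The bracket does not depend on i, so compute it once up front.
  Factor := Scale * Offset + 1.0;

  for i := Low(Values) to High(Values) do
    Values[i] := i * Factor; // only the part that depends on i stays inside

  WriteLn(Values[9]:0:2);
end.
```

At higher optimization levels the compiler may already do some of this on its own, so the gain is not guaranteed.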
Instead of "SomeFoo[LoopCounter]", use a pointer and advance it to the next item on each iteration.
(Though that one is in some cases redundant on modern CPUs.)
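A small sketch of the pointer variant, assuming SomeFoo is a plain array of Integer (the element type and the size are assumptions made for the example):

```pascal
program PointerWalkDemo;
{$mode objfpc}

var
  SomeFoo: array[0..999] of Integer;
  P: PInteger;
  i: Integer;
  Sum: Int64;
begin
  for i := 0 to High(SomeFoo) do
    SomeFoo[i] := i;

  Sum := 0;
  P := @SomeFoo[0];        // point at the first item
  for i := 0 to High(SomeFoo) do
  begin
    Sum := Sum + P^;       // read through the pointer instead of SomeFoo[i]
    Inc(P);                // typed pointer: advances by SizeOf(Integer)
  end;

  WriteLn(Sum);
end.
```

The pointer must never be dereferenced past the last item, which is why the loop still counts the iterations.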
The very tricky bit: if you can align the start of small, but high-iteration-count loops to a 32 byte boundary, that can gain or lose you a double-digit percentage in speed.
Unfortunately, even functions are only aligned to 16 bytes.
I have benchmarked code myself in the past. Just by changing the order in which procedures were declared (no other change), the speed varied by some 20% to 30%.
At least on Intel, because Intel has some caches (IIRC for micro-ops) that rely on 32 byte bounds.
So if you iterate some 1000 times over a loop, and if that loop has 32 or 64 bytes of code, then it runs fastest if it starts exactly on a 32 byte bound.
Unfortunately there is no option to enforce this. Maybe it can be done with asm blocks.