Today, I only use asm for very specific tasks which need specific opcodes, like SSE2, AES-NI, AVX or SSE4.2.
... and you will find thousands of lines of such manually tuned asm in mORMot, e.g. just for x86_64
https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.base.asmx64.inc and
https://github.com/synopse/mORMot2/blob/master/src/crypt/mormot.crypt.core.asmx64.inc or even
https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas
Otherwise, especially if I want the FPC compiler to inline, I write pascal code.
With some tricks like pointer arithmetic for some core routines.
And the latest versions of FPC tend to generate very good code.
The pointer() trick is the one to use for this example. It will be properly inlined, it will be really cross-platform and cross-CPU, and it will be faster than manual non-inlined asm.
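To give an idea of what such pascal code can look like, here is a minimal sketch. It is not the example discussed above and it is not taken from mORMot: CountByte is a hypothetical helper, written only to show pointer(s) combined with pointer arithmetic in an inlined function.

```pascal
program PointerSketch;

{$mode objfpc}{$H+}

// Hypothetical helper (an assumption for illustration, not mORMot code):
// counts how many bytes of Text are equal to Value
function CountByte(const Text: AnsiString; Value: byte): PtrInt; inline;
var
  p, pEnd: PByte;
begin
  result := 0;
  // pointer(Text) is the raw string pointer: nil for '', no RTL helper call
  p := pointer(Text);
  if p = nil then
    exit;
  // typed pointer arithmetic: pEnd is one byte past the last char
  pEnd := p + length(Text);
  while p < pEnd do
  begin
    if p^ = Value then
      inc(result);
    inc(p);
  end;
end;

begin
  writeln(CountByte('a,b,,c', ord(','))); // three commas in this text
  writeln(CountByte('', ord(',')));       // empty string takes the nil path
end.
```

Since it is plain pascal, the compiler is free to inline it and to allocate registers as it sees fit on any target CPU, which a standalone asm routine would not allow.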
To be fair, inlining asm would need more than... inlining... to be efficient.
You will need proper register allocation by the compiler, therefore you would need something closer to C/C++ intrinsics.
In real projects, inline asm is used in very few places. Intrinsics are the way to go. Or generate the asm from a high-level language like Perl or some other DSL.
So don't do premature optimization.
It is the root of all evil.
And trust the RTL and some tuned libraries to be fast enough for most purposes.