Well, first check how much of an improvement that actually gets you. Is it noticeable?
Also, are all of your routines affected, or is the speed gain limited to a few procedures?
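One way to answer both questions is to time each routine on its own, once per build. Here is a rough sketch in C (my choice of language, adapt as needed); `seconds_for` and `sample_work` are made-up names, and `sample_work` is only a placeholder for whichever routine you want to check. Note that `clock()` measures CPU time at coarse resolution, so use a large repeat count.

```c
#include <time.h>

typedef void (*routine_fn)(void);

/* Accumulator the placeholder workload writes to; volatile so the
   compiler cannot optimize the loop away. */
static volatile unsigned long sink = 0;

/* Hypothetical placeholder: replace with the routine under test. */
static void sample_work(void) {
    for (unsigned long i = 0; i < 1000; i++) {
        sink += i;
    }
}

/* Calls fn() `repeats` times and returns the CPU time used, in seconds. */
static double seconds_for(routine_fn fn, long repeats) {
    clock_t start = clock();
    for (long i = 0; i < repeats; i++) {
        fn();
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Run something like `seconds_for(sample_work, 1000000)` against each build and compare the numbers per routine. If only one or two routines move, you know where the gain is coming from.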
If only some of the code benefits, you can even include both versions of the procedure and decide at runtime which one to call.
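A minimal sketch of that idea in C, using a function pointer chosen once at startup. All names here (`sum_plain`, `sum_tuned`, `select_sum`) are invented for illustration, and `sum_tuned` is just a stand-in for "the same procedure built differently"; how you decide which one to pick (CPU check, quick timing run, config option) is up to you.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*sum_fn)(const uint32_t *data, size_t n);

/* Version 1: the baseline procedure. */
static uint64_t sum_plain(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Version 2: stand-in for the alternative build of the same procedure.
   Here it uses two accumulators; it must return the same result. */
static uint64_t sum_tuned(const uint32_t *data, size_t n) {
    uint64_t even = 0, odd = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += data[i];
        odd += data[i + 1];
    }
    if (i < n) {
        even += data[i];  /* leftover element when n is odd */
    }
    return even + odd;
}

/* Pick a version once; callers then go through the returned pointer. */
static sum_fn select_sum(int use_tuned) {
    return use_tuned ? sum_tuned : sum_plain;
}
```

Call `select_sum(...)` once at program start, keep the pointer, and use it everywhere. The indirect call costs a little, so this only pays off for procedures that do real work per call.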
For Intel (and likely AMD), play with "loopalign = 32", especially for very small loops with lots of iterations. Doing it for all your code may be counterproductive.