I've added a warning to the LLVM wiki page that should prevent the problem you encountered. FWIW, since the LLVM backend is also supported on macOS/x86-64 and Linux/x86-64, you can try it there too. I expect smaller performance gains on those targets, although that depends on how much LLVM is able to vectorise. You may want to use -Cfavx2 there as well (also when using plain FPC, although I don't expect much of an impact in that case).
Edit: you may also want to compile the AArch64 version using the FPC code generator with the options -O2 -Oonopeephole (or -O3 -Oonopeephole, if you prefer). That will enable, among other things, register variables and common subexpression elimination (CSE), but disable the (buggy) peephole optimiser. That way you'll still get most of the performance gains that FPC has to offer (I doubt the peephole optimiser makes a big difference at this point). Another option you may want to try, both with and without LLVM, is -Oofastmath (it's not part of -O3).
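To make the combinations above easier to compare, here is a rough sketch that only prints candidate command lines; it does not run the compiler. The file name bench.pas and the helper fpc_cmd are placeholders of mine, and pairing -Cfavx2 with -O3 on x86-64 is an assumption rather than something tested here. The flags needed to select the LLVM backend are deliberately left out, so take those from the wiki page.

```shell
#!/bin/sh
# Sketch only: prints candidate FPC command lines rather than running them.
# "bench.pas" is a hypothetical file name; substitute your own program.

# fpc_cmd OPTIONS SOURCE: echo the compiler invocation for one variant.
fpc_cmd() {
    echo "fpc $1 $2"
}

SRC="bench.pas"

# AArch64, FPC's own code generator: optimisations on, peephole optimiser off.
fpc_cmd "-O2 -Oonopeephole" "$SRC"
fpc_cmd "-O3 -Oonopeephole" "$SRC"

# Same, plus fast-math, which -O3 does not enable by itself.
fpc_cmd "-O3 -Oonopeephole -Oofastmath" "$SRC"

# x86-64 (macOS or Linux): allow AVX2 code generation.
# Combining it with -O3 is an assumption made for this sketch.
fpc_cmd "-O3 -Cfavx2" "$SRC"
```

Timing each resulting binary on the same input should show fairly quickly whether -Oofastmath or -Cfavx2 matters for your code.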