Something's fishy about this benchmark.
With MAX=99999999, I get 6 seconds here with FPC3.4.2, approximately the same as the OP. To see what causes the load, I removed the inner cos(tanh(x)) so that only the sine remains, and the run time dropped to 3 seconds. So the load is dominated by the trigonometric calls themselves.
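To put those timings in perspective, here is a rough back-of-the-envelope calculation in Python. It only converts the two run times quoted above into an average cost per loop iteration; the helper name ns_per_iteration is made up for illustration, and the 6 s and 3 s figures are the approximate values from this post, not new measurements.

```python
MAX = 99_999_999  # iteration count used in the benchmark

def ns_per_iteration(seconds, iterations=MAX):
    """Average wall-clock cost of one loop iteration, in nanoseconds."""
    return seconds / iterations * 1e9

full = ns_per_iteration(6.0)       # sin(cos(tanh(x))): about 6 s in total
sine_only = ns_per_iteration(3.0)  # sine only: about 3 s in total
inner = full - sine_only           # share left over for cos(tanh(x))

print(f"full expression: {full:.0f} ns per iteration")
print(f"sine only:       {sine_only:.0f} ns per iteration")
print(f"cos(tanh(x)):    {inner:.0f} ns per iteration")
```

So roughly half of each iteration goes into the outer sine alone, and the other half into the inner cos(tanh(x)).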
Looking at the assembler, the CPU is really only calling FSIN repeatedly; neither the compiler optimisation settings nor the FPU type make a difference. So I cannot see how any language could run this faster, except by implicit parallelisation, which I don't believe ADA can do.
I suspect that the OP has run the benchmark with different MAX values; in the first post he writes that ADA runs in under 3 seconds with MAX=999999, but in the benchmark MAX=99999999, which is about 100 times as many iterations and which ADA allegedly does in 1.5 seconds...
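The mismatch is easy to see when the quoted figures are turned into iterations per second. This is only a sketch using the numbers mentioned in this thread; the helper name iterations_per_second is made up, and "under 3 seconds" is treated as exactly 3 seconds, so the first-post rate is a lower bound and the ratio an upper bound.

```python
def iterations_per_second(max_iterations, seconds):
    """Loop iterations completed per second of run time."""
    return max_iterations / seconds

# ADA, first post: MAX=999999 in "under 3 seconds" (assumed 3.0 s here)
first_post = iterations_per_second(999_999, 3.0)
# ADA, benchmark table: MAX=99999999 in 1.5 seconds
benchmark = iterations_per_second(99_999_999, 1.5)
# FPC here: MAX=99999999 in about 6 seconds
fpc = iterations_per_second(99_999_999, 6.0)

print(f"ADA, first post: {first_post:,.0f} iterations/s")
print(f"ADA, benchmark:  {benchmark:,.0f} iterations/s")
print(f"FPC, my run:     {fpc:,.0f} iterations/s")
print(f"ADA benchmark vs first post: about {benchmark / first_post:.0f}x")
print(f"ADA benchmark vs FPC:        about {benchmark / fpc:.0f}x")
```

Taken at face value, the two ADA figures differ by a factor of about 200, which is hard to explain unless the runs used different MAX values.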