I think it has nothing to do at all with the speed of writeln, and it's again a case of micro benchmarking gone awry. Here were my initial results with ppc386 on OS X with plain -O2 (with the "time" command, so background activity is eliminated):
2.6.4, with writeln
user 0m5.572s
sys 0m0.293s
2.6.4, no writeln
user 0m4.648s
sys 0m0.005s
3.1.1, with writeln
user 0m2.243s
sys 0m0.266s
3.1.1, no writeln
user 0m2.038s
sys 0m0.004s
So for me, writeln actually seems to have more overhead in 2.6.4 (about 0.92s of user time) than in 3.1.1 (about 0.21s). Then I decided to test the overhead with cwstring included, and things got really weird:
2.6.4, with writeln, with cwstring
user 0m4.216s
sys 0m0.269s
2.6.4, no writeln, with cwstring
user 0m4.020s
sys 0m0.006s
3.1.1, with writeln, with cwstring
user 0m1.427s
sys 0m0.254s
3.1.1, no writeln, with cwstring
user 0m1.275s
sys 0m0.004s
Look at that: including cwstring makes the program a lot faster both on 2.6.4 and 3.1.1, even when there's no input/output at all! If you see something like that, you can be virtually certain it's a case of memory alignment.
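One way to check that suspicion is to print where the global variables actually end up in both builds. The following is only a sketch: the variables are hypothetical stand-ins for the benchmark's globals, Misalignment and the USE_CWSTRING define are names made up for this example, and the 64-byte boundary is an assumed cache line size.

```pascal
program globalalign;

{ Build twice, once with -dUSE_CWSTRING and once without, and compare
  the output of the two binaries. }
{$ifdef USE_CWSTRING}
uses
  cwstring;
{$endif}

var
  { hypothetical stand-ins for the benchmark's global variables }
  counter: longint;
  a, b: double;

{ Returns how far an address lies past the previous multiple of boundary. }
function Misalignment(p: pointer; boundary: PtrUInt): PtrUInt;
begin
  Misalignment := PtrUInt(p) mod boundary;
end;

begin
  writeln('a: offset from 8-byte boundary  = ', Misalignment(@a, 8));
  writeln('b: offset from 8-byte boundary  = ', Misalignment(@b, 8));
  { 64 bytes is an assumption about the cache line size }
  writeln('a: offset from 64-byte boundary = ', Misalignment(@a, 64));
end.
```

The two builds may well print the same offsets for such a small program; the point is only that pulling in another unit can shift where the globals are placed, and this makes any shift visible.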
And indeed: here is what happens if I move all of the code of the program into a subroutine (so all variables become local variables) and then play with the maximum alignment for local variables (all cases are without writeln; the results are the same with or without cwstring):
3.1.1, -Oalocalmax=4
user 0m1.233s
sys 0m0.004s
3.1.1, -Oalocalmax=8 (or anything else > 4; the generated code is identical to that for 8)
user 0m3.843s
sys 0m0.014s
So as soon as the maximum alignment for locals is increased above 4, you get a huge increase in run time here. It's clear that there's a cache effect at play somewhere, because forcing the alignment to 4 bytes means that several doubles are now only aligned to 4 bytes (which in theory should reduce performance, not improve it). In the original code, all variables were global variables, and hence including another unit will affect their alignment/placement too.
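The same kind of check works for the local-variable case. This is a minimal sketch, not the actual benchmark: RunBenchmark and its loop are hypothetical stand-ins for the real code moved into a subroutine, and Misalignment is a helper invented for the example. Note that taking the address of a local forces it into memory, so this is a diagnostic build and not something to time.

```pascal
program localalign;

{ Returns how far an address lies past the previous multiple of boundary. }
function Misalignment(p: pointer; boundary: PtrUInt): PtrUInt;
begin
  Misalignment := PtrUInt(p) mod boundary;
end;

{ Hypothetical stand-in for the benchmark body moved into a subroutine,
  so that its variables are locals on the stack. }
procedure RunBenchmark;
var
  i: longint;
  x, sum: double;
begin
  sum := 0.0;
  x := 1.0;
  for i := 1 to 1000 do
    sum := sum + x * 0.5;
  { report where the doubles ended up in the stack frame }
  writeln('x:   offset from 8-byte boundary = ', Misalignment(@x, 8));
  writeln('sum: offset from 8-byte boundary = ', Misalignment(@sum, 8));
  { print the result so the loop is actually used }
  writeln('result: ', sum:0:1);
end;

begin
  RunBenchmark;
end.
```

Compiling it once with -O2 -Oalocalmax=4 and once with -O2 -Oalocalmax=8 and comparing the printed offsets should show whether the doubles really move between the two settings on a given system.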
Now, all of the above is without -Cfsse2. If I add -Cfsse2, then the speed is the same with 4 and 8 byte alignment. Reason: all values are kept in SSE2 registers, so the stack alignment is irrelevant.
Now, if you add writeln, the impact can become bigger again, because the SSE values then have to be spilled to the stack and hence cache effects come back into play. Another thing that may be relevant: if FPC is able to optimise the register-based code better in 3.1.1 than in 2.6.4 (or is better able to put global variables into registers), then logically the performance degradation when values need to be spilled will be relatively larger in 3.1.1 than in 2.6.4. But that could just be because 2.6.4 generated worse code to start with.
And now I've spent way more time on this already than I ever planned to.