I wrote a simulation program whose core is encapsulated in the class TSim.
For optimum performance on multi-core processors, I run two instances of TSim in separate threads and combine their results after both have finished. Compared to single-threaded operation, I get a 50% to 90% performance gain on a dual-core CPU.
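To illustrate the setup, here is a minimal sketch (not the actual program). TSimThread, the Total field, and the summing loop are hypothetical placeholders; the real TSim does the simulation work instead.

```pascal
program TwoSims;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cthreads,{$endif}  // thread manager, must come first on Unix
  Classes;

type
  // Hypothetical stand-in for the real simulation core.
  TSim = class
  public
    Total: Int64;
    procedure Run(Steps: Integer);
  end;

  // Thread wrapper that owns exactly one TSim instance.
  TSimThread = class(TThread)
  private
    FSim: TSim;
    FSteps: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(Steps: Integer);
    destructor Destroy; override;
    property Sim: TSim read FSim;
  end;

procedure TSim.Run(Steps: Integer);
var
  i: Integer;
begin
  Total := 0;
  for i := 1 to Steps do
    Total := Total + i;  // placeholder for the real simulation work
end;

constructor TSimThread.Create(Steps: Integer);
begin
  inherited Create(True);  // create suspended; started explicitly below
  FSteps := Steps;
  FSim := TSim.Create;
end;

destructor TSimThread.Destroy;
begin
  FSim.Free;  // safe here because the main program calls WaitFor first
  inherited Destroy;
end;

procedure TSimThread.Execute;
begin
  FSim.Run(FSteps);
end;

var
  A, B: TSimThread;
begin
  A := TSimThread.Create(1000);
  B := TSimThread.Create(1000);
  A.Start;
  B.Start;
  A.WaitFor;  // block until both simulations are done
  B.WaitFor;
  WriteLn('Combined: ', A.Sim.Total + B.Sim.Total);
  A.Free;
  B.Free;
end.
```

Each thread works only on its own TSim, and the results are read only after WaitFor has returned, so no locking is needed for the combine step.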
The simulation needs a lot of random numbers, so I replaced FPC's own Mersenne Twister with a simple but very fast random generator, encapsulated in the object TRandGen.
I started with a single global instance of TRandGen, which is probably not the best idea. Having two concurrent threads manipulate RandGen's internal state might be detrimental to the generator's properties... not sure. I expect that FPC's random generator uses critical sections to avoid that.
But that is not the issue. What puzzles me is that the global RandGen seems to cause a bottleneck: performance drops to almost that of single-core operation.
I can easily get around it by making the instances of TRandGen local to TSim. But why does the standard FPC generator, which is global in unit System, not seem to cause this bottleneck?
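For reference, the workaround looks roughly like the sketch below. The actual algorithm inside TRandGen is not shown in this post, so Marsaglia's xorshift32 is used here purely as an assumed stand-in; Init, Next, Step, and FRand are hypothetical names.

```pascal
program LocalRandGen;
{$mode objfpc}{$H+}
{$R-}{$Q-}  // the generator relies on plain 32-bit wrap-around arithmetic

type
  // Assumed stand-in for the fast generator: xorshift32.
  TRandGen = object
    State: LongWord;
    procedure Init(Seed: LongWord);
    function Next: LongWord;
  end;

  TSim = class
  private
    FRand: TRandGen;  // per-instance state, not shared with other TSim objects
  public
    constructor Create(Seed: LongWord);
    function Step: LongWord;
  end;

procedure TRandGen.Init(Seed: LongWord);
begin
  if Seed = 0 then
    Seed := 1;  // xorshift must not start from an all-zero state
  State := Seed;
end;

function TRandGen.Next: LongWord;
begin
  State := State xor LongWord(State shl 13);
  State := State xor (State shr 17);
  State := State xor LongWord(State shl 5);
  Result := State;
end;

constructor TSim.Create(Seed: LongWord);
begin
  inherited Create;
  FRand.Init(Seed);
end;

function TSim.Step: LongWord;
begin
  Result := FRand.Next;  // touches only this instance's generator state
end;

var
  S1, S2: TSim;
begin
  // Different seeds so the two simulations do not draw the same sequence.
  S1 := TSim.Create(12345);
  S2 := TSim.Create(67890);
  WriteLn(S1.Step, ' ', S2.Step);
  S1.Free;
  S2.Free;
end.
```

With this layout each thread reads and writes only the generator state embedded in its own TSim, so the two threads no longer write to the same variable.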