Here is another interesting finding concerning the performance of memory management and dynamic arrays.
My first implementation of Storage.pas, which achieved an average time of 4373us when compiled with -O4 and run on my x86 test machine, used the following data structure:
TreePtr = ^Tree;
TreeList = array of TreePtr;
Tree = object
sub: TreeList;
constructor init;
destructor deinit;
end;
Tree.deinit called dispose for all subs. When I remove the constructor and destructor and do deletions elsewhere, the average time goes down to 3800us; so the constructor/destructor calls are responsible for a ~13% speed-down.
When I only allocate without deleting anything, the average time goes down to 2300us; so the iteration over all Tree instances and calling dispose() on each causes a speed-down of ~65%!
In the C++ version I had an array of Tree, where Tree was embedded by value in the array, thus didn't require additional allocations/deallocations per sub element. Unfortunately I wasn't able to reproduce this with the Pascal syntax which didn't allow to use the type in the object by value. But fortunately user BeniBela demonstrated how to properly do this in FreePascal (see this thread:
https://forum.lazarus.freepascal.org/index.php?topic=64275.0). So now I'm using this data structure:
Tree = object
type TreeList = array of Tree;
public
sub: TreeList;
end;
The trick apparently is a local type declaration. The result is an average time of 2514us (instead of 4373us, corresponding to an overall speed-up of ~3%). So the solution with a dynamic allocation per array element causes a speed-down of 51% compared to the solution where Tree is embedded in the array by value.
I also checked the performance of the approach recommended by user runewalsh where the array of Tree is replaced by ^Tree and allocated by GetMem(count * sizeof(Tree)). This approach is tricky and not ideomatic at all (because it disregards language support for the required feature altogether and corresponds rather to C than to Pascal style of programming). But it is understandable why it is used; the resulting average time for the same benchmark goes down to 1010us (instead of 2514us, corresponding to an overall speed-up of ~7%); the ideomatic dynamic array solution is therfore 2.5 times or 150% slower than the non-ideomatic low-level solution based on GetMem/FreeMem.
All in all, these microbenchmarks have already revealed areas where FreePascal wastes performance unnecessarily compared to C++, and where the compiler could still be improved.
If you have any comments or questions, send me an email or post an issue to
https://github.com/rochus-keller/are-we-fast-yet/.