Hi Julkas:
Great question! You asked what is it that I am trying to solve?
I'm working with arrays that need to hold up to 10 million records. Records are added, removed and queried randomly. Finding a record fast is critical for good performance.
Clearly, I could use a balanced tree but I wanted to know how a hash would perform, one in which collision are stored inside collision lists. In such a configuration, the growth of the hash is key to maintaining good performance: Once the number of elements in the hash (count) exceeds its current size (not counting collision lists), do you double it? do you grow it by 20%? What's the right approach?
I created this test program to find out what hash size make sense, and I was surprised to see the high level of collision (upward of 20%) with collision lists that are as deep as 7 elements. The reason why 7 matters is that if you compare the hash to a binary search in a balanced binary tree, then since log of 10 million is 16 (or 17), it means that the hash will outperform the binary tree (in the worst case). On average, the performance might be similar or even better with a balanced tree, but since I'm running in mem, the actual comparison might be moot.
What is not moot is the additional space required by both solutions: a balanced binary tree requires additional storage to keep track of the tree structure. I haven't worked out the actual math yet. The point of this post is that I was taken aback by the high rate of collisions: I was expecting a rate in a single digit (particularly when the size of the hash is larger than the number of nodes).
The attached program
requires the HashLib package which is found here
https://github.com/Xor-el/HashLib4Pascal. Simply unzip the program into a folder, install the HashLib package and run it. It is mostly self-explanatory. If you have questions, shoot me an email.
You'll see that the collision rate is high even with murmur3 and fnv1a. I thought that I must have made a mistake somewhere, but it looks like that's the nature of hashing: high collision lists with (hopefully) small collision lists to make up for it and a waste of memory that is, hopefully, no worse than what you get when implementing a balanced binary tree.