Sub-optimal Hashing Functions -- Where have I gone wrong?

EganSolo

Sr. Member
Posts: 290

Sub-optimal Hashing Functions -- Where have I gone wrong?

« on: January 25, 2020, 03:07:05 am »

I've written the attached program (Simple command line program) to test a few hash functions. I've relied on the HashLib package and on a few hash methods I wrote.
The program is invoked as follows:

test_hashfunction cs=false hs=120 ct=100

Where cs stands for case sensitive, hs for the hash size and ct for the count of keys to create. In the case above, case sensitive is false, the size of the hash is 120 slots for 100 keys.

The program then outputs a simple summary of the run as follows:

Test FNV1a / Sequential Ids
KeyCount : 100
HashSize : 120
Total number of collisions: 20
Number of collion lists : 10
Max depth of list : 2
Rate of collision : 20%

The test used FNV1a as a hashing method and a sequential id key generator (included in the code). It has a total of twenty collisions distributed across 10 collision list with a max depth of 2. The total rate of collision is therefore 20%

I've used two key generators: one that creates keys of the form ID_000000001, ID_00000002, etc. They have the same length and they keep increasing sequentially. The second key generator creates GUIDs.

What's troubling me is that none of the hashing methods I've used could improve on this 20% collision rate. In fact when using a GUID-based id generator, the collision rate climbs us to 50-60%. Below is the output I get when I run this program.

Question: Are these rates right? I thought hashing methods did a better distribution job. Your input is very much appreciated.

Test FNV1a / Sequential Ids
KeyCount : 100
HashSize : 120
Total number of collisions: 20
Number of collion lists : 10
Max depth of list : 2
Rate of collision : 20%

Test FNV1b / Sequential Ids
KeyCount : 100
HashSize : 120
Total number of collisions: 52
Number of collion lists : 17
Max depth of list : 4
Rate of collision : 52%

Test Murmur3 / Sequential Ids
KeyCount : 100
HashSize : 120
Total number of collisions: 54
Number of collion lists : 23
Max depth of list : 4
Rate of collision : 54%

Test Murmur3b / Sequential Ids
KeyCount : 100
HashSize : 120
Total number of collisions: 60
Number of collion lists : 26
Max depth of list : 4
Rate of collision : 60%

Test HashLibMM3 / Sequential Ids
KeyCount : 100
HashSize : 120
Total number of collisions: 59
Number of collion lists : 24
Max depth of list : 4
Rate of collision : 59%

Test FNV1a / Guids
KeyCount : 100
HashSize : 120
Total number of collisions: 63
Number of collion lists : 27
Max depth of list : 4
Rate of collision : 63%

Test FNV1b / Guids
KeyCount : 100
HashSize : 120
Total number of collisions: 55
Number of collion lists : 23
Max depth of list : 4
Rate of collision : 55%

Test Murmur3 / Guids
KeyCount : 100
HashSize : 120
Total number of collisions: 61
Number of collion lists : 26
Max depth of list : 4
Rate of collision : 61%

Test Murmur3b / Guids
KeyCount : 100
HashSize : 120
Total number of collisions: 67
Number of collion lists : 29
Max depth of list : 4
Rate of collision : 67%

Test HashLibMM3 / Guids
KeyCount : 100
HashSize : 120
Total number of collisions: 52
Number of collion lists : 24
Max depth of list : 3
Rate of collision : 52%

Test_HashFunction.zip (2.86 kB - downloaded 65 times.)

Logged

440bx

Hero Member
Posts: 4029

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #1 on: January 25, 2020, 08:11:19 am »

I couldn't compile the project you posted (it's missing a unit.)

However, I can make some very general observations that are very likely to be negatively influencing the results you're getting.

First, the table capacity should be a prime number. Using a non-prime and worse, an even number (120), is not going to produce optimal results. The reason is, when you mod fHashSize, non prime capacities generate cycles and even numbers are the worst offenders ~~because you're losing the LSB which, when hashing, is very significant~~ .

The other thing I see is that most of those algorithm depend on overflowing the calculated hash index. That's usually not a good thing, because it will increase the tendency of cycling through values, particularly when combined with a non-prime table capacity (but even a prime number won't help much in that situation.) IOW, the result is, you're getting more clustering than you normally would.

My suggestions would be: 1. ensure you use a prime number for the table capacity and 2. chose algorithms that won't cause the hash value to often overflow (it seems multiplicative hash algorithms are many programmer's favorite but, they often don't perform as well as desired/expected.)

« Last Edit: January 25, 2020, 08:31:48 am by 440bx »

Logged

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

guest58172

Guest

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #2 on: January 25, 2020, 10:37:46 am »

I'd rather keep a bucket count equal or over to the next pow of 2 following the number of keys.

127 keys -> 128 buckets
128 keys -> 128 buckets
129 keys -> 256 buckets
...
1025 keys -> 2048 buckets
...
etc...

Why ?

Because ideally 1 bucket should be occupied by 1 key only. Then collision handling strategies are **just** there because hashing funcs are (mostly) ) imperfects. If you follow this rule you'll get a better idea of collision handling.

Logged

Thaddy

Hero Member
Posts: 14373
Sensorship about opinions does not belong here.

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #3 on: January 25, 2020, 10:44:45 am »

The project is not complete, but so many collisions on such a small set does not even occur with a simple elf hash.
Can you provide the full source?

Logged

Object Pascal programmers should get rid of their "component fetish" especially with the non-visuals.

howardpc

Hero Member
Posts: 4144

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #4 on: January 25, 2020, 11:09:30 am »

EganSolo did mention that he used the HashLib package.
It is available via the OPM, and provides the missing HlpHashFactory unit.

Logged

julkas

Guest

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #5 on: January 25, 2020, 05:48:29 pm »

Try DJB hash.

Logged

EganSolo

Sr. Member
Posts: 290

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #6 on: January 25, 2020, 10:15:47 pm »

Thanks for everyone's reply!

The unit is from a free pascal library called HashLib (I should have mentioned that!). It's wiki page is here: https://wiki.freepascal.org/HashLib4Pascal. The link to the actual package is found at the bottom of this page and it's https://github.com/Xor-el/HashLib4Pascal.
I have tried a variety of hash sizes including prime number sizes. Some were less than 100 (like 97) and some were greater than 100 (like 113). Here's the weird thing: the best result I achieved was with hash size = 120, which is why I'm wondering if I'm doing something wrong!
440bx has a good point: yes, all of the hash functions I'm using assume overflowing will happen. What's intriguing is the fact that some of these functions such as FNV1a and Murmur3 are touted as some of the best hash methods out there. I didn't come up with those, as the excellent HashLib package shows.
I'll give DJB a go and will report back here
If someone else could download, install and run HashLib and then run the program, I'd be really curious to know if you're results are different

Thanks again for everyone's input.

Logged

julkas

Guest

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #7 on: January 26, 2020, 12:33:24 pm »

@EganSolo What is your goal?
I don’t know much about theory, but often use hashing in practice.
For example, look at following CP problem (N, K, 1 ≤ K ≤ N ≤ 10**5) https://www.spoj.com/problems/ADACLEAN/,
here even Python works fine (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm).

Logged

EganSolo

Sr. Member
Posts: 290

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #8 on: January 28, 2020, 05:16:44 am »

Hi Julkas:
Great question! You asked what is it that I am trying to solve?
I'm working with arrays that need to hold up to 10 million records. Records are added, removed and queried randomly. Finding a record fast is critical for good performance.
Clearly, I could use a balanced tree but I wanted to know how a hash would perform, one in which collision are stored inside collision lists. In such a configuration, the growth of the hash is key to maintaining good performance: Once the number of elements in the hash (count) exceeds its current size (not counting collision lists), do you double it? do you grow it by 20%? What's the right approach?

I created this test program to find out what hash size make sense, and I was surprised to see the high level of collision (upward of 20%) with collision lists that are as deep as 7 elements. The reason why 7 matters is that if you compare the hash to a binary search in a balanced binary tree, then since log of 10 million is 16 (or 17), it means that the hash will outperform the binary tree (in the worst case). On average, the performance might be similar or even better with a balanced tree, but since I'm running in mem, the actual comparison might be moot.

What is not moot is the additional space required by both solutions: a balanced binary tree requires additional storage to keep track of the tree structure. I haven't worked out the actual math yet. The point of this post is that I was taken aback by the high rate of collisions: I was expecting a rate in a single digit (particularly when the size of the hash is larger than the number of nodes).

The attached program requires the HashLib package which is found here https://github.com/Xor-el/HashLib4Pascal. Simply unzip the program into a folder, install the HashLib package and run it. It is mostly self-explanatory. If you have questions, shoot me an email.

You'll see that the collision rate is high even with murmur3 and fnv1a. I thought that I must have made a mistake somewhere, but it looks like that's the nature of hashing: high collision lists with (hopefully) small collision lists to make up for it and a waste of memory that is, hopefully, no worse than what you get when implementing a balanced binary tree.

TestHF.zip (11.78 kB - downloaded 58 times.)

Logged

440bx

Hero Member
Posts: 4029

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #9 on: January 28, 2020, 06:43:08 am »

Quote from: EganSolo on January 28, 2020, 05:16:44 am

I'm working with arrays that need to hold up to 10 million records. Records are added, removed and queried randomly. Finding a record fast is critical for good performance.
Clearly, I could use a balanced tree but I wanted to know how a hash would perform, one in which collision are stored inside collision lists. In such a configuration, the growth of the hash is key to maintaining good performance: Once the number of elements in the hash (count) exceeds its current size (not counting collision lists), do you double it? do you grow it by 20%? What's the right approach?

Changing the hash table size requires rehashing the entries in the new table. Depending on the number of keys and the complexity of the hash algorithm, that can be very time consuming. In your case, you're anticipating 10 million entries, at 8 bytes per pointer (presuming 64bit), that's a hash index of 80MB. It sounds like a lot but, todays machines have gigabytes of memory, 80MB is not even a drop in the bucket, it's mist. I'd say, allocate as large a block as you anticipate necessary to hold the 10 million entries. If the hash function is "good", a table with an 80% load will have the majority of chains be no longer than 2 elements (note that this is an _average_, not a guarantee.) This means your hash table should have a capacity of about 12,500,003 (a prime number) entries.

Note that I put "good" in quotation marks. The word "good" is normally used to indicate that a particular hashing algorithm generates a set of keys that is really random. A number of assumptions are made before a hashing algorithm is declared "good". To make a very long story short, the most reasonable (and probably best) is to try a few algorithms (which is what you did, you got the right idea) and pick the one that produces close to the theoretical results it should produce.

The other thing is, hashing algorithms are a dime a billion and, you can create another one that works well with the data you're dealing with (that's the one you want.)

From the results you posted, it seems the algorithms in HashLib don't like your data much. Personally, I'd look for something else. You'll find a number of hashing functions to try at : https://www.partow.net/programming/hashfunctions/ and a little refreshing common sense about hashing algorithms in general. Also, at the bottom of the page, you'll find every algorithm he mentions implemented in a number of languages, Pascal included, IOW, help yourself to whichever gives you the best results.

HTH.

Logged

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

Xor-el

Sr. Member
Posts: 404

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #10 on: January 28, 2020, 08:28:03 am »

@EganSolo. Have you tried out XXHash32? I have heard good things about this algorithm.
It is also available in HashLib4Pascal.

Logged

bytebites

Hero Member
Posts: 640

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #11 on: January 28, 2020, 09:41:31 am »

Code: Pascal [Select][+]

function Simple(const S : String) : Cardinal;
begin
  Result := StrToInt(copy(s,4));
  Result := Result mod fHashSize;
end;   

This hash function gives collisions rate 0% in the serialID-test.

Logged

julkas

Guest

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #12 on: January 28, 2020, 06:10:09 pm »

Quote from: EganSolo on January 28, 2020, 05:16:44 am

I'm working with arrays that need to hold up to 10 million records. Records are added, removed and queried randomly. Finding a record fast is critical for good performance.

May be you can use database (in memory?) for your project?

Logged

garlar27

Hero Member
Posts: 652

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

« Reply #13 on: January 28, 2020, 08:03:30 pm »

Quote from: julkas on January 28, 2020, 06:10:09 pm

Quote from: EganSolo on January 28, 2020, 05:16:44 am
I'm working with arrays that need to hold up to 10 million records. Records are added, removed and queried randomly. Finding a record fast is critical for good performance.
May be you can use database (in memory?) for your project?

Yes, I recall someone who use it to process DNA comparisons. In-memory DB did the job for him/her, I also recall that SQLite didn't do a good job since the SQLite temporary memory table end up on the hard-drive

Logged

Lazarus

Bookstore

Search

Recent

Author Topic: Sub-optimal Hashing Functions -- Where have I gone wrong? (Read 3357 times)

EganSolo

Sub-optimal Hashing Functions -- Where have I gone wrong?

440bx

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

guest58172

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

Thaddy

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

howardpc

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

julkas

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

EganSolo

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

julkas

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

EganSolo

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

440bx

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

Xor-el

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

bytebites

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

julkas

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

garlar27

Re: Sub-optimal Hashing Functions -- Where have I gone wrong?

	Computer Math and Games in Pascal (preview)
	Lazarus Handbook