The primary goal of a hashmap is to store a set of entries and provide near-constant-time lookup by a unique key. There are two common styles of hashmap implementation:
- Separate chaining: an array of buckets, where each bucket holds a linked list of the entries that hash to it.
- Open addressing: a single array allocated with extra space, so index collisions are resolved by placing the entry in another (e.g., adjacent) slot.
Separate chaining is preferable if the hashmap may have a poor hash function, it is not desirable to pre-allocate storage for potentially unused slots, or entries may have variable size. This type of hashmap can continue to function relatively efficiently even when the load factor exceeds 1.0. The trade-off is the extra memory required in each entry to store linked-list pointers.
Hashmaps using open addressing have potential performance advantages when the load factor is kept below a certain threshold (generally about 0.7) and a reasonably good hash function is used. This is because they avoid the potential cache misses and many small memory allocations associated with linked lists, and perform all operations in a contiguous, pre-allocated array. Iteration through all elements is also cheaper. The catch is that open-addressing hashmaps must be reallocated to a larger size and rehashed to maintain an ideal load factor, or they face a significant performance penalty. By construction, their load factor can never exceed 1.0.
Key performance metrics to evaluate when creating a hashmap include:
- Maximum load factor
- Average collision count on insertion
- Distribution of collisions: uneven distribution (clustering) could indicate a poor hash function.
- Relative time for various operations: put, get, and remove, for both existing and non-existing entries.
Here is a flexible hashmap implementation I wrote, which uses open addressing with linear probing for collision resolution:
https://github.com/DavidLeeds/hashmap