hash_map uses a hash table, which is "constant" time in theory. Most implementations resolve collisions within the table itself (open addressing). What happens in reality is:
- It creates a big table
- You have a "hash" function for your object that generates a random-looking place in the table (random-looking, but the hash function always returns the same value for the same object). Usually the slot is the actual 32-bit (or 64-bit) hash value modulo the size of the table.
- The table checks whether that cell is free. If so, it places the item there. If not, it checks whether the element already in the cell is the one you are trying to insert; if so, it is a duplicate and nothing is inserted. Otherwise this is a "collision", and some formula is used to find another cell, repeating until either a duplicate or an empty cell is found. (A sketch of this loop appears after the list.)
- When the table gets too full it resizes. A time-efficient implementation stores each element's original hash value alongside it, so it does not need to recalculate the hashes when it resizes. In addition, comparing the stored hashes is usually faster than comparing the elements, so while searching it can compare hashes first as a pre-step to eliminate most of the collisions.
- If you never delete anything, this is all simple. Deleting elements adds extra complexity, however. A cell whose element has been deleted is in a different state from one that was empty all along: you may have had collisions there, and if you simply empty the cell, the elements further along the probe chain will no longer be found. So there is usually some "deleted" mark. When we want to reuse such a cell, we still have to keep probing in case there is a duplicate further down the chain (in which case we cannot insert), and then remember to come back and reuse the deleted cell.
- The usual constraint is that your objects must support an equality check, but Dinkumware (or was it SGI?) implemented theirs with operator<, which might be slower but has the advantage of decoupling your elements from the type of associative container they can be stored in, although you still need a hash function to store them in a hash table.
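Putting those steps together, here is a minimal sketch of open-addressing insertion with linear probing, stored hashes, and deletion marks. All the names (HashSet, Slot, State) are made up for illustration, and a real implementation differs in many details (probe sequence, load factor, rehashing when tombstones build up):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class HashSet {
    enum class State : std::uint8_t { Empty, Occupied, Deleted };
    struct Slot {
        State state = State::Empty;
        std::size_t hash = 0;   // stored so resizing never recomputes hashes
        std::string value;
    };
    std::vector<Slot> slots_ = std::vector<Slot>(16);
    std::size_t occupied_ = 0;  // live elements
    std::size_t used_ = 0;      // live elements + deleted marks

public:
    bool insert(const std::string& value) {
        if ((used_ + 1) * 4 > slots_.size() * 3)    // keep load under 75%,
            grow();                                 // counting deleted marks too
        const std::size_t h = std::hash<std::string>{}(value);
        std::size_t i = h % slots_.size();          // hash mod table size
        std::size_t tombstone = slots_.size();      // first reusable deleted cell
        for (;;) {
            Slot& s = slots_[i];
            if (s.state == State::Empty) {
                // No duplicate anywhere along the probe chain: insert,
                // reusing an earlier deleted cell if we passed one.
                Slot& dst = (tombstone < slots_.size()) ? slots_[tombstone] : s;
                if (&dst == &s) ++used_;            // consumed a fresh cell
                dst.state = State::Occupied;
                dst.hash = h;
                dst.value = value;
                ++occupied_;
                return true;
            }
            if (s.state == State::Deleted) {
                if (tombstone == slots_.size()) tombstone = i;
            } else if (s.hash == h && s.value == value) {
                return false;   // cheap hash compare first, full compare second
            }
            i = (i + 1) % slots_.size();            // linear probing
        }
    }

    bool erase(const std::string& value) {
        const std::size_t h = std::hash<std::string>{}(value);
        std::size_t i = h % slots_.size();
        while (slots_[i].state != State::Empty) {
            Slot& s = slots_[i];
            if (s.state == State::Occupied && s.hash == h && s.value == value) {
                s.state = State::Deleted;   // mark, don't empty: elements further
                --occupied_;                // along this probe chain must stay findable
                return true;
            }
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    void grow() {
        std::vector<Slot> old =
            std::exchange(slots_, std::vector<Slot>(slots_.size() * 2));
        occupied_ = used_ = 0;
        for (Slot& s : old) {
            if (s.state != State::Occupied) continue;  // deleted marks are dropped
            std::size_t i = s.hash % slots_.size();    // stored hash: no rehash needed
            while (slots_[i].state == State::Occupied) i = (i + 1) % slots_.size();
            slots_[i] = std::move(s);
            ++occupied_;
            ++used_;
        }
    }
};
```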
The theory is that if you have a big enough table, the operations are constant time, i.e. they do not depend on the number of elements you actually have. In practice, of course, the more elements you have, the more collisions occur.
std::map uses a binary tree (typically a red-black tree). There is no need to define a hash function for an object, just a strict ordering comparison. On insertion it descends the tree to find the insertion point (and whether there are any duplicates) and adds the node, and may need to rebalance the tree so that it stays close to balanced. Rebalancing time is also proportional to the depth of the tree, so all these operations are O(log N), where N is the number of elements.
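For comparison, here is a minimal example of using std::map with nothing but an ordering. Point is a hypothetical key type made up for this sketch, and note that no hash function is defined anywhere:

```cpp
#include <map>
#include <string>

// Hypothetical key type: only a strict ordering is required, no hash function.
struct Point {
    int x, y;
    bool operator<(const Point& o) const {
        return x < o.x || (x == o.x && y < o.y);  // lexicographic order
    }
};

int main() {
    std::map<Point, std::string> names;
    names[{1, 2}] = "a";           // O(log N): descend the tree to insert
    names[{0, 5}] = "b";
    auto it = names.find({1, 2});  // O(log N): descend the tree to look up
    return it != names.end() ? 0 : 1;
}
```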
The advantage of the hash table is its expected constant-time complexity.
The advantages of the tree are:
- Totally scalable: it uses only what it needs, with no huge table and no need to guess the table size up front, although the hash table may require less "baggage" per element than a tree.
- No need to hash first, which for a good hash function can take longer than the comparisons would if the data set is not large.
One other issue with std::map is that it uses a single strict-ordering comparison function, whereas a "compare" function that returned -1, 0 or 1 would be a lot more efficient, particularly with the most commonly used key type, std::string, which already has this function implemented (it is char_traits::compare). (This inefficiency is based on the premise that to check that x == y, you check x < y and y < x, so you do two comparisons, though you would do this just once per lookup.)