Best practices on what should be key in a hashtable

问题

The best look-up structure is a HashTable. It provides constant access on average (linear in worst case).
This depends on the hash function. Ok.
My question is the following. Assuming a good implementation of a HashTable e.g. HashMap is there a best practice concerning the keys passed in the map?I mean it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
Example the size of the key? For example in a good hashmap (in the way described above) if we used String as keys, won't the "bottleneck" be in the string comparison for equals (trying to find the key)? So should the keys be kept small? Or are there objects that should not be used as keys? E.g. a URL? In such cases how can you choose what to use as a key?

回答1:

You should use whatever key you want to use to lookup things in the data structure, it's typically a domain-specific constraint. With that said, keep in mind that both hashCode() and equals() will be used in finding a key in the table.

hashCode() is used to find the position of the key, while equals() is used to determine if the key you are searching for is actually the key that we just found using hashCode().

For example, consider two keys a and b that have the same hash code in a table using separate chaining. Then a search for a would require testing if a.equals(key) for potentially both a and b in the table once we find the index of the list containing a and b from hashCode().

回答2:

The best performing key for an HashMap is probably an Integer, where hashCode() and equals() are implemented as:

public int hashCode() {
    return value;
}

public boolean equals(Object obj) {
    if (obj instanceof Integer) {
        return value == ((Integer)obj).intValue();
    }
    return false;
}

Said that, the purpose of an HashMap is to map some object (value) to some others (key). The fact that a hash function is used to address the (value) objects is to provide fast, constant-time access.

it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.

The recommendation is to Map objects to what you need: don't think what is faster; but think what is the best for your business logic to address the objects to retrieve.

The important requirement is that the key object must be immutable, because if you change the key object after storing it in the Map it may be not possible to retrieve the associated value later.

The key word in HashMap is Map. Your object should just map. If you sacrifice the mapping task optimizing the key, you are defeating the purpose of the Map - without probably achieving any performance boost.

I 100% agree with the first two comments in your question:

the major constraint is that it has to be the thing that you want to base the lookup on ;)
– Oli Charlesworth

The general rule is to use as the key whatever you need to look up with.
– Louis Wasserman

Remember the two rules for optimization:

Don't.
(for experts only) don't yet.

The third rule is: profile before to optimize.

回答3:

it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.

The key of the value should be final.

Most times a field of the object is used as key. If that field changes then the map cannot find it:

void foo(Employee e) {
  map.put(e.getId(), e);
  String newId = e.getId() + "new";
  e.setId(newId);
  Employee e2 = e.get(newId);
  // e != e2 !
}

So Employee should not have a setId() method at all, but that is difficult because when you are writing Employee you don't know what it will be keyed by.

回答4:

I digged up the implementation. I had an assumption that the effectiveness of the hashCode() method will be the key factor.

When I looked into the HashMap() and the Hashtable() implementation, I found that the implementation is quite similar (with one exception). Both are using and storing an internal hash code for all entries, so that's a good point that hashCode() is not so heavily influencing the performance.

Both are having a number of buckets, where the values are stored. It is important balance between the number of buckets (say n), and the average number of keys within a bucket (say k). The bucket is found in O(1) time, the content of the bucket is iterated in O(k) size, but the more bucket we have, the more memory will be allocated. Also, if many buckets are empty, it means that the hashCode() method for the key class does not the hashcode wide enough.

The algorithm works like this:

Take the `hashCode()` of the Key (and make a slight bijective transformation on it)
Find the appropriate bucket
Loop through the content of the bucket (which is some kind of LinkedList)
Make the comparison of the keys as follows:
1. Compare the hashcodes 
    (it is calculated in the first step, and stored for the entry)
2. Examine if key `==` the stored key (still no call) 
    (this step is missing from Hashtable)
3. Compare the keys by `key.equals(storedKey)`

To summarize:

hashCode() is called once per call (this is a must, you cannot do without it)
equals() is called if the hashCode is not so well spread, and two keys happen to have the same hashcode

The same algorithm is for get() and put() (because in put() case you can set the value for an existing key). So, the most important thing is how the hashCode() method was implemented. That is the most frequently called method.

Two strategies are: make it fast and make it effective (well-spread). The JDK developers made efforts to make it both, but it's not always possible to have them both.

Numeric types are good
Object (and non-overriden classes) are good (hashCode() is native), except that you cannot specify an own equals()
String is not good, iterates through the characters, but caches after that (see my comment below)
Any class with synchronized hashCode() is not good
Any class that has an iteration is not good
Classes that have hashcode cache are a bit better (depends on the usage)

Comment on the String: To make it fast, in the first versions of JDK the String hash code calculation was made for the first 32 characters only. But the hashcode it produced was not well spread, so they decided to take all the characters into the hashcode.

来源：https://stackoverflow.com/questions/14367081/best-practices-on-what-should-be-key-in-a-hashtable

标签

java

algorithm

data-structures

hashmap

hashtable