clarifying facts behind Java's implementation of HashSet/HashMap

问题

1. I understand the different hash map mechanisms and the ways in which key collisions are handled (either open addressing -linear/quadratic probing, chaining, extendable hashing, etc. Which one does HashSet/HashMap make use of?

2. I realise that a good HashMap relies on a good hash function. How does Java's HashSet/HashMap hash the objects? I know that there is a hash function but so far for strings I have not needed to implement this. What if I now want to Hash a Java Object that I create - do I need to implement the hash function? Or does Java have a built in way of creating a hash code?

I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.

回答1:

You could answer many of these questions yourself, by reading the source code for HashMap.

(Hint: you can usually find the source code for Java SE classes using Google; e.g. search for "java.util.HashMap source".)

I understand the different hash map mechanisms and the ways in which key collisions are handled (either open addressing -linear/quadratic probing, chaining, extendable hashing, etc. Which one does HashSet/HashMap make use of?

Chaining. See the source code. (Line 154 in the version I linked to).

How does Java's HashSet/HashMap hash the objects?

It doesn't. The object's hashCode method is called to do this. See the source code. (line 360).

If you look at the code you will see some interesting wrinkles:

The code (in the version I linked to) is hashing Strings using a special method. (It appears that this is to allow hashing of strings to be "tuned" at the platform level. I didn't dig into this ...)
The hashcode returned by the Object.hashCode() call is "scrambled" further to reduce the chance of collisions. (Read the comment!)

What if I now want to Hash a Java Object that I create - do I need to implement the hash function?

You can do that.

Whether you need to do this depends on how you have defined equals for the class. Specifically, Java's HashMap, HashSet and related classes place the following requirement on hashcode() and equals(Object):

If a.equals(b) then a.hashCode() == b.hashCode().
While a is in a HashSet or is a key in a HashMap, the value returned by a.hashCode() must not change.
if !a.equals(b), then the probability that a.hashCode() == b.hashCode() should be low, especially if a and b are probably hash keys for the application.

(The last requirement for performance reasons. If you you have a "poor" hash function that results in a high probability that different keys hash the same hashcode, you get lots of collisions. The hash chains will become unbalanced, and you won't get the average O(1) performance that is normally expected of hash table operations. In the worst case, performance will be O(N); i.e. equivalent to a linear search of a linked list.)

Or does Java have a built in way of creating a hash code?

Every class inherits a default hashCode() method from Object (unless this is overridden). It uses what is known as an "identity hash code"; i.e. a hash value that is based on the object's identity (its reference). This matches the default implementation of equals(Object) ... which simply uses == to compare references.

I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.

This is incorrect.

The default hashCode() method returns the "identity hashcode". This is typically based on the object's memory address at some point time, but it is NOT the object's memory address.

In particular, if an object is moved by the garbage collector, its "identity hashcode" is guaranteed not to change. Yes. That's right, it DOES NOT CHANGE ... even though the object was moved!

^{(How they implement this efficiently is rather clever. See https://stackoverflow.com/a/3796963/139985 for details.)}

The bottom line is that the default Object.hashCode() method satisfies all of the requirements that I listed above. It can be relied on.

回答2:

Question 1)

The Java HashMap implementation uses the chaining implementation to deal with collisions. Think of it as an array of linked lists.

Question 2

Object has a default implementation of equals and hashCode. equals is implemented as return this == other and hashcode is (to all intents and purposes) implemented as assigning a random identifier to each instance and using that as the hashCode.

As all classes in Java extends Object, they all inherit these implementations.

Some classes override these implementations by default. String, as you mentioned, is a very good example. Another is the classes in the collections API - so ArrayList implements these methods based on the elements it contains.

As far as implementing a good hashCode, this is a little bit of a dark art. Here's a pretty good summary of best practice.

Your final comment:

I know that the default implementation cannot be relied on as it bases the hash function on the memory address which is not constant.

This is not correct. The default implementation of hashCode is constant as that is part of the method's contract. From the Javadoc:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

来源：https://stackoverflow.com/questions/20849809/clarifying-facts-behind-javas-implementation-of-hashset-hashmap

标签

java

hash

hashmap

hashset