Is there a way to detect collision in Java Hash-map ? Can any one point out some situation\'s where lot of collision\'s can take place. Of-course if you override the hashcode fo
Simple example: hashing a Long
. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long
is documented to be:
(int)(this.longValue()^(this.longValue()>>>32))
i.e. imagine it's two int
values stuck next to each other, and XOR them.
So all of these will have a hashcode of 0:
0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)
etc
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
Obviously any hash where there are more than 232 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String
using just ASCII values, they're slightly harder to produce than the above.
The other two answers I see a good IMO but I just wanted to share that the best way to test how well your hashCode()
behaves in a HashMap
is to actually generate a big number of objects from your class, put them in the particular HashMap
implementation as the key and test CPU and memory load. 1 or 2 million entries are a good number to measure but you get best results if you test with your anticipated Map sizes.
I just looked at a class that I doubted its hashing function. So I decided to fill in a HashMap with random objects of that type and test number of collisions. I tested two hashCode() implementations of the class under investigation. So I wrote in groovy the class you see at the bottom extending openjdk implementation of HashMap to count number of collisions into the HashMap (see countCollidingEntries()
). Note that these are not real collisions of the whole hash but collisions in the array holding the entries. Array index is calculated as hash & (length-1)
which means that as short the size of this array is, the more collisions you get. And size of this array depends on initialCapacity
and loadFactor
of the HashMap
(it can increase when put()
more data).
At the end though I considered that looking at these numbers does little sense. The fact that HashMap is slower with bad hashCode()
method means that by just benchmarking insertion and retrieval of data from the Map you effectively know which hashCode()
implementation is better.
public class TestHashMap extends HashMap {
public TestHashMap(int size) {
super(size);
}
public TestHashMap() {
super();
}
public int countCollidingEntries() {
def fs = this.getClass().getSuperclass().getDeclaredFields();
def table;
def count =0 ;
for ( java.lang.reflect.Field field: fs ) {
if (field.getName() == "table") {
field.setAccessible(true);
table = field.get(super);
break;
}
}
for(Object e: table) {
if (e != null) {
while (e.next != null) {
count++
e = e.next;
}
}
}
return count;
}
}
I have created a project to benchmark these sort of things: http://code.google.com/p/hashingbench/ (For hashtables with chaining, open-addressing and bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
public int scramble(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following : scramble(k1.hashCode()) == scramble(k2.hashCode())
. This is always true if k1.hashCode() == k2.hashCode()
(otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary condition for a collision to occur.
Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode()))
- the compress
function takes an integer and maps it to {0, ..., N-1}
, where N
is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N
, or when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & N
(faster). ("compress" is the name Goodrich and Tamassia used to describe this step, in the Data Structures and Algorithms in Java). Thanks go to ILMTitan for spotting my sloppiness.
Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.
(For example, HashMap's smearing function was put into place because people were using HashMap with objects with the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in their low-order bits which were used to select a bucket - e.g. new Integer(1 * 1024)
, new Integer(2 * 1024)
*, etc. As you can see, the HashMap's smearing function tries its best to have all bits affect the low-order bits).
All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the usage as keys of values: 1.0, 2.0, 3.0, 4.0 ..., all of them having the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519