Hashing a set of integers in an order independent way

I want to hash a set of integers such that the order of the integers don't have an effect on the computed hash value. i.e. H([32224,12232,564423]) == H([564423,32224,12232]).

The number of unique sets will be in the range of a few millions. Speed is very important, but I need to know the upperbound on collisions with a chosen approach.

Wikipedia has a good section on hashing vectors, but I don't understand the math behind it to confidently implement them in code. I would appreciate if someone can explain the math involved with some code. Ideally, I would like the final hash to be of 32 bits. If it's of any use - I'll be implementing this in Java.

Update: I'm specifically looking to avoid sorting of the integers in the set, because of performance reasons (operating on lots of such sets).

A simple approach is to xor or add the hashes of the individual integers together. xor and add are commutative, so this satisfies order independence.

Thus:

int hc = 0;
for(int i = 0; i < n; i++) {
   hc += a[i];
}
return hc;

int hc = 0;
for(int i = 0; i < n; i++) {
   hc ^= a[i];
}
return hc;

because the hash code of an int is its value anyway.

In fact, this is exactly whatHashSet<Integer>.hashCode (uses add) will do. If your integers are already boxed, or you can handle boxing them, that's a built-in solution.

Assuming you need speed without the overhead of the *Set classes, then you could write H as follows:

/**
 * Hashes a set of integers.
 * 
 * @param list to hash
 * @return hash code
 */
public static int H(int list[]) {
    // XOR all the integers together.
    int hashcode = 0;
    for (int val : list) {
        hashcode ^= val;
    }
    return hashcode;
}

It is the same regardless of the order, and it is relatively efficient.

For example:

public static void main(String[] args) {
    System.out.println(Integer.toHexString(H(new int[]{0xabcd,0x1234,0x1111})));
    System.out.println(Integer.toHexString(H(new int[]{0x1234,0x1111,0xabcd})));
}

Displays:

a8e8
a8e8

This could be generalized to more than just ints by doing the following:

/**
 * Hashes a set of objects.
 * 
 * @param list to hash
 * @return hash code
 */
public static int H(Object list[]) {
    // XOR all the hashes together.
    int hashcode = 0;
    for (Object val : list) {
        hashcode ^= val.hashCode();
    }
    return hashcode;
}

The main program would then have to use arrays of Integer instead of the primitive int.

Adding the numbers should be almost as quick, and might give you a better distribution over the 32 bit range. If the elements of the set are already uniformly distributed over the range, then xor might be better.

However, with both methods, you can manufacture collisions easily with integers. For example, with the adding method;

{1000, 1001, 1002}
{0, 1, 3002}

Both of these arrays have the same H().

With the XOR method;

{0x1010, 0x0101}
{0x1111, 0x0000}

Both of these have the same H().

Similarly, the 0 element is problematic since lists will have the same hash with or without it. You can mitigate this by adding a constant value at each iteration. For example:

            ...
            hashcode += val.hashCode() + CONSTANT;
            ...

Or by including the number of elements as the initial hash code:

            ...
            // XOR all the hashes together.
            int hashcode = list.length;
            ...

You can put all the Integers in Java HashSet and use its hashCode.

On the other hand, java.util.Set does specify the following in documents:

Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode().

And Integer.hashCode() is then

a hash code value for this object, equal to the primitive int value represented by this Integer object.

Thus the hashCode for set of integers i1, i2, ... i_n in Java standard library is i1 + i2 + ... + i_n.

In case that the numbers are rather small, you can also multiply each element by a some suitably sized prime number. Knuth used 2654435761 which is too big for an java int, but you can take its 2-complement, -1640531527. Thus take C = -1640531527, and then your code is C*i1 + C*i2 + ... C*i_n.

private static final int C = -1640531527;

public static int calculateHash(int[] set) {
    int code = 0;
    for (int e: set) {
        code += C * e;
    }

    return code;
}

However there is 1 obvious flaw in the thinking. To make any use of the hashCode, you need to be able to prove that 2 sets are indeed equal, thus in any case the easiest way to prove is to sort the elements. Of course if there are substantially less than millions of sets then there are not that many collisions either.

I'd prefer summation rather then xoring, because 1) sum is used in Set's hashCode() implementation, 2) sum as approach to array hashing is recommended in Effective Java 3) it's less collision-prone. I suggest you to look at openjdk's AbstractSet implementation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/AbstractSet.java?av=f

120    public int hashCode() {
121        int h = 0;
122        Iterator<E> i = iterator();
123        while (i.hasNext()) {
124            E obj = i.next();
125            if (obj != null)
126                h += obj.hashCode();
127        }
128        return h;
129    }

I'd also recommend making h long, and return (int) ((h & 0xffffffffL) & h >>> 32))

This is by no means trivial programming, but you could take inspiration from DES algorithm's S-boxes: by this you can achieve a good dispersion function which maps similar integers to very dissimilar ones. Then XOR-ing these dissimilar integers should no longer present a threat due to collisions.

来源：https://stackoverflow.com/questions/18021643/hashing-a-set-of-integers-in-an-order-independent-way

标签

java

hash

integer-hashing