I'm looking for a function that maps a multi-set of integers to an integer, hopefully with some kind of guarantee like pairwise independence.
Ideally, memory usage would be constant.
Min-hashing should work here: apply a permutation, maintain a small multiset of the n minimal permuted elements, and pick the biggest of them.
Elaborating: this is a simple way to work in O(1) time and space (for a fixed n). You need something like a priority queue, without making the link to the initial values too obvious, so you order your priority queue according to some elaborate key, which is equivalent to running a priority queue on a permutation of the normal sort order. Make the queue keep track of multiplicity so that the selected elements also form a multiset.
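A minimal sketch of that scheme, with a splitmix64-style finalizer standing in for the random permutation and a sketch size of n = 4 (both are my assumptions, not choices from the answer):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>

struct MinHash {
    static constexpr std::size_t n = 4;       // sketch size (arbitrary assumption)
    std::multiset<std::uint64_t> smallest;    // the n minimal permuted values

    // splitmix64 finalizer: a bijective mixer used here in place of a permutation
    static std::uint64_t permute(std::uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    void add(std::uint64_t x) {
        smallest.insert(permute(x));
        if (smallest.size() > n)
            smallest.erase(std::prev(smallest.end()));  // drop the current maximum
    }

    // "pick the biggest" of the n minimal permuted elements
    std::uint64_t hash() const { return smallest.empty() ? 0 : *smallest.rbegin(); }
};
```

Since the multiset keeps the n smallest permuted values regardless of insertion order, two multisets with the same elements produce the same hash.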
That said, I'm not sure this disperses well enough (and running multiple permutations might become costly), so maybe build on Bradley's answer instead. Here is a tweak so that repeated elements don't cancel out:
xor(int_hash(x_n, multiplicity_n) foreach n)
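One way to read that formula, with a splitmix64-style finalizer as a hypothetical stand-in for int_hash (the answer does not name a specific function): hash each distinct value together with its multiplicity, then XOR the results. Repeated elements change the multiplicity argument instead of cancelling out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Bijective mixer standing in for int_hash (an assumption, not from the answer).
std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// XOR the hash of each (value, multiplicity) pair. Because the multiplicity is
// folded into the per-element hash, {1, 1} and {1} hash differently.
std::uint64_t multiset_hash(const std::map<int, std::uint64_t>& counts) {
    std::uint64_t h = 0;
    for (const auto& [value, multiplicity] : counts)
        h ^= mix(mix(static_cast<std::uint64_t>(value)) ^ multiplicity);
    return h;
}
```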
Knuth touches on this in TAoCP, and this is a near duplicate of "What integer hash function are good that accepts an integer hash key?".
For your situation, turning your multi-set into a single integer and then applying the hash described in the linked post may be what you want. Turning a collection into a number is trivial; a concatenation of the digits will do.
For more info on Knuth's method, search for 'Knuth's Multiplicative Method'
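For reference, the common 32-bit form of that method multiplies the key by an odd constant close to 2^32 divided by the golden ratio and keeps the top bits; a minimal sketch:

```cpp
#include <cassert>
#include <cstdint>

// Knuth's multiplicative method, 32-bit form: 2654435761 is an odd constant
// near 2^32 / phi. The multiplication wraps modulo 2^32; the top `bits` bits
// of the product are the hash.
std::uint32_t knuth_hash(std::uint32_t key, unsigned bits) {
    return (key * 2654435761u) >> (32 - bits);
}
```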
-tjw
I once asked a similar question, "Good hash function for permutations?", and got a hash that worked very well for my use case; I get very few collisions in my working code. It might work well for you too. Calculate something like this:
unsigned int hash = 1;  // initialize the hash with 1

void add(int x) {
    hash *= 1779033703 + 2 * x;
}
So whenever you add a number x, update your hash code with the above formula. The order of the values does not matter; you will always get the same hash value.
When you want to merge two sets, just multiply the hash values.
The only thing I am not sure about is whether it is possible to remove a value in O(1).
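Removal in O(1) does turn out to be possible here: the factor 1779033703 + 2*x is always odd, hence invertible modulo 2^32, and the inverse can be computed with a few Newton iterations. A sketch building on the answer above (the Newton-iteration trick is standard; it is not something the original answer proposed):

```cpp
#include <cassert>
#include <cstdint>

// Newton's iteration for the inverse of an odd number modulo 2^32:
// x = a is correct to 3 bits (a*a == 1 mod 8 for odd a), and each step
// doubles the number of correct bits, so 5 steps are more than enough.
std::uint32_t inverse_mod_2_32(std::uint32_t a) {
    std::uint32_t x = a;
    for (int i = 0; i < 5; ++i)
        x *= 2u - a * x;
    return x;
}

struct MultisetHash {
    std::uint32_t hash = 1;
    void add(int v) {
        hash *= 1779033703u + 2u * static_cast<std::uint32_t>(v);
    }
    // Multiplying by the modular inverse undoes add(v) exactly.
    void remove(int v) {
        hash *= inverse_mod_2_32(1779033703u + 2u * static_cast<std::uint32_t>(v));
    }
};
```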
I agree with Dzmitry on using the arithmetic SUM of hashes, but I'd recommend a hash function with good output distribution for the input integers instead of just reversing their bits. Reversing bits doesn't improve output distribution; it can even worsen it, since the probability that the high-order bits will be lost due to sum overflow is much higher than the probability that the low-order bits will be lost. Here is an example of a fast hash function with good output distribution: http://burtleburtle.net/bob/c/lookup3.c . Read also the paper describing how hash functions should be constructed: http://burtleburtle.net/bob/hash/evahash.html .
Using the SUM of hash values for each element in the set satisfies the requirements in the question:
SUM and SUB are safe operations in the face of integer overflow, since they are reversible in modular arithmetic, where the modulus is 2^32 or 2^64 for integers in Java.
If we need O(1) insert/delete, the plain SUM will work (and that's how Sets are implemented in Java), though it is not well distributed over sets of small integers.
In case our set is not uniformly distributed (as it usually isn't), we need a mapping N->f(N) such that f(N) is uniformly distributed for the expected data sample. Usually a data sample contains many more close-to-zero numbers than close-to-maximum numbers, and in this case a reverse-bits hash would distribute them uniformly. For example, 00001011 becomes 11010000. Then just SUM all the reversed set elements.
Example in Scala:
def hash(v: Int): Int = {
  var h = v & 1
  for (i <- 1 to 31) {
    h <<= 1
    h |= (v >>> i) & 1
  }
  h
}

def hash(a: Set[Int]): Int = {
  var h = 0
  for (e <- a) {
    h += hash(e)
  }
  h
}
But the hash of our multi-set will still not be uniform, though it is much better than a plain SUM.
I asked this same question on cstheory.stackexchange.com and got a good answer:
https://cstheory.stackexchange.com/questions/3390/is-there-a-hash-function-for-a-collection-i-e-multi-set-of-integers-that-has