I want to have a dictionary that assigns a value to a set of integers.
For example, the key is [1 2 3] and the value will have certain
You could sort the numbers, sample them at predetermined indices, and pad the rest with zeros if the current set has fewer numbers. Or you could XOR them, or whatever.
Basically all of the approaches here are instantiations of the same template. Map x1, …, xn to f(x1) op … op f(xn), where op is a commutative associative operation on some set X, and f is a map from items to X. This template has been used a couple of times in ways that are provably good.
Choose a random large prime p and a random residue b in [1, p - 1]. Let f(x) = b^x mod p and let op be addition mod p. We essentially interpret a set as a polynomial (the set {x1, …, xn} becomes X^x1 + … + X^xn, evaluated at X = b) and use the Schwartz–Zippel lemma to bound the probability of a collision (= the probability that a nonzero polynomial, namely the difference of the two sets' polynomials, has b as a root mod p).
Let op be XOR and let f be a randomly chosen table. This is Zobrist hashing and minimizes in expectation the number of collisions by straightforward linear-algebraic arguments.
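A minimal sketch of the Zobrist variant, assuming the items are integers in a known range [0, universeSize). The class name `ZobristSetHasher` and the parameters `universeSize` and `seed` are hypothetical, introduced only for illustration, and `System.Random` is used as a convenient (not high-quality) source of table entries.

```csharp
using System;
using System.Collections.Generic;

public sealed class ZobristSetHasher
{
    // table[x] plays the role of f(x): one random 64-bit word per possible item.
    private readonly ulong[] table;

    // universeSize and seed are illustrative assumptions, not part of the original answer.
    public ZobristSetHasher(int universeSize, int seed)
    {
        var rng = new Random(seed);
        var bytes = new byte[8];
        table = new ulong[universeSize];
        for (int i = 0; i < universeSize; i++)
        {
            rng.NextBytes(bytes);
            table[i] = BitConverter.ToUInt64(bytes, 0);
        }
    }

    // op is XOR, which is commutative and associative, so iteration order does not matter.
    public ulong Hash(IEnumerable<int> items)
    {
        ulong h = 0;
        foreach (int x in items)
            h ^= table[x];
        return h;
    }
}
```

Because XOR is its own inverse, an item can also be added to or removed from a set by XORing its table entry into an existing hash. The same property means duplicates cancel out, so this fits sets rather than multisets.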
Modular exponentiation is slow, so don't use it. As for Zobrist hashing, with 3 million items, the table f probably won't fit into L2 cache, though it does set an upper bound of one main-memory access per lookup.
I would instead take Zobrist hashing as a departure point and look for a cheap function f that behaves like a random function. This is essentially the job description of a non-cryptographic pseudorandom generator – I would try computing f by seeding a fast PRG with x and generating one value.
EDIT: given that the sets all have the same sums, don't choose f to be a degree 1 polynomial (e.g., the step function of a linear congruential generator).
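One way to try this, as a sketch rather than a recommendation: replace the table with a cheap nonlinear mixing function and XOR the results. The mixer below is the output step of SplitMix64, which is my choice for illustration and not something the answer prescribes; the names `SetHash`, `Mix`, and `XorHash` are hypothetical.

```csharp
using System.Collections.Generic;

public static class SetHash
{
    // Output step of the SplitMix64 generator, standing in for
    // "a cheap function f that behaves like a random function".
    // It is nonlinear, so it is not a degree-1 polynomial in x.
    private static ulong Mix(ulong x)
    {
        unchecked
        {
            x += 0x9E3779B97F4A7C15UL;
            x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9UL;
            x = (x ^ (x >> 27)) * 0x94D049BB133111EBUL;
            return x ^ (x >> 31);
        }
    }

    // XOR of f(x) over all items; the result does not depend on iteration order.
    public static ulong XorHash(IEnumerable<int> items)
    {
        ulong h = 0;
        foreach (int x in items)
            h ^= Mix(unchecked((ulong)x));
        return h;
    }
}
```

No table is needed, so there is no memory access per item beyond reading the item itself. As with any XOR-based combination, repeated items cancel, so this assumes the input really is a set.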
One possibility: sort the items in the list, then hash that.
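A rough sketch of that approach, assuming the items fit in memory and an O(n log n) sort is acceptable. The names `SortedSetHash` and `SortThenHash` are hypothetical, and the multiply-by-31 combiner is just a common simple sequence hash chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedSetHash
{
    public static int SortThenHash(IEnumerable<int> items)
    {
        // Copy first so the caller's collection is not reordered.
        int[] sorted = items.ToArray();
        Array.Sort(sorted);

        // Any ordinary sequence hash works once the order is canonical.
        int hash = 17;
        unchecked
        {
            foreach (int x in sorted)
                hash = hash * 31 + x;
        }
        return hash;
    }
}
```

Unlike the XOR and sum approaches, this cannot be updated incrementally when an item is added or removed; the whole set has to be sorted and rehashed.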
I think your squaring idea is going in the right direction, but squaring is a poor choice of function. I'd try something more like the functions used in PRNGs, or just multiplication by a large prime, followed by an XOR of all the resulting values.
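Why not something like this?

    // Requires: using System.Collections.Generic;
    public int GetOrderIndependantHashCode(IEnumerable<int> source)
    {
        int hash = 0;
        foreach (int x in source)
        {
            // unchecked: wrap on overflow; LINQ's Sum() would throw OverflowException instead.
            unchecked { hash += x * x + x * x * x + x * x * x * x; }
        }
        return hash & 0x7FFFFF; // keep the low 23 bits, so the result is non-negative
    }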
I think what is mentioned in this paper would definitely help:
http://people.csail.mit.edu/devadas/pubs/mhashes.pdf
Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking
Abstract: We introduce a new cryptographic tool: multiset hash functions. Unlike standard hash functions which take strings as input, multiset hash functions operate on multisets (or sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are incremental in that, when new members are added to the multiset, the hash can be updated in time proportional to the change. The functions may be multiset-collision resistant in that it is difficult to find two multisets which produce the same hash, or just set-collision resistant in that it is difficult to find a set and a multiset which produce the same hash.