Hash function on list independant of order of items in it

泪湿孤枕 提交于 2019-12-01 19:08:01

Basically all of the approaches here are instantiations of the same template. Map x1, …, xn to f(x1) op … op f(xn), where op is a commutative associative operation on some set X, and f is a map from items to X. This template has been used a couple of times in ways that are provably good.

  • Choose a random large prime p and a random residue b in [1, p - 1]. Let f(x) = bx mod p and let op be addition. We essentially interpret a set as a polynomial and use the Schwartz–Zippel lemma to bound the probability of a collision (= the probability that a nonzero polynomial has b as a root mod p).

  • Let op be XOR and let f be a randomly chosen table. This is Zobrist hashing and minimizes in expectation the number of collisions by straightforward linear-algebraic arguments.

Modular exponentiation is slow, so don't use it. As for Zobrist hashing, with 3 million items, the table f probably won't fit into L2, though it does set an upper bound of one main-memory access.

I would instead take Zobrist hashing as a departure point and look for a cheap function f that behaves like a random function. This is essentially the job description of a non-cryptographic pseudorandom generator – I would try computing f by seeding a fast PRG with x and generating one value.

EDIT: given that the sets all have the same sums, don't choose f to be a degree 1 polynomial (e.g., the step function of a linear congruential generator).

Use a HashSet<T> and HashSet<T>.CreateSetComparer(), which returns an IEqualityComparer<HashSet<T>>.

I think what is mentioned in this paper would definitely help:

http://people.csail.mit.edu/devadas/pubs/mhashes.pdf

Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking

Abstract: We introduce a new cryptographic tool: multiset hash functions. Unlike standard hash functions which take strings as input, multiset hash functions operate on multisets (or sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are incremental in that, when new members are added to the multiset, the hash can be updated in time proportional to the change. The functions may be multiset-collision resistant in that it is difficult to find two multisets which produce the same hash, or just set-collision resistant in that it is difficult to find a set and a multiset which produce the same hash.

I think your squaring idea is going in the right direction, but a poor choice of function. I'd try something more like the PRNG functions or just multiplication by a large prime, followed by XOR of all the resulting values.

One possibility: sort the items in the list, then hash that.

You could sort the numbers and select a sample from predetermined indices and leave rest as zero if current value has less numbers. Or you could xor them, or whatever.

Why not something like

public int GetOrderIndependantHashCode(IEnumerable<int> source)
{
    return (source.Select(x => x*x).Sum()
            + source.Select(x => x*x*x).Sum()
            + source.Select(x => x*x*x*x).Sum()) & 0x7FFFFF;
}

If the range of the values in key happens to be limited to low-ish positive integers, you could map each one to a prime number using a simple lookup, then multiply them together to arrive at the value.

Using the example in the question:

[1, 2, 3] maps to 2 x 3 x 5 = 30
[3, 2, 1] maps to 5 x 3 x 2 = 30

Create your own type that implements IEnumerable<T>.

Override GetHashCode. In it, sort your collection, call and return ToArray().GetHashCode().

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!