What is a good hash function for a collection (i.e., multi-set) of integers?

不知归路 2021-02-05 06:03

I'm looking for a function that maps a multi-set of integers to an integer, hopefully with some kind of guarantee like pairwise independence.

Ideally, memory usage woul

6 answers
  • 2021-02-05 06:29

    Min-hashing should work here. Apply a permutation to the elements, maintain a small multiset of the n minimal elements, and pick the biggest of those.

    Elaborating: this is a simple way to work in O(1) time and space per update (for a fixed n). You need something like a priority queue, without making the link to the initial values too obvious. So you order the priority queue by some elaborate key, which is equivalent to running a priority queue on a permutation of the normal sort order. Make the queue keep track of multiplicity so that the selected elements also form a multiset.
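    One possible reading of this description, as a rough C++ sketch. The "permutation" is approximated here by the splitmix64 mixing step, which is a bijection on 64-bit values; that choice is an assumption, not something the answer specifies, and `permute` and `min_hash` are hypothetical names. The sketch keeps the n smallest permuted keys in a max-heap and returns the largest of them. Duplicates are kept, so multiplicity affects the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Assumed stand-in for "apply a permutation": the splitmix64 mixing step,
// which maps 64-bit values one-to-one onto 64-bit values.
inline std::uint64_t permute(std::uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Returns the largest of the n smallest permuted keys (hypothetical helper).
// Returns 0 for an empty input or n == 0.
inline std::uint64_t min_hash(const std::vector<std::int32_t>& items, std::size_t n) {
    if (n == 0) return 0;
    std::priority_queue<std::uint64_t> smallest;  // max-heap: top() is the largest kept key
    for (std::int32_t x : items) {
        const std::uint64_t key = permute(static_cast<std::uint32_t>(x));
        if (smallest.size() < n) {
            smallest.push(key);
        } else if (key < smallest.top()) {
            smallest.pop();       // drop the current largest ...
            smallest.push(key);   // ... and keep the smaller key instead
        }
    }
    return smallest.empty() ? 0 : smallest.top();
}
```

    The result does not depend on insertion order, and inserts can be processed one at a time. Deleting an element is not supported by this sketch.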

    That said, I'm not sure this disperses well enough (and running multiple permutations might become costly), so maybe build on Bradley's answer instead. Here is a tweak so that repeated elements don't cancel out:

    xor(int_hash(x_n, multiplicity_n) foreach n)
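    A minimal C++ sketch of the formula above, under some assumptions: `int_hash` is not defined in the answer, so here it packs the value and its multiplicity into one 64-bit word and runs it through the splitmix64 mixing step (a swapped-in mixer, chosen only for illustration). The names `mix64` and `xor_multiset_hash` are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Assumed mixer: the splitmix64 mixing step. Any well-mixing 64-bit hash would do.
inline std::uint64_t mix64(std::uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Hash of one (value, multiplicity) pair: value in the high 32 bits, count in the low 32.
inline std::uint64_t int_hash(std::int32_t x, std::uint32_t multiplicity) {
    const std::uint64_t packed =
        (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32) | multiplicity;
    return mix64(packed);
}

// XOR of int_hash(x_n, multiplicity_n) over the distinct elements x_n.
inline std::uint64_t xor_multiset_hash(const std::vector<std::int32_t>& items) {
    std::map<std::int32_t, std::uint32_t> counts;  // distinct value -> multiplicity
    for (std::int32_t x : items) {
        ++counts[x];
    }
    std::uint64_t h = 0;
    for (const auto& kv : counts) {
        h ^= int_hash(kv.first, kv.second);
    }
    return h;
}
```

    Because the multiplicity is part of the hashed pair, {2, 2} no longer cancels to the hash of the empty multiset. The price is that the multiplicities must be known, so this version needs memory proportional to the number of distinct elements unless the counts are already maintained elsewhere.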
    
  • 2021-02-05 06:33

    Knuth touches on this in TAoCP, and this is a near duplicate of "What integer hash function are good that accepts an integer hash key?".

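    For your situation, turning your multi-set into a single integer and then performing the hash described in the linked post may be what you want to do. Turning a collection into a number is trivial; a concatenation of the digits will do, provided the elements are first put into a canonical order (e.g., sorted) so that the same multi-set always yields the same number.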

    For more info on Knuth's method, search for "Knuth's multiplicative method".
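    For reference, a small C++ sketch of the multiplicative step for a single 32-bit key. The multiplier 2654435761 is the constant commonly quoted for this method (roughly 2^32 divided by the golden ratio); the function name `knuth_hash` and the `bits` parameter are illustrative assumptions.

```cpp
#include <cstdint>

// Multiplicative hashing: multiply by a fixed odd constant modulo 2^32 and
// keep the top `bits` bits of the product. `bits` must be in the range 1..32.
inline std::uint32_t knuth_hash(std::uint32_t key, unsigned bits) {
    return (key * 2654435761u) >> (32u - bits);
}
```

    For example, `knuth_hash(key, 10)` yields a bucket index in the range 0..1023. This only hashes one integer; how the multi-set is reduced to that integer is a separate step.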

    -tjw

  • 2021-02-05 06:36

    I once asked a similar question, "Good hash function for permutations?", and got a hash that worked very well for my use case; I see very few collisions in my working code. It might work well for you too. Calculate something like this:

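    struct MultisetHash {
      // The hash starts at 1, the multiplicative identity.
      unsigned int hash = 1;
      // Fold x in. The factor is computed in unsigned arithmetic so that
      // overflow wraps around instead of being undefined behavior.
      void add(int x) {
        this->hash *= (1779033703u + 2u * static_cast<unsigned int>(x));
      }
    };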
    

    So whenever you add a number x, update your hash code with the formula above. The order of the values does not matter; you will always get the same hash value.

    When you want to merge two sets, just multiply their hash values.

    The only thing I am not sure about is whether a value can be removed in O(1).
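    On removal: the factor 1779033703 + 2*x is always odd, and every odd number has a multiplicative inverse modulo 2^32, so an element can be removed by multiplying with the inverse of its factor. The following C++ sketch assumes a 32-bit hash; `inverse_mod_2_32` and `InvertibleProductHash` are hypothetical names, and the inverse is computed with a fixed number of Newton steps.

```cpp
#include <cstdint>

// Inverse of an odd number modulo 2^32 via Newton's iteration.
// For odd a, a * a == 1 (mod 8), so a itself is correct to 3 bits;
// each step doubles the number of correct bits: 3 -> 6 -> 12 -> 24 -> 48.
inline std::uint32_t inverse_mod_2_32(std::uint32_t a) {
    std::uint32_t inv = a;
    for (int i = 0; i < 4; ++i) {
        inv *= 2u - a * inv;
    }
    return inv;
}

struct InvertibleProductHash {
    std::uint32_t hash = 1;

    // Same factor as in the formula above; always odd, hence invertible.
    static std::uint32_t factor(std::int32_t x) {
        return 1779033703u + 2u * static_cast<std::uint32_t>(x);
    }

    void add(std::int32_t x) { hash *= factor(x); }

    // Undoes one earlier add(x). The caller must make sure x is actually
    // present; removing an absent value silently yields a meaningless hash.
    void remove(std::int32_t x) { hash *= inverse_mod_2_32(factor(x)); }
};
```

    Each removal costs a constant number of multiplications, so it is O(1) in the same sense as `add`.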

  • 2021-02-05 06:46

    I agree with Dzmitry on using the arithmetic SUM of hashes, but I'd recommend using a hash function with good output distribution for input integers instead of just reversing the bits of the integer. Reversing bits doesn't improve output distribution. It can even worsen it, since the probability that the high-order bits will be lost to sum overflow is much higher than the probability that the low-order bits will be lost. Here is an example of a fast hash function with good output distribution: http://burtleburtle.net/bob/c/lookup3.c . Also read the paper describing how hash functions should be constructed: http://burtleburtle.net/bob/hash/evahash.html .

    Using the SUM of the hash values of each element in the set satisfies the requirements in the question:

    • Memory usage is constant. We only need to store one ordinary integer holding the hash value per set. This integer is used for O(1) updates of the hash when adding or removing elements.
    • Adding a new element requires only adding the element's hash value to the existing hash value, i.e. the operation is O(1).
    • Removing an existing element requires only subtracting the element's hash value from the existing hash value, i.e. the operation is O(1).
    • The hash will generally differ for sets that differ only by pairs of identical elements (with XOR, such pairs would cancel out).

    SUM and SUB are safe operations in the face of integer overflow, since they are reversible in modular arithmetic, where the modulus is 2^32 or 2^64 for integers in Java (int and long, respectively).
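    The points above can be sketched in C++ as follows. This is only an illustration: lookup3 is replaced here by the much shorter splitmix64 mixing step (an assumption made to keep the example self-contained), and `element_hash` and `SumMultisetHash` are hypothetical names. Unsigned arithmetic wraps modulo 2^64, which is what makes add and remove exact inverses.

```cpp
#include <cstdint>

// Assumed per-element hash: the splitmix64 mixing step, standing in for a
// full hash function such as lookup3.
inline std::uint64_t element_hash(std::int32_t x) {
    std::uint64_t z = static_cast<std::uint32_t>(x);
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

struct SumMultisetHash {
    std::uint64_t hash = 0;  // hash of the empty multi-set

    // O(1): add the element's hash, wrapping modulo 2^64.
    void add(std::int32_t x) { hash += element_hash(x); }

    // O(1): subtract the element's hash; exact inverse of add(x).
    void remove(std::int32_t x) { hash -= element_hash(x); }

    // Hash of the union (with multiplicities) of two multi-sets.
    void merge(const SumMultisetHash& other) { hash += other.hash; }
};
```

    As with any constant-size summary, removing an element that was never added cannot be detected; the caller has to guarantee it is present.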

  • 2021-02-05 06:50

    Reverse-bits.

    For example, 00001011 becomes 11010000. Then just SUM all the reversed set elements.


    If we need O(1) insert/delete, the plain SUM will work (that is how Java's AbstractSet computes hashCode for Sets), though it is not well distributed over sets of small integers.

    If the elements are not uniformly distributed (as is usually the case), we need a mapping N -> f(N) such that f(N) is uniformly distributed for the expected data sample. Usually, a data sample contains many more close-to-zero numbers than close-to-maximum numbers. In this case, the reverse-bits hash would distribute them uniformly.

    Example in Scala:

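    // Reverse the 32 bits of v (same result as java.lang.Integer.reverse(v)).
    def hash(v: Int): Int = {
      var h = v & 1
      for (i <- 1 to 31) {
        h <<= 1
        h |= ((v >>> i) & 1)
      }
      h
    }
    // Sum the reversed elements. Takes a Seq rather than a Set so that
    // duplicates in the multi-set are counted; Int addition wraps on overflow.
    def hash(a: Seq[Int]): Int = {
      var h = 0
      for (e <- a) {
        h += hash(e)
      }
      h
    }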
    

    But the hash of our multi-set will still not be uniform, though it is much better than a simple SUM.

  • 2021-02-05 06:52

    I asked this same question on cstheory.stackexchange.com and got a good answer:

    https://cstheory.stackexchange.com/questions/3390/is-there-a-hash-function-for-a-collection-i-e-multi-set-of-integers-that-has
