Obtaining a k-wise independent hash function

Submitted by 百般思念 on 2019-12-05 01:23:16

Question


I need to use a hash function that belongs to a family of k-wise independent hash functions. Are there any pointers to a library or toolkit in C, C++ or Python that can generate a set of k-wise independent hash functions from which I can pick a function?

Background: I am trying to implement this algorithm here: http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/knw10b.pdf for the Distinct Elements problem.

I have looked at this thread: Generating k pairwise independent hash functions, which mentions using Murmur hash to generate a pairwise independent hash function. I was wondering if there is anything similar for k-wise independent hash functions. If there is none available, would it be possible for me to construct such a set of k-wise independent hash functions?

Thanks in advance.


Answer 1:


This is one of many solutions, but you could use, for example, the following open-source hash algorithm: http://code.google.com/p/xxhash/

Then, to generate different hashes, you just have to provide different seeds.

The main function is declared as: unsigned int XXH32 (const void* input, int len, unsigned int seed);

So if you need k different hash, just re-use the same algorithm k times, with k different seeds.
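A rough illustration of the "same algorithm, k seeds" idea. xxHash is not part of the Python standard library, so this sketch swaps in `hashlib.blake2b` and uses its `salt` parameter in place of the XXH32 seed; the helper name `seeded_hash` is made up for this example.

```python
import hashlib

def seeded_hash(data: bytes, seed: int) -> int:
    """Hypothetical helper: 32-bit hash of `data`, parameterised by an integer seed."""
    # blake2b accepts a salt of up to 16 bytes; the seed is encoded into 8 of them.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "little")

# k differently seeded hash values of the same key.
k = 4
hashes = [seeded_hash(b"some key", seed) for seed in range(k)]
```

This produces k seeded variants of one fixed algorithm, which is a different thing from drawing a function out of a k-wise independent family.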




Answer 2:


The simplest k-wise independent hash function (mapping a positive integer x < p to one of m buckets) is just

h(x) = (a_0 + a_1 x + a_2 x^2 + ... + a_{k-1} x^(k-1)) % p % m

where p is some big random prime (2^61 - 1 will work) and the a_i are some random positive integers less than p, a_0 > 0.

2-wise independent hash: h(x) = (ax + b) % p % m

Again, p is prime, a > 0, and a, b < p (i.e. a can't be zero, but b can be if the random choice happens to produce it).

These formulas define families of hash functions. They work (in theory) if you select a hash function randomly from the corresponding family (i.e. if you generate random a's and b) each time you run your algorithm.
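A minimal Python sketch of picking one member of the polynomial family, assuming integer keys with 0 <= x < p and the prime 2^61 - 1 mentioned above. The function name `make_kwise_hash` is hypothetical, and the sketch draws every coefficient uniformly from [0, p), which is the usual textbook choice; narrow the range if you want to follow the constraints stated above literally.

```python
import secrets

P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1

def make_kwise_hash(k: int, m: int, p: int = P):
    """Draw one function at random from the degree-(k-1) polynomial family.

    The returned function maps an integer key 0 <= x < p to a bucket in [0, m).
    """
    # k coefficients, each drawn uniformly from [0, p) via the OS entropy source.
    coeffs = [secrets.randbelow(p) for _ in range(k)]

    def h(x: int) -> int:
        acc = 0
        # Horner's rule, starting from the highest-degree coefficient a_{k-1}.
        for a in reversed(coeffs):
            acc = (acc * x + a) % p
        return acc % m

    return h

# Pick the function once, then apply it to every key in the stream.
h = make_kwise_hash(k=4, m=1024)
buckets = [h(x) for x in (17, 42, 17, 99)]
```

The coefficients are fixed when `make_kwise_hash` returns, so repeated keys always land in the same bucket within one run, while a new run draws a new function.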




Answer 3:


There is no such thing as "a k-wise independent hash function". However, there are k-wise independent families of functions.

As a reminder, a family of functions is k-wise independent if, when h is picked uniformly at random from the family and distinct keys x_1 .. x_k and values y_1 .. y_k are picked arbitrarily, the probability that "for all i, h(x_i) = y_i" is Y^-k, where Y is the size of the co-domain from which the y_i were selected.
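The definition can be checked by brute force on a family small enough to enumerate. The sketch below assumes the polynomial family over Z_p with the toy values p = 5 and k = 2 (both chosen only to keep the enumeration tiny) and requires the keys to be distinct.

```python
from itertools import product

p = 5  # tiny prime so the whole family can be enumerated
k = 2

# The family: every polynomial a_0 + a_1 x over Z_p, i.e. p^k functions.
family = list(product(range(p), repeat=k))

def h(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x, modulo p."""
    return sum(a * x**i for i, a in enumerate(coeffs)) % p

def is_kwise_independent() -> bool:
    for xs in product(range(p), repeat=k):
        if len(set(xs)) < k:  # the keys must be distinct
            continue
        for ys in product(range(p), repeat=k):
            hits = sum(
                all(h(c, x) == y for x, y in zip(xs, ys)) for c in family
            )
            # The probability hits / |family| must equal p^-k.
            if hits * p**k != len(family):
                return False
    return True
```

Here the family has exactly p^k members, so the condition amounts to exactly one function passing through any k prescribed points, which is polynomial interpolation.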

There are a few families of functions that are known to be k-wise independent for small k like 2, 3, 4, and 5. For arbitrary k, you will likely need to use polynomial hashing. Note that there are two variants of this, one of which is not even 2-independent, so be careful when implementing it.

The polynomial hash family can hash from a field F to itself using k constants a_0 through a_{k-1} and is defined by the sum of a_i x^i, where x is the key you are hashing. Field arithmetic can be implemented on your computer by letting F be the integers modulo a prime p. That's probably not convenient, as it is often better to have the domain and range be uint32_t or the like. In that case you can use the field F_{2^32}: multiply as polynomials over Z_2, then reduce modulo an irreducible polynomial of degree 32. Otherwise, you can operate in Z_p where p is larger than 2^32 (or 2^64) and take the result of the polynomial mod 2^32, I think. That will only be almost k-wise independent, but sometimes that's sufficient for the analysis to go through. It will not be easy to re-analyze the KNW algorithm to change its hash families.
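A sketch of the binary-field variant, shrunk to F_{2^8} so it is easy to check by hand. It assumes the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), chosen only because it is well known. For uint32_t keys one would pass bits=32 together with an irreducible polynomial of degree 32, which is not supplied here. The function names are hypothetical.

```python
def gf_mul(a: int, b: int, modulus: int = 0x11B, bits: int = 8) -> int:
    """Multiply a and b in GF(2^bits): carry-less product reduced by `modulus`."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> bits:        # degree reached `bits`: XOR in the modulus to reduce
            a ^= modulus
    return result

def poly_hash_gf(coeffs, x: int, modulus: int = 0x11B, bits: int = 8) -> int:
    """Evaluate sum(coeffs[i] * x^i) in the field using Horner's rule."""
    acc = 0
    for a in reversed(coeffs):
        acc = gf_mul(acc, x, modulus, bits) ^ a
    return acc
```

Because domain and range are both the full set of 8-bit values, no final "mod m" step is needed, which is the convenience the paragraph above is after.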

To generate a member of a k-wise independent family, use your favorite random number generator to pick the function randomly. In the case of polynomial hashing, that means picking the a_i referenced above. /dev/random should suffice.

The paper you point to, "An Optimal Algorithm for the Distinct Elements Problem", is a nice one and has been cited many times. However, it is not easy to implement, and it may be slower or even take more space than HyperLogLog, due to hidden constants in the big-O notations. A number of papers have noted the complexity of this algorithm and even called it infeasible compared to HyperLogLog. If you want to implement an estimator for the number of distinct elements, you might start with an earlier algorithm. There is plenty of complexity there if your goal is education. If your goal is practicality, you also want to stay away from KNW, because it could be a lot of work just to make something less practical than HyperLogLog.

As another piece of advice, you should probably ignore the suggestions to "just use Murmur hash" or "pick k values from xxhash" if you want to learn about and understand this algorithm or other randomized algorithms that use hashing. Murmur/xx might be fine in practice, but they are not k-wise independent families, and some of the advice on this page is not even semantically well-formed. For instance, "if you need k different hash, just re-use the same algorithm k times, with k different seeds" isn't relevant to k-wise independent families. For the algorithm you want to implement, you'll end up applying the hash functions an arbitrary number of times. You don't "need k different hash"; you need n different hash values, generated by first picking a function randomly from a k-independent hash family and then applying the chosen function to the streaming keys that are the input to algorithms like this.




Answer 4:


Just use a good non-cryptographic hash function. This advice perhaps will make me unpopular with my colleagues in theoretical computer science, but consider your adversary.

  1. Nature. Yeah, maybe it'll hit the minuscule fraction of inputs that cause your hash function to behave badly, but there are plenty of other ways for things to go wrong that a k-wise independent hash family won't fix (e.g., the random number generator that chose the hash function didn't do a good job, bugs, etc.), so you need to test end-to-end anyway.

  2. Oblivious adversary. This is what the theory assumes. Oblivious adversaries cannot look at your random bits. If only they were so nice in real life!

  3. Non-oblivious adversary. Randomness is pointless. Use a binary tree.




Answer 5:


I'm not 100% sure what you mean by "k-wise independent hash functions", but you can get k distinct hash functions by coming up with two hash functions, and then using linear combinations of them.

I have an example in my bloom filter module: http://stromberg.dnsalias.org/svn/bloom-filter/trunk/bloom_filter_mod.py. Ignore the get_bitno_seed_rnd function; look at hash1, hash2 and get_bitno_lin_comb.
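A minimal sketch of the linear-combination idea, written independently of the module linked above (it is not that module's code). The two base hashes are assumed to be BLAKE2b with different personalisation tags, and the names `_h` and `bit_positions` are hypothetical.

```python
import hashlib

def _h(data: bytes, person: bytes) -> int:
    """Hypothetical base hash: 64-bit value from BLAKE2b with a personalisation tag."""
    digest = hashlib.blake2b(data, digest_size=8, person=person).digest()
    return int.from_bytes(digest, "little")

def bit_positions(key: bytes, k: int, m: int):
    """Derive k indices in [0, m) as h1 + i * h2 (mod m), for i = 0 .. k-1."""
    h1 = _h(key, b"h1")
    # Make h2 odd so the step is coprime to m whenever m is a power of two.
    h2 = _h(key, b"h2") | 1
    return [(h1 + i * h2) % m for i in range(k)]

positions = bit_positions(b"some key", k=5, m=1024)
```

All k indices are determined by just two hash values, so this gives k distinct derived functions for a Bloom filter rather than a function drawn from a k-wise independent family.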



Source: https://stackoverflow.com/questions/16284317/obtaining-a-k-wise-independent-hash-function
