I want to create a Hash Map (or another structure, if you have any suggestions) to store key-value pairs. The keys will all be inserted at once at the same time as the map is created.
There are some very good hashing routines; however, proving one of them to be near-perfect requires a lot of knowledge of the inputs. It seems that your inputs are unconstrained enough to make such a proof near-impossible.
Generally speaking, a perfect (or near-perfect) routine is sensitive to each bit/byte of input. For speed, the combining operation is typically XOR. To keep two identical bytes from cancelling each other out, such routines shift or rotate the bits between steps. However, the shift amount should be relatively prime to the number of bits in the hash word; otherwise, patterns in the input can partially cancel earlier input, which reduces the entropy of the result and increases the chance of collision.
The typical solution is to:

Start with a number that is a prime (any two distinct primes are relatively prime to each other)
while (more bytes to be considered) {
    take the next byte of input and multiply it by a second prime
    capture the high-order bits of the hash that the coming left shift would discard
    shift the bits in the hash "buffer" to the left
    restore the captured high-order bit(s) in the low positions
    XOR (mask) the multiplied byte into the buffer
}
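
A minimal Java sketch of that loop, assuming a 32-bit hash, a 5-bit rotation, and the arbitrary primes 31 and 131 (none of these constants are prescribed, they are just illustrative):

    // Sketch of the routine above. The primes (31, 131) and the 5-bit shift
    // are illustrative choices; 5 is relatively prime to the 32-bit word width.
    static int roughHash(byte[] input) {
        int hash = 31;                          // start with a prime
        for (byte b : input) {
            int mixed = (b & 0xFF) * 131;       // multiply the byte by a second prime
            int highBits = hash >>> (32 - 5);   // capture the bits a 5-bit left shift would lose
            hash = (hash << 5) | highBits;      // shift left, restoring the high bits in the low positions
            hash ^= mixed;                      // XOR (mask) the multiplied byte into the buffer
        }
        return hash;
    }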
The problems with such a routine are known. Basically, if the input has little variation, the bits do not disperse well, because the hash needs enough input to wander away from the initial prime starting number. Given sufficient input, though, this technique gives a good dispersion of input bits across the entire domain of outputs. Unfortunately, picking a random starting number is not a solution, as then it becomes impossible to recompute the same hash for the same key.
In any case, the prime used in the multiplication should be small enough that the multiplication does not overflow. Likewise, the captured high-order bits must be restored in the low-order positions, or you lose the dispersion contributed by the earliest input and the result ends up determined only by the last few bytes. Prime number selection affects the dispersion, and sometimes tuning is required for good effect.
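
In Java, that capture-and-restore step is simply a left rotation, so the same sketch can be written with Integer.rotateLeft; the constants are again assumptions and would still need tuning against your actual keys:

    static int rotatingHash(byte[] input) {
        int hash = 31;                            // prime starting value
        for (byte b : input) {
            hash = Integer.rotateLeft(hash, 5);   // rotation keeps the high-order bits instead of dropping them
            hash ^= (b & 0xFF) * 131;             // small prime multiplier, so the product stays well inside an int
        }
        return hash;
    }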
By now you should easily be able to see that a near-perfect hash takes more computational time than a decent, less-than-near-perfect hash. Hash-based data structures are designed to cope with collisions, and most Java hash structures resize at an occupancy threshold (the load factor, 0.75 by default for java.util.HashMap, and tunable). Since resizing is built in, as long as you don't write a terrible hash, the Java data structures will keep resizing themselves so your chance of collision stays low.
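
And since your keys are all inserted when the map is created, you can pre-size the map so it never resizes at all. A small sketch, assuming java.util.HashMap and its default load factor of 0.75 (the key count is hypothetical):

    import java.util.HashMap;
    import java.util.Map;

    int expectedKeys = 10_000;                  // hypothetical: known at map-creation time
    float loadFactor = 0.75f;                   // HashMap's default resize threshold
    int initialCapacity = (int) Math.ceil(expectedKeys / loadFactor);

    // Sized this way, the map will not rehash while the keys are bulk-inserted.
    Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);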
Optimizations that can speed up a hash include computing on groups of bits, dropping the occasional byte, pre-computing lookup tables of commonly used multiplied values (indexed by the input byte), etc. Don't assume an optimization is faster: depending on the architecture, the machine's details, and the "age" of the optimization, its assumptions may no longer hold, and applying it can actually increase the time it takes to compute the hash.
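
As one illustration of the lookup-table idea, the per-byte multiplication can be precomputed once (the prime 131 is again just an assumed example); whether the table actually beats the plain multiply depends on the machine, so measure before adopting it:

    // Hypothetical table: every possible unsigned byte value times the prime multiplier.
    static final int[] MULT_TABLE = new int[256];
    static {
        for (int i = 0; i < 256; i++) {
            MULT_TABLE[i] = i * 131;
        }
    }

    static int tableHash(byte[] input) {
        int hash = 31;
        for (byte b : input) {
            hash = Integer.rotateLeft(hash, 5);
            hash ^= MULT_TABLE[b & 0xFF];       // table lookup replaces the multiply
        }
        return hash;
    }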