Why are 5381 and 33 so important in the djb2 algorithm?

南方客 2020-12-02 05:41

The djb2 algorithm has a hash function for strings.

unsigned long hash = 5381;
int c;

while (c = *str++)
    hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

4 Answers
  • 2020-12-02 05:59

    On 5381, Dan Bernstein (djb2) says in this article:

    [...] practically any good multiplier works. I think you're worrying about the fact that 31c + d doesn't cover any reasonable range of hash values if c and d are between 0 and 255. That's why, when I discovered the 33 hash function and started using it in my compressors, I started with a hash value of 5381. I think you'll find that this does just as well as a 261 multiplier.

    The whole thread is here if you're interested.

    Ozan Yigit has a page on hash functions which says:

    [...] the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
  • 2020-12-02 06:01

    33 was chosen because:

    1) As stated before, multiplication is easy to compute using shift and add.

    2) As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. A larger shift would duplicate fewer bits; a smaller shift would keep bit interactions more local, so the interactions would take longer to spread.

    3) The shift of 5 is relatively prime to 32 (the number of bits in the register), which helps with avalanching. While there are enough characters left in the string, each bit of an input byte will eventually interact with every preceding bit of input.

    4) The shift of 5 is a good shift amount when considering ASCII character data. An ASCII character can sort of be thought of as a 4-bit character type selector and a 4-bit character-of-type selector. E.g. the digits all have 0x3 in the first 4 bits. So an 8-bit shift would cause bits with a certain meaning to mostly interact with other bits that have the same meaning. A 4-bit or 2-bit shift would similarly produce strong interactions between like-minded bits. The 5-bit shift causes many of the four low-order bits of a character to strongly interact with many of the four high-order bits in the same character.
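
    The shift-and-add form in points 1) and 2) can be sketched as below. This is only an illustration: the function names djb2_shift and djb2_mul are made up for this example, and the accumulator is assumed to be a 32-bit uint32_t (matching the 32-bit register mentioned in point 3) rather than the unsigned long used in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* djb2 with the multiply-by-33 step written as shift-and-add.
 * Assumption: a 32-bit accumulator; the question uses unsigned long. */
uint32_t djb2_shift(const char *str)
{
    assert(str != NULL);
    uint32_t hash = 5381;
    unsigned char c;

    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + c;   /* (hash * 32 + hash) + c */

    return hash;
}

/* The same loop with an explicit multiplication, for comparison. */
uint32_t djb2_mul(const char *str)
{
    assert(str != NULL);
    uint32_t hash = 5381;
    unsigned char c;

    while ((c = (unsigned char)*str++) != 0)
        hash = hash * 33u + c;

    return hash;
}
```

    Both functions should return the same value for any input, since (hash << 5) + hash equals hash * 33 modulo 2^32.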

    As stated elsewhere, the choice of 5381 isn't too important and many other choices should work as well here.

    This is not a fast hash function since it processes its input a character at a time and doesn't try to use instruction-level parallelism. It is, however, easy to write. Quality of the output divided by ease of writing the code is likely to hit a sweet spot.

    On modern processors, multiplication is much faster than it was when this algorithm was developed and other multiplication factors (e.g. 2^13 + 2^5 + 1) may have similar performance, slightly better output, and be slightly easier to write.
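
    As a rough sketch of that idea, the multiplier can simply be made a parameter. The names hash_mult and ALT_MULTIPLIER are hypothetical, and keeping 5381 as the starting value and using a 32-bit accumulator are assumptions carried over from the question's code, not something this answer specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The example factor mentioned above: 2^13 + 2^5 + 1 = 8225. */
#define ALT_MULTIPLIER ((1u << 13) + (1u << 5) + 1u)

/* djb2-style hash with a caller-supplied multiplier (hypothetical helper).
 * Assumption: same starting value (5381) as the original function. */
uint32_t hash_mult(const char *str, uint32_t multiplier)
{
    assert(str != NULL);
    uint32_t hash = 5381;
    unsigned char c;

    while ((c = (unsigned char)*str++) != 0)
        hash = hash * multiplier + c;   /* one multiply and one add per byte */

    return hash;
}
```

    Calling hash_mult(s, 33) reproduces the original function, while hash_mult(s, ALT_MULTIPLIER) tries the alternative factor. Whether the output is actually better would need to be measured on real keys.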

    Contrary to another answer here, a good non-cryptographic hash function doesn't want to produce a random output. Instead, given two inputs that are nearly identical, it wants to produce widely different outputs. If your input values are randomly distributed, you don't need a good hash function; you can just use an arbitrary set of bits from your input. Some of the modern hash functions (Jenkins 3, Murmur, probably CityHash) produce a better distribution of outputs than random given inputs that are highly similar.
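
    One simple way to look at "nearly identical inputs, widely different outputs" is to count how many bits differ between two hashes. The sketch below is only a measuring aid: djb2_hash and bit_difference are names invented for this example, and the 32-bit accumulator is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit djb2, as in the question (accumulator width is an assumption). */
uint32_t djb2_hash(const char *str)
{
    assert(str != NULL);
    uint32_t hash = 5381;
    unsigned char c;

    while ((c = (unsigned char)*str++) != 0)
        hash = ((hash << 5) + hash) + c;

    return hash;
}

/* Number of bit positions in which two 32-bit hashes differ. */
int bit_difference(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    int count = 0;

    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

    For example, bit_difference(djb2_hash("key0"), djb2_hash("key1")) counts the bits that change when only the last character changes. With djb2 those two hashes differ numerically by exactly 1, because the last character is added after the final multiplication; this is the "while there are enough characters left in the string" caveat from point 3.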

  • 2020-12-02 06:07

    Maybe because 33 == 2^5 + 1 and many hashing algorithms use 2^n + 1 as their multiplier?

    Credit to Jerome Berger

    Update:

    This seems to be borne out by the current version of the software package djb2 originally came from: cdb

    The notes I linked to describe the heart of the hashing algorithm as using h = ((h << 5) + h) ^ c to do the hashing... h << 5 is a fast hardware way to multiply by 2^5, and adding h back in makes the effective multiplier 2^5 + 1 = 33.
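
    A minimal sketch of that XOR variant, for comparison with the additive version in the question. The function name xor_variant_hash is made up here, and the 5381 starting value and 32-bit accumulator are assumptions carried over from the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variant that combines each byte with XOR instead of addition:
 * h = ((h << 5) + h) ^ c
 * Assumptions: starts at 5381 and uses a 32-bit accumulator. */
uint32_t xor_variant_hash(const char *str)
{
    assert(str != NULL);
    uint32_t h = 5381;
    unsigned char c;

    while ((c = (unsigned char)*str++) != 0)
        h = ((h << 5) + h) ^ c;   /* multiply by 33, then XOR in the byte */

    return h;
}
```

    The only difference from the question's loop is the final operator: ^ c instead of + c.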

  • 2020-12-02 06:15

    This hash function is similar to a Linear Congruential Generator (LCG - a simple class of functions that generate a series of pseudo-random numbers), which generally has the form:

    X = (a * X) + c;  // "mod M", where M = 2^32 or 2^64 typically
    

    Note the similarity to the djb2 hash function... a=33, M=2^32. In order for an LCG to have a "full period" (i.e. to visit every value from 0 to M-1 before repeating, which is as random as it can be), a must have certain properties:

    • a-1 is divisible by all prime factors of M (a-1 is 32, which is divisible by 2, the only prime factor of 2^32)
    • a-1 is a multiple of 4 if M is a multiple of 4 (yes and yes)

    In addition, c and M are supposed to be relatively prime (which will be true for odd values of c).
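
    These conditions can be checked by brute force on a small modulus. The sketch below is an illustration only: lcg_period is a hypothetical helper, and M = 2^8 and c = 1 are values picked here so the loop stays short; they are not taken from djb2 itself.

```c
#include <assert.h>
#include <stdint.h>

/* Number of steps for X = (a * X + c) mod m to return to X = 0,
 * or 0 if it does not return within m steps (hypothetical helper). */
uint32_t lcg_period(uint32_t a, uint32_t c, uint32_t m)
{
    assert(m > 0);
    uint32_t x = 0;

    for (uint32_t steps = 1; steps <= m; steps++) {
        /* 64-bit intermediate so a * x cannot overflow before the modulo */
        x = (uint32_t)(((uint64_t)a * x + c) % m);
        if (x == 0)
            return steps;
    }
    return 0;
}
```

    With a = 33, c = 1, m = 256 all three conditions hold and lcg_period returns 256, the full period. With a = 3 the second condition fails (a-1 = 2 is not a multiple of 4) and the cycle through 0 is shorter than m.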

    So as you can see, this hash function somewhat resembles a good LCG. And when it comes to hash functions, you want one that produces a "random" distribution of hash values given a realistic set of input strings.

    As for why this hash function is good for strings, I think it has a good balance of being extremely fast while providing a reasonable distribution of hash values. I've seen many other hash functions that claim to have much better output characteristics, but they involve many more lines of code. For instance, see this page about hash functions.

    EDIT: This good answer explains why 33 and 5381 were chosen for practical reasons.
