How does a hash table work?

盖世英雄少女心 2020-11-22 09:32

I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash (I a

15 Answers
  • 2020-11-22 09:56

    It's even simpler than that.

    A hashtable is nothing more than an array (usually a sparse one) of vectors which contain key/value pairs. The maximum size of this array is typically smaller than the number of items in the set of possible values for the type of data being stored in the hashtable.

    The hash algorithm is used to generate an index into that array based on the values of the item that will be stored in the array.

    This is where storing vectors of key/value pairs in the array comes in. Because the set of values that can be indexes into the array is typically smaller than the number of all possible values the type can have, it is possible that your hash algorithm will generate the same value for two separate keys. A good hash algorithm will prevent this as much as possible (which is why hashing is usually delegated to the type itself, since it has specific information a general-purpose hash algorithm can't possibly know), but collisions are impossible to prevent entirely.

    Because of this, you can have multiple keys that generate the same hash code. When that happens, the items in the vector are iterated through, and a direct comparison is done between the key in the vector and the key that is being looked up. If it is found, the value associated with the key is returned; otherwise, nothing is returned.
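
    In code, that lookup might look roughly like this - a minimal sketch assuming a fixed-size table of `std::vector` buckets and `std::hash` as the hash algorithm (both are illustrative choices, not part of any particular library):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    struct Entry { std::string key; int value; };

    const std::size_t kNumBuckets = 16;          // illustrative size
    std::vector<Entry> table[kNumBuckets];       // the array of vectors of key/value pairs

    std::size_t bucketIndex(const std::string& key) {
        return std::hash<std::string>{}(key) % kNumBuckets;   // hash algorithm -> index into the array
    }

    const int* lookup(const std::string& key) {
        for (const Entry& e : table[bucketIndex(key)])   // every key that hashed to this index
            if (e.key == key) return &e.value;           // direct key comparison
        return nullptr;                                  // not found
    }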

  • 2020-11-22 09:56

    A hash table rests entirely on the fact that practical computation follows the random-access machine model, i.e. the value at any address in memory can be accessed in O(1), or constant, time.

    So, if I have a universe of keys (the set of all possible keys I can use in an application, e.g. student roll numbers; if a roll number is 4 digits, this universe is the set of numbers from 1 to 9999), and a way to map them to a finite set of numbers small enough that I can allocate memory for it in my system, theoretically my hash table is ready.

    Generally, in applications the size of the universe of keys is much larger than the number of elements I want to add to the hash table (I don't want to waste 1 GB of memory to hash, say, 10,000 or 100,000 integer values just because they are 32 bits long in binary representation). So, we use hashing. It's a sort of mixing "mathematical" operation that maps my large universe to a small set of values I can accommodate in memory. In practical cases, the space of a hash table is often of the same "order" (big-O) as (number of elements × size of each element), so we don't waste much memory.

    Now, with a large set mapped to a small set, the mapping must be many-to-one. So, different keys will be allotted the same space (?? not fair). There are a few ways to handle this; here are the two popular ones I know:

    • Use the space that was to be allocated to the value as a reference to a linked list. This linked list stores one or more values that end up in the same slot under the many-to-one mapping. The linked list also contains the keys, to help whoever comes searching. It's like many people living in the same apartment: when a delivery man comes, he goes to the room and asks specifically for the guy.
    • Use a double hash function in an array, one which gives the same sequence of candidate locations every time rather than a single value. When I go to store a value, I check whether the required memory location is free or occupied. If it's free, I store my value there; if it's occupied, I take the next location from the sequence, and so on until I find a free location and store my value there. When searching or retrieving the value, I walk the same path given by the sequence and at each location check whether the value is there, until I find it or have searched all possible locations in the array (see the sketch after this list).
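
    Here is a rough sketch of that second strategy (open addressing with double hashing). The table size, the two hash functions, and all names here are assumptions made up for illustration:

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <utility>

    const std::size_t M = 17;   // prime slot count so every probe sequence visits every slot
    std::optional<std::pair<std::string, int>> slot[M];

    std::size_t h1(const std::string& k) { return std::hash<std::string>{}(k) % M; }
    std::size_t h2(const std::string& k) {            // crude stand-in for a second hash; step is never 0
        return 1 + std::hash<std::string>{}(k + "#") % (M - 1);
    }

    bool insert(const std::string& key, int value) {
        for (std::size_t i = 0; i < M; ++i) {
            std::size_t pos = (h1(key) + i * h2(key)) % M;     // i-th location in the key's sequence
            if (!slot[pos]) { slot[pos].emplace(key, value); return true; }   // free: store here
        }
        return false;                                          // every location occupied
    }

    const int* find(const std::string& key) {
        for (std::size_t i = 0; i < M; ++i) {
            std::size_t pos = (h1(key) + i * h2(key)) % M;     // walk the same path back
            if (!slot[pos]) return nullptr;                    // a free location means the key was never stored
            if (slot[pos]->first == key) return &slot[pos]->second;
        }
        return nullptr;
    }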

    Introduction to Algorithms by CLRS provides very good insight into the topic.

  • 2020-11-22 09:57

    You take a bunch of things, and an array.

    For each thing, you make up an index for it, called a hash. The important thing about the hash is that it 'scatters' a lot; you don't want two similar things to have similar hashes.

    You put your things into the array at the position indicated by the hash. More than one thing can wind up at a given hash, so you store the things in arrays or something else appropriate, which we generally call a bucket.

    When you're looking things up in the hash, you go through the same steps, figuring out the hash value, then seeing what's in the bucket at that location and checking whether it's what you're looking for.

    When your hashing is working well and your array is big enough, there will only be a few things at most at any particular index in the array, so you won't have to look at very much.

    For bonus points, make it so that when your hash table is accessed, it moves the thing found (if any) to the beginning of the bucket, so next time it's the first thing checked.
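
    A minimal sketch of that bonus trick, assuming each bucket is a `std::list` and `std::hash` supplies the hash (both are illustrative assumptions); `splice` moves the found element to the front of its bucket:

    #include <functional>
    #include <list>
    #include <string>
    #include <vector>

    std::vector<std::list<std::string>> buckets(16);   // 16 buckets, purely for illustration

    bool containsAndPromote(const std::string& item) {
        auto& bucket = buckets[std::hash<std::string>{}(item) % buckets.size()];
        for (auto it = bucket.begin(); it != bucket.end(); ++it) {
            if (*it == item) {
                bucket.splice(bucket.begin(), bucket, it);   // move the hit to the front of its bucket
                return true;                                  // next lookup checks it first
            }
        }
        return false;
    }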

  • 2020-11-22 09:57

    How the hash is computed usually does not depend on the hashtable, but on the items added to it. In frameworks/base class libraries such as .NET and Java, each object has a GetHashCode() (or similar) method returning a hash code for that object. The ideal hash code algorithm and the exact implementation depend on the data represented by the object.

  • 2020-11-22 10:03

    The Basic Idea

    Why do people use dressers to store their clothing? Besides looking trendy and stylish, they have the advantage that every article of clothing has a place where it's supposed to be. If you're looking for a pair of socks, you just check the sock drawer. If you're looking for a shirt, you check the drawer that has your shirts in it. It doesn't matter, when you're looking for socks, how many shirts you have or how many pairs of pants you own, since you don't need to look at them. You just look in the sock drawer and expect to find socks there.

    At a high level, a hash table is a way of storing things that is (kinda sorta ish) like a dresser for clothing. The basic idea is the following:

    • You get some number of locations (drawers) where items can be stored.
    • You come up with some rule that tells you which location (drawer) each item belongs in.
    • When you need to find something, you use that rule to determine which drawer to look into.

    The advantage of a system like this is that, assuming your rule isn't too complicated and you have an appropriate number of drawers, you can find what you're looking for pretty quickly by simply looking in the right place.

    When you're putting your clothes away, the "rule" you use might be something like "socks go in the top left drawer, and shirts go in the large middle drawer, etc." When you're storing more abstract data, though, we use something called a hash function to do this for us.

    A reasonable way to think about a hash function is as a black box. You put data in one side, and a number called the hash code comes out of the other. Schematically, it looks something like this:

                 +-----------+
        data --> |   hash    | --> hash code
                 |  function |
                 +-----------+

    All hash functions are deterministic: if you put the same data into the function multiple times, you'll always get the same value coming out the other side. And a good hash function should look more or less random: small changes to the input data should give wildly different hash codes. For example, the hash codes for the string "pudu" and for the string "kudu" will likely be wildly different from one another. (Then again, it's possible that they're the same. After all, if a hash function's outputs should look more or less random, there's a chance we get the same hash code twice.)

    How exactly do you build a hash function? For now, let's go with "decent people shouldn't think too much about that." Mathematicians have worked out better and worse ways to design hash functions, but for our purposes we don't really need to worry too much about the internals. It's plenty good to just think of a hash function as a function that's

    • deterministic (equal inputs give equal outputs), but
    • looks random (it's hard to predict one hash code given another).
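
    Here's a tiny demo of those two properties, using `std::hash` purely for illustration (its exact quality is implementation-defined, but it is deterministic within a run):

    #include <functional>
    #include <iostream>
    #include <string>

    int main() {
        std::hash<std::string> h;
        // Deterministic: hashing the same input twice gives the same code.
        std::cout << h("pudu") << " == " << h("pudu") << '\n';
        // "Looks random": a one-character change usually gives a wildly different code.
        std::cout << h("pudu") << " vs " << h("kudu") << '\n';
    }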

    Once we have a hash function, we can build a very simple hash table. We'll make an array of "buckets," which you can think of as being analogous to drawers in our dresser. To store an item in the hash table, we'll compute the hash code of the object and use it as an index in the table, which is analogous to "pick which drawer this item goes in." Then, we put that data item inside the bucket at that index. If that bucket was empty, great! We can put the item there. If that bucket is full, we have some choices of what we can do. A simple approach (called chained hashing) is to treat each bucket as a list of items, the same way that your sock drawer might store multiple socks, and then just add the item to the list at that index.

    To look something up in a hash table, we use basically the same procedure. We begin by computing the hash code for the item to look up, which tells us which bucket (drawer) to look in. If the item is in the table, it has to be in that bucket. Then, we just look at all the items in the bucket and see if our item is in there.

    What's the advantage of doing things this way? Well, assuming we have a large number of buckets, we'd expect that most buckets won't have too many things in them. After all, our hash function kinda sorta ish looks like it has random outputs, so the items are distributed kinda sorta ish evenly across all the buckets. In fact, if we formalize the notion of "our hash function looks kinda random," we can prove that the expected number of items in each bucket is the ratio of the total number of items to the total number of buckets. Therefore, we can find the items we're looking for without having to do too much work.
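
    A bare-bones sketch of chained insertion as just described; the bucket count and types are arbitrary choices, and a real table would also resize itself (see The Details below):

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    // Each bucket is a list of key/value pairs (one "drawer" can hold several items).
    std::vector<std::list<std::pair<std::string, int>>> buckets(8);

    void insert(const std::string& key, int value) {
        std::size_t index = std::hash<std::string>{}(key) % buckets.size();  // "pick which drawer this goes in"
        // Chained hashing: just add the item to the list at that index.
        // (A real table would first check whether the key is already present.)
        buckets[index].push_back({key, value});
    }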

    The Details

    Explaining how "a hash table" works is a bit tricky because there are many flavors of hash tables. This next section talks about a few general implementation details common to all hash tables, plus some specifics of how different styles of hash tables work.

    A first question that comes up is how you turn a hash code into a table slot index. In the above discussion, I just said "use the hash code as an index," but that's actually not a very good idea. In most programming languages, hash codes work out to 32-bit or 64-bit integers, and you aren't going to be able to use those directly as bucket indices. Instead, a common strategy is to make an array of buckets of some size m, compute the (full 32- or 64-bit) hash codes for your items, then mod them by the size of the table to get an index between 0 and m-1, inclusive. The use of modulus works well here because it's decently fast and does a decent job spreading the full range of hash codes across a smaller range.

    (You sometimes see bitwise operators used here. If your table has a size that's a power of two, say, 2^k, then computing the bitwise AND of the hash code and the number 2^k - 1 is equivalent to computing a modulus, and it's significantly faster.)
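
    In code, those two ways of reducing a hash code to an index look roughly like this (names are illustrative):

    #include <cstddef>
    #include <cstdint>

    // Reduce a full-width hash code to a bucket index in [0, m-1].
    std::size_t indexByMod(std::uint64_t hashCode, std::size_t m) {
        return hashCode % m;              // works for any table size m
    }

    std::size_t indexByMask(std::uint64_t hashCode, std::size_t m) {
        // Only valid when m is a power of two (m == 2^k):
        // hashCode & (m - 1) then equals hashCode % m, but is cheaper.
        return hashCode & (m - 1);
    }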

    The next question is how to pick the right number of buckets. If you pick too many buckets, then most buckets will be empty or have few elements (good for speed - you only have to check a few items per bucket), but you'll use a bunch of space simply storing the buckets (not so great, though maybe you can afford it). The flip side of this holds true as well - if you have too few buckets, then you'll have more elements per bucket on average, making lookups take longer, but you'll use less memory.

    A good compromise is to dynamically change the number of buckets over the lifetime of the hash table. The load factor of a hash table, typically denoted α, is the ratio of the number of elements to the number of buckets. Most hash tables pick some maximum load factor. Once the load factor crosses this limit, the hash table increases its number of slots (say, by doubling), then redistributes the elements from the old table into the new one. This is called rehashing. Assuming the maximum load factor in the table is a constant, this ensures that, assuming you have a good hash function, the expected cost of doing a lookup remains O(1). Insertions now have an amortized expected cost of O(1) because of the cost of periodically rebuilding the table, as is the case with deletions. (Deletions can similarly compact the table if the load factor gets too small.)
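
    A sketch of that rehashing step, reusing the chained-bucket layout from above; the 0.75 maximum load factor is just an illustrative choice:

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    using Bucket = std::list<std::pair<std::string, int>>;

    // Grow the table once the load factor (elements / buckets) crosses a limit.
    void maybeRehash(std::vector<Bucket>& buckets, std::size_t numElements) {
        const double maxLoadFactor = 0.75;                    // assumed limit
        if (static_cast<double>(numElements) / buckets.size() <= maxLoadFactor)
            return;

        std::vector<Bucket> bigger(buckets.size() * 2);       // e.g. double the slot count
        for (const Bucket& b : buckets)
            for (const auto& kv : b) {
                std::size_t index = std::hash<std::string>{}(kv.first) % bigger.size();
                bigger[index].push_back(kv);                  // redistribute into the new table
            }
        buckets.swap(bigger);
    }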

    Hashing Strategies

    Up to this point, we've been talking about chained hashing, which is one of many different strategies for building a hash table. As a reminder, chained hashing kinda sorta looks like a clothing dresser - each bucket (drawer) can hold multiple items, and when you do a lookup you check all of those items.

    However, this isn't the only way to build a hash table. There's another family of hash tables that use a strategy called open addressing. The basic idea behind open addressing is to store an array of slots, where each slot can either be empty or hold exactly one item.

    In open addressing, when you perform an insertion, as before, you jump to some slot whose index depends on the hash code computed. If that slot is free, great! You put the item there, and you're done. But what if the slot is already full? In that case, you use some secondary strategy to find a different free slot in which to store the item. The most common strategy for doing this uses an approach called linear probing. In linear probing, if the slot you want is already full, you simply shift to the next slot in the table. If that slot is empty, great! You can put the item there. But if that slot is full, you then move to the next slot in the table, etc. (If you hit the end of the table, just wrap back around to the beginning).

    Linear probing is a surprisingly fast way to build a hash table. CPU caches are optimized for locality of reference, so memory lookups in adjacent memory locations tend to be much faster than memory lookups in scattered locations. Since a linear probing insertion or deletion works by hitting some array slot and then walking linearly forward, it results in few cache misses and ends up being a lot faster than what the theory normally predicts. (And it happens to be the case that the theory predicts it's going to be very fast!)
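
    A sketch of linear probing insertion and lookup, under the same illustrative assumptions as before (fixed table size, no deletions, `std::hash` as the hash function):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <utility>
    #include <vector>

    // Each slot is either empty or holds exactly one key/value pair.
    std::vector<std::optional<std::pair<std::string, int>>> slots(16);

    void insert(const std::string& key, int value) {
        std::size_t pos = std::hash<std::string>{}(key) % slots.size();  // slot the hash code points at
        while (slots[pos]) pos = (pos + 1) % slots.size();   // occupied: shift to the next slot, wrapping around
        slots[pos].emplace(key, value);                      // first free slot found
        // (assumes the table never fills up; a real table would track its load factor and resize)
    }

    int* find(const std::string& key) {
        std::size_t pos = std::hash<std::string>{}(key) % slots.size();
        while (slots[pos]) {                                 // an empty slot ends the search: the key isn't here
            if (slots[pos]->first == key) return &slots[pos]->second;
            pos = (pos + 1) % slots.size();
        }
        return nullptr;
    }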

    Another strategy that's become popular recently is cuckoo hashing. I like to think of cuckoo hashing as the "Frozen" of hash tables. Instead of having one hash table and one hash function, we have two hash tables and two hash functions. Each item can be in exactly one of two places - it's either in the location in the first table given by the first hash function, or it's in the location in the second table given by the second hash function. This means that lookups are worst-case efficient, since you only have to check two spots to see if something is in the table.

    Insertions in cuckoo hashing use a different strategy than before. We start off by seeing if either of the two slots that could hold the item are free. If so, great! We just put the item there. But if that doesn't work, then we pick one of the slots, put the item there, and kick out the item that used to be there. That item has to go somewhere, so we try putting it in the other table at the appropriate slot. If that works, great! If not, we kick an item out of that table and try inserting it into the other table. This process continues until everything comes to rest, or we find ourselves trapped in a cycle. (That latter case is rare, and if it happens we have a bunch of options, like "put it in a secondary hash table" or "choose new hash functions and rebuild the tables.")
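
    A sketch of cuckoo lookup and insertion under those assumptions: two tables, two made-up hash functions, and a bounded number of kicks before giving up (at which point a real implementation would pick new hash functions and rebuild):

    #include <cstddef>
    #include <functional>
    #include <optional>
    #include <string>
    #include <utility>

    const std::size_t N = 16;                                // size of each table (assumed)
    std::optional<std::pair<std::string, int>> table1[N], table2[N];

    std::size_t h1(const std::string& k) { return std::hash<std::string>{}(k) % N; }
    std::size_t h2(const std::string& k) { return std::hash<std::string>{}(k + "!") % N; }  // stand-in second hash

    // Worst-case O(1) lookup: the item can only be in one of two spots.
    int* find(const std::string& key) {
        if (table1[h1(key)] && table1[h1(key)]->first == key) return &table1[h1(key)]->second;
        if (table2[h2(key)] && table2[h2(key)]->first == key) return &table2[h2(key)]->second;
        return nullptr;
    }

    bool insert(std::string key, int value) {
        std::pair<std::string, int> item{std::move(key), value};
        for (int kicks = 0; kicks < 32; ++kicks) {           // bounded loop guards against (probable) cycles
            std::size_t pos = h1(item.first);
            if (!table1[pos]) { table1[pos] = std::move(item); return true; }
            std::swap(item, *table1[pos]);                   // kick out the occupant of table 1
            pos = h2(item.first);
            if (!table2[pos]) { table2[pos] = std::move(item); return true; }
            std::swap(item, *table2[pos]);                   // kick out of table 2 and try again
        }
        return false;   // give up: caller would choose new hash functions and rebuild the tables
    }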

    There are many improvements possible for cuckoo hashing, such as using multiple tables, letting each slot hold multiple items, and making a "stash" that holds items that can't fit anywhere else, and this is an active area of research!

    Then there are hybrid approaches. Hopscotch hashing is a mix between open addressing and chained hashing that can be thought of as taking a chained hash table and storing each item in each bucket in a slot near where the item wants to go. This strategy plays well with multithreading. The Swiss table uses the fact that some processors can perform multiple operations in parallel with a single instruction to speed up a linear probing table. Extendible hashing is designed for databases and file systems and uses a mix of a trie and a chained hash table to dynamically increase bucket sizes as individual buckets get loaded. Robin Hood hashing is a variant of linear probing in which items can be moved after being inserted to reduce the variance in how far from home each element can live.

    Further Reading

    For more information about the basics of hash tables, check out these lecture slides on chained hashing and these follow-up slides on linear probing and Robin Hood hashing. You can learn more about cuckoo hashing here and about theoretical properties of hash functions here.

  • 2020-11-22 10:05

    Lots of answers, but none of them are very visual, and hash tables can easily "click" when visualised.

    Hash tables are often implemented as arrays of linked lists. If we imagine a table storing people's names, after a few insertions it might be laid out in memory as below, where ()-enclosed numbers are hash values of the text/name.

    bucket#  bucket content / linked list
    
    [0]      --> "sue"(780) --> null
    [1]      null
    [2]      --> "fred"(42) --> "bill"(9282) --> "jane"(42) --> null
    [3]      --> "mary"(73) --> null
    [4]      null
    [5]      --> "masayuki"(75) --> "sarwar"(105) --> null
    [6]      --> "margaret"(2626) --> null
    [7]      null
    [8]      --> "bob"(308) --> null
    [9]      null
    

    A few points:

    • each of the array entries (indices [0], [1]...) is known as a bucket, and starts a - possibly empty - linked list of values (aka elements, in this example - people's names)
    • each value (e.g. "fred" with hash 42) is linked from bucket [hash % number_of_buckets] e.g. 42 % 10 == [2]; % is the modulo operator - the remainder when divided by the number of buckets
    • multiple data values may collide at and be linked from the same bucket, most often because their hash values collide after the modulo operation (e.g. 42 % 10 == [2], and 9282 % 10 == [2]), but occasionally because the hash values are the same (e.g. "fred" and "jane" both shown with hash 42 above)
      • most hash tables handle collisions - with slightly reduced performance but no functional confusion - by comparing the full value (here text) of a value being sought or inserted to each value already in the linked list at the hashed-to bucket

    Linked list lengths relate to load factor, not the number of values

    As the number of stored values grows, hash tables implemented as above tend to resize themselves (i.e. create a bigger array of buckets, create new/updated linked lists there-from, delete the old array) to keep the ratio of values to buckets (aka load factor) somewhere in the 0.5 to 1.0 range.

    Hans gives the actual formula for other load factors in a comment below, but for indicative values: with load factor 1 and a cryptographic-strength hash function, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) will have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) or ~0.3% five, etc. The average chain length across non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant-time operations.
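
    For the curious, those percentages come from modelling bucket occupancy as a Poisson distribution with mean λ equal to the load factor (here λ = 1); this is a standard model of "random" hashing, not something specific to any one implementation:

    P(\text{bucket holds } k \text{ elements}) = \frac{\lambda^k e^{-\lambda}}{k!}
        \;\xrightarrow{\;\lambda = 1\;}\; \frac{1}{k!\,e}
        \quad\Rightarrow\quad P(0) = P(1) \approx 36.8\%,\; P(2) \approx 18.4\%,\; \ldots

    \text{average chain length over non-empty buckets}
        = \frac{\lambda}{1 - e^{-\lambda}}\Big|_{\lambda = 1}
        = \frac{1}{1 - 1/e} \approx 1.58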

    How a hash table can associate keys with values

    Given a hash table implementation as described above, we can imagine creating a value type such as `struct Value { string name; int age; };`, and equality comparison and hash functions that only look at the `name` field (ignoring age), and then something wonderful happens: we can store `Value` records like `{"sue", 63}` in the table, then later search for "sue" without knowing her age, find the stored value and recover or even update her age - happy birthday Sue - which interestingly doesn't change the hash value so doesn't require that we move Sue's record to another bucket.

    When we do this, we're using the hash table as an associative container aka map, and the values it stores can be deemed to consist of a key (the name) and one or more other fields still termed - confusingly - the value (in my example, just the age). A hash table implementation used as a map is known as a hash map.
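
    Here's a sketch of that idea in C++, using the standard library's `std::unordered_set` as a stand-in for the hand-rolled table described above, with hash and equality functions that only look at `name`; marking `age` mutable is what lets us update it in place without disturbing which bucket Sue lives in:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_set>

    struct Value { std::string name; mutable int age; };   // age is not part of the key, so it may change in place

    struct NameHash  { std::size_t operator()(const Value& v) const { return std::hash<std::string>{}(v.name); } };
    struct NameEqual { bool operator()(const Value& a, const Value& b) const { return a.name == b.name; } };

    int main() {
        std::unordered_set<Value, NameHash, NameEqual> people;
        people.insert({"sue", 63});

        auto it = people.find({"sue", 0});      // the age we pass here is irrelevant to the search
        if (it != people.end()) it->age = 64;   // happy birthday Sue: same hash, same bucket, no move needed
    }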

    This contrasts with the example earlier in this answer where we stored discrete values like "sue", which you could think of as being its own key: that kind of usage is known as a hash set.

    There are other ways to implement a hash table

    Not all hash tables use linked lists (known as separate chaining), but most general purpose ones do, as the main alternative closed hashing (aka open addressing) - particularly with erase operations supported - has less stable performance properties with collision-prone keys/hash functions.


    A few words on hash functions

    Strong hashing...

    A general purpose, worst-case collision-minimising hash function's job is to spray the keys around the hash table buckets effectively at random, while always generating the same hash value for the same key. Even one bit changing anywhere in the key would ideally - randomly - flip about half the bits in the resultant hash value.

    This is normally orchestrated with maths too complicated for me to grok. I'll mention one easy-to-understand way - not the most scalable or cache friendly but inherently elegant (like encryption with a one-time pad!) - as I think it helps drive home the desirable qualities mentioned above. Say you were hashing 64-bit doubles - you could create 8 tables each of 256 random numbers (code below), then use each 8-bit/1-byte slice of the double's memory representation to index into a different table, XORing the random numbers you look up. With this approach, it's easy to see that a bit (in the binary digit sense) changing anywhere in the double results in a different random number being looked up in one of the tables, and a totally uncorrelated final value.

    // note caveats above: cache unfriendly (SLOW) but strong hashing...
    // one table of 256 random size_t values per byte of the double
    size_t random[8][256] = { ...random data... };
    // view the double's bytes; unsigned char avoids negative indices on
    // platforms where plain char is signed
    const unsigned char* p = (const unsigned char*)&my_double;
    size_t hash = random[0][p[0]] ^
                  random[1][p[1]] ^
                  ... ^
                  random[7][p[7]];
    

    Weak but oft-fast hashing...

    Many libraries' hashing functions pass integers through unchanged (known as a trivial or identity hash function); it's the other extreme from the strong hashing described above. An identity hash is extremely collision-prone in the worst cases, but the hope is that in the fairly common case of integer keys that tend to be incrementing (perhaps with some gaps), they'll map into successive buckets, leaving fewer empty than random hashing does (our ~36.8% at load factor 1, mentioned earlier), and thereby producing fewer collisions and fewer long linked lists of colliding elements than random mappings achieve. It's also great to save the time it takes to generate a strong hash, and if keys are looked up in order they'll be found in buckets nearby in memory, improving cache hits. When the keys don't increment nicely, the hope is they'll be random enough that they won't need a strong hash function to totally randomise their placement into buckets.
