How does a hash table work?

盖世英雄少女心 2020-11-22 09:32

I'm looking for an explanation of how a hash table works - in plain English for a simpleton like me!

For example, I know it takes the key, calculates the hash (I a

15 Answers
  • 2020-11-22 10:15

    This turns out to be a pretty deep area of theory, but the basic outline is simple.

    Essentially, a hash function is just a function that takes things from one space (say strings of arbitrary length) and maps them to a space useful for indexing (unsigned integers, say).
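
    As a rough illustration of such a mapping, here is a minimal sketch of one well-known string hash, 32-bit FNV-1a. The choice of FNV-1a and the name fnv1a_32 are assumptions made for this example; the answer does not prescribe any particular function.

```python
def fnv1a_32(data: bytes) -> int:
    """Map a byte string of any length to an unsigned 32-bit integer (FNV-1a)."""
    h = 0x811C9DC5                          # FNV offset basis for 32 bits
    for byte in data:
        h ^= byte                           # mix in the next input byte
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime, keep 32 bits
    return h


# Inputs of very different lengths all land in the same 0 .. 2**32 - 1 range.
print(fnv1a_32(b"hash table"))
print(fnv1a_32(b"a much longer key than the first one"))
```

    Whatever function is used, the same input always gives the same integer, which is what makes the result usable as an index.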

    If you only have a small space of things to hash, you might get away with just interpreting those things as integers, and you're done (e.g. 4-byte strings).
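
    For the 4-byte case, a sketch of "just interpret it as an integer" could look like this. The helper name four_bytes_as_int and the big-endian byte order are assumptions of the example.

```python
def four_bytes_as_int(key: bytes) -> int:
    """Treat a key of exactly 4 bytes as an unsigned 32-bit integer."""
    if len(key) != 4:
        raise ValueError("this shortcut only works for keys of exactly 4 bytes")
    return int.from_bytes(key, "big")   # byte order is an arbitrary choice here


print(four_bytes_as_int(b"abcd"))   # 1633837924, i.e. 0x61626364
```

    Every distinct 4-byte key gets a distinct integer, so no two keys can clash.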

    Usually, though, you've got a much larger space. If the space of things you allow as keys is bigger than the space of things you are using to index (your uint32s or whatever) then you can't possibly have a unique value for each one. When two or more things hash to the same result, you'll have to handle the redundancy in an appropriate way (this is usually referred to as a collision, and how you handle it, or whether you handle it at all, will depend a bit on what you are using the hash for).
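
    To make collisions concrete, here is a small sketch with a deliberately weak hash. The function toy_hash (sum of the bytes, reduced to 8 possible results) is invented for this example and is not a function anyone should use for real.

```python
def toy_hash(key: str, slots: int = 8) -> int:
    """Deliberately weak hash: sum of the byte values, reduced to `slots` results."""
    return sum(key.encode("utf-8")) % slots


keys = ["ab", "ba", "cat", "act", "dog"]

# Group the keys by the result they hash to.
by_slot = {}
for key in keys:
    by_slot.setdefault(toy_hash(key), []).append(key)

# Any result shared by more than one key is a collision.
collisions = {slot: ks for slot, ks in by_slot.items() if len(ks) > 1}
print(collisions)   # {3: ['ab', 'ba'], 0: ['cat', 'act']}
```

    Anagrams always collide under this function because addition ignores the order of the bytes, which is one reason real hash functions mix their input more thoroughly.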

    This implies you want collisions to be unlikely, and you probably also want the hash function to be fast.

    Balancing these two properties (and a few others) has kept many people busy!

    In practice you usually should be able to find a function that is known to work well for your application and use that.

    Now to make this work as a hashtable: Imagine you didn't care about memory usage. Then you can create an array as long as your indexing set (all uint32s, for example). As you add something to the table, you hash its key and look at the array at that index. If there is nothing there, you put your value there. If there is already something there, you add this new entry to a list of things at that address, along with enough information (your original key, or something clever) to find which entry actually belongs to which key.

    So as you go along, every entry in your hashtable (the array) is either empty, or contains one entry, or a list of entries. Retrieving is as simple as indexing into the array, and either returning the value, or walking the list of values and returning the right one.
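
    The scheme described above can be sketched roughly as follows. Several things here are assumptions of the example: the class name ChainedHashTable, the small fixed array of 16 slots (instead of one slot per uint32), the toy hash in _index, and the simplification that every slot always holds a list, even when it has zero or one entries.

```python
class ChainedHashTable:
    """An array of slots; each slot holds a list of (key, value) pairs."""

    def __init__(self, slots: int = 16):
        self._slots = [[] for _ in range(slots)]

    def _index(self, key: str) -> int:
        # Stand-in hash for the sketch: sum of the bytes, folded into the array size.
        return sum(key.encode("utf-8")) % len(self._slots)

    def put(self, key: str, value) -> None:
        chain = self._slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:          # key already stored: replace its value
                chain[i] = (key, value)
                return
        chain.append((key, value))           # new key: add it to the list at this index

    def get(self, key: str):
        # Walk the list at this index and use the stored key to pick the right entry.
        for existing_key, value in self._slots[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable()
table.put("ab", 1)
table.put("ba", 2)   # hashes to the same index as "ab", so both share one list
table.put("ab", 3)   # same key again: replaces the earlier value
print(table.get("ab"), table.get("ba"))   # 3 2
```

    Storing the original key next to each value is the "enough information" needed to tell colliding entries apart.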

    Of course, in practice you typically can't do this because it wastes too much memory. So you do everything based on a sparse array (where the only entries are the ones you actually use, and everything else is implicitly null).

    There are lots of schemes and tricks to make this work better, but that's the basics.

  • 2020-11-22 10:16

    Here's another way to look at it.

    I assume you understand the concept of an array A. That's something that supports the operation of indexing, where you can get to the Ith element, A[I], in one step, no matter how large A is.

    So, for example, if you want to store information about a group of people who all happen to have different ages, a simple way would be to have an array that is large enough, and use each person's age as an index into the array. That way, you could have one-step access to any person's information.

    But of course there could be more than one person with the same age, so what you put in the array at each entry is a list of all the people who have that age. So you can get to an individual person's information in one step plus a little bit of search in that list (called a "bucket"). It only slows down if there are so many people that the buckets get big. Then you need a larger array, and some other way to get more identifying information about the person, like the first few letters of their surname, instead of using age.
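
    A minimal sketch of the age example, with one bucket per possible age. The names MAX_AGE, add_person and find_person, the upper bound of 130, and the sample people are all made up for illustration.

```python
MAX_AGE = 130  # assumed upper bound, so that the array is "large enough"

# One bucket (a list) per possible age; the age itself is the array index.
buckets = [[] for _ in range(MAX_AGE + 1)]


def add_person(name: str, age: int, info: str) -> None:
    buckets[age].append((name, info))


def find_person(name: str, age: int):
    # One step to reach the bucket, then a short search inside it.
    for stored_name, info in buckets[age]:
        if stored_name == name:
            return info
    return None


add_person("Ann", 34, "likes chess")
add_person("Bob", 27, "plays bass")
add_person("Cy", 34, "runs marathons")   # same age as Ann, so they share a bucket

print(find_person("Cy", 34))   # runs marathons
print(len(buckets[34]))        # 2
```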

    That's the basic idea. Instead of using age, any function of the person that produces a good spread of values can be used. That's the hash function. For example, you could take every third bit of the ASCII representation of the person's name, scrambled in some order. All that matters is that you don't want too many people to hash to the same bucket, because the speed depends on the buckets remaining small.
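
    To see why the spread matters, this sketch compares two bucket functions on a group where everyone has the same age. The helper largest_bucket, the two bucket functions, and the name-based hash (sum of the name's bytes modulo 8, rather than the bit-picking idea above) are assumptions of the example.

```python
from collections import Counter


def largest_bucket(items, bucket_of) -> int:
    """Size of the fullest bucket when each item goes to bucket_of(item)."""
    counts = Counter(bucket_of(item) for item in items)
    return max(counts.values())


def by_age(person) -> int:
    return person[1]


def by_name(person) -> int:
    # Toy name hash: sum of the name's bytes, folded into 8 buckets.
    return sum(person[0].encode("utf-8")) % 8


people = [("Ann", 30), ("Bob", 30), ("Cy", 30), ("Dee", 30)]

print(largest_bucket(people, by_age))    # 4: everyone lands in the same bucket
print(largest_bucket(people, by_name))   # 1: each person gets a separate bucket
```

    With the age function every lookup has to search all four people; with the name function each lookup finds its person immediately.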

  • 2020-11-22 10:19

    For all those looking for programming parlance, here is how it works. The internal implementation of advanced hashtables has many intricacies and optimisations for storage allocation/deallocation and search, but the top-level idea will be very much the same.

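    // Pseudocode. bucket_exists() and bucket_contains_value() are hypothetical
    // helpers, like the other functions called here.
    (void) addValue : (object) value
    {
       // The hash function maps the value to a bucket index (0 is a valid index).
       int bucket = calculate_bucket_from_val(value);
       if (!bucket_exists(bucket))
       {
          // first value to land at this index: create the bucket
          create_extra_space_for_bucket(bucket);
       }
       // otherwise nothing to set up, the value goes into the existing bucket
       put_value_into_bucket(bucket, value);
    }
    
    (bool) exists : (object) value
    {
       int bucket = calculate_bucket_from_val(value);
       // An index can be computed for any value, so check what the bucket holds.
       return bucket_exists(bucket) && bucket_contains_value(bucket, value);
    }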
    

    where calculate_bucket_from_val() is the hashing function where all the uniqueness magic must happen.

    The rule of thumb is: for a given value to be inserted, the bucket must be DERIVABLE FROM THE VALUE that it is supposed to store, and ideally UNIQUE to it. When two values do land in the same bucket, the bucket has to hold both.

    A bucket is any space where values are stored. Here I have kept it as an int used as an array index, but it may be a memory location as well.
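
    For readers who want something they can run, here is one possible Python rendering of the same top-level idea. The class name HashSet, the fixed count of 8 buckets, and the toy hash in _calculate_bucket are assumptions of this sketch, not part of the pseudocode above.

```python
class HashSet:
    """Stores values in buckets chosen by a hash of the value."""

    def __init__(self, bucket_count: int = 8):
        # None means "this bucket has not been created yet".
        self._buckets = [None] * bucket_count

    def _calculate_bucket(self, value: str) -> int:
        # Toy hash for the sketch: sum of the bytes, folded into the bucket count.
        return sum(value.encode("utf-8")) % len(self._buckets)

    def add_value(self, value: str) -> None:
        bucket = self._calculate_bucket(value)
        if self._buckets[bucket] is None:
            self._buckets[bucket] = []            # create the bucket on first use
        if value not in self._buckets[bucket]:
            self._buckets[bucket].append(value)   # a bucket may hold several values

    def exists(self, value: str) -> bool:
        contents = self._buckets[self._calculate_bucket(value)]
        return contents is not None and value in contents


s = HashSet()
s.add_value("cat")
print(s.exists("cat"))   # True
print(s.exists("act"))   # False: same bucket as "cat", but never added
print(s.exists("dog"))   # False: its bucket was never created
```

    The "act" lookup shows why a membership test has to look inside the bucket: two different values can be sent to the same bucket.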
