Question
The Question
What is a good specialization of std::hash for use in the third template parameter of std::unordered_map or std::unordered_set for a user defined type for which all member data types already have a good specialization of std::hash?
For this question, I define "good" as simple to implement and understand, reasonably efficient, and unlikely to produce hash table collisions. The definition of good does not include any statements about security.
The State of What is Google'able
At the moment, two StackOverflow questions are the first hits for a Google search of "std hash specialization".
The first, How to specialize std::hash::operator() for user-defined type in unordered containers?, addresses whether it is legal to open the std namespace and add template specializations.
The second, How to specialize std::hash for type from other library, essentially addresses that same question.
This leaves the current question. Given that implementations of the C++ Standard Library define hash functions for primitive types and types in the Standard Library, what is a simple and effective way of specializing std::hash for user defined types? Is there a good way to combine hash functions provided by a Standard Library implementation?
(Edit thanks to dyp.) Another question on StackOverflow addresses how to combine a pair of hash functions.
The other Google results are of no more help.
This Dr. Dobbs article states that XOR of two satisfactory hashes will produce a new satisfactory hash.
This article seems to speak from knowledge and implies many things but is light on details. It contradicts the Dr. Dobbs article in a brief remark in its first example, saying that using XOR to combine hash functions makes for a weak resulting hash function.
Because XOR applied to any two equal values results in 0, I can see why XOR by itself is weak.
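A one-line illustration of that failure mode (the value 42 is arbitrary; any pair of equal member hashes cancels the same way):

#include <cstddef>
#include <functional>

// Any object whose two combined members are equal hashes to 0 under plain XOR:
std::size_t xorOfEqual(int v) {
    return std::hash<int>()(v) ^ std::hash<int>()(v);   // always 0
}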
The Meta Question
A well reasoned answer explaining why this question is invalid and cannot be answered generally would also be welcome.
Answer 1:
One easy way is to use the boost::hash library and extend it for your type. It has a nice extension function hash_combine (std::hash lacks that) that allows easy composition of hashes of individual data members of your structures.

In other words:

- Overload boost::hash_value for your own type.
- Specialize std::hash for your own type and implement it using boost::hash_value.

This way you get the best of the std and boost worlds: both std::hash<> and boost::hash<> work for your type.
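A minimal sketch of those two steps, using a hypothetical Employee type (boost::hash_combine comes from <boost/functional/hash.hpp>):

#include <cstddef>
#include <functional>
#include <string>
#include <boost/functional/hash.hpp>   // boost::hash, boost::hash_combine

// Hypothetical user-defined type, used only for illustration.
namespace myapp {
    struct Employee {
        std::string name;
        int id;
    };

    // Step 1: overload hash_value in the type's namespace so that
    // boost::hash<Employee> picks it up via argument-dependent lookup.
    std::size_t hash_value(Employee const &e) {
        std::size_t seed = 0;
        boost::hash_combine(seed, e.name);
        boost::hash_combine(seed, e.id);
        return seed;
    }
}

// Step 2: specialize std::hash in terms of the same function.
namespace std {
    template <>
    struct hash< myapp::Employee > {
        size_t operator()(myapp::Employee const &e) const {
            return myapp::hash_value(e);
        }
    };
}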
A better way is to use the proposed new hashing infrastructure in N3980 Types Don't Know #. This infrastructure makes hash_combine unnecessary.
Answer 2:
First, the Dr. Dobbs article which says that XOR of two satisfactory hashes will produce a satisfactory hash is simply wrong. This is a good recipe for poor hashes. In general, to create a good hash, you start by decomposing your object into subobjects, for each of which there exists a good hash, and combining the hashes. One simple way of doing this is something like:
#include <cstddef>
#include <functional>

class HashAccumulator
{
    std::size_t myValue;
public:
    HashAccumulator() : myValue( 2166136261U ) {}

    template <typename T>
    HashAccumulator& operator+=( T const& nextValue )
    {
        // Multiply-and-add: fold the hash of the next member into the running value.
        myValue = 127U * myValue + std::hash<T>()( nextValue );
        return *this;
    }

    template <typename T>
    HashAccumulator operator+( T const& nextValue ) const
    {
        HashAccumulator results( *this );
        results += nextValue;
        return results;
    }

    std::size_t value() const { return myValue; }
};
(This has been designed so that you can use std::accumulate if you have a sequence of values.)
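For instance, with the value() accessor above, hashing a whole sequence might look like this (a small sketch):

#include <numeric>
#include <vector>

std::size_t hashSequence( std::vector<int> const& values )
{
    // operator+ lets std::accumulate fold each element into a fresh accumulator.
    return std::accumulate( values.begin(), values.end(), HashAccumulator() ).value();
}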
Of course, this supposes that all of the subtypes have good implementations of std::hash. For the basic types and strings, this is a given; for your own types, just apply the above rule recursively, specializing std::hash to use the HashAccumulator on its subtypes. For a standard container of a basic type, it's a bit trickier, because you're not (formally, at least) allowed to specialize a standard template on a type from the standard library; you'll probably have to create a class which uses HashAccumulator directly, and explicitly specify that class if you need a hash of such a container.
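For example, here is a minimal sketch (with a hypothetical Point3 type) of specializing std::hash in terms of the HashAccumulator above:

struct Point3 { int x; int y; int z; };

namespace std {
    template <>
    struct hash< Point3 >
    {
        size_t operator()( Point3 const& p ) const
        {
            HashAccumulator acc;
            acc += p.x;     // each member already has a good std::hash
            acc += p.y;
            acc += p.z;
            return acc.value();
        }
    };
}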
Answer 3:
Until we get a library in the standard to help with this:
- Download a modern hasher, like SpookyHash: http://burtleburtle.net/bob/hash/spooky.html.
- In the definition of std::hash<YourType>, create a SpookyHash instance, and Init it. Note that picking a random number at either process startup or std::hash construction, and using that as the initialization, will make it slightly harder to DoS your program, but doesn't fix the problem.
- Take each field in your structure that contributes to operator== ("salient field"), and feed it into SpookyHash::Update.
  - Beware types like double: they have 2 representations as char[] that compare ==: -0.0 and 0.0. Also beware types that have padding. On most machines, int doesn't, but it's hard to tell if a struct will. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#is_contiguously_hashable discusses this.
  - If you have sub-structures, you'll get a faster, higher-quality hash value from recursively feeding their fields into the same SpookyHash instance. However, this requires adding a method to those structures or manually extracting the salient fields; if you can't do this, it's acceptable to just feed their std::hash<> value into the top-level SpookyHash instance.
- Return the output of SpookyHash::Final from std::hash<YourType> (see the sketch after this list).
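A minimal sketch of those steps for a hypothetical Person type, assuming the SpookyV2.h header and the Init/Update/Final member functions as distributed at the link above:

#include <cstdint>
#include <functional>
#include <string>
#include "SpookyV2.h"          // assumed header name from the SpookyHash download

struct Person {
    std::string name;          // salient fields: both participate in operator==
    std::int32_t id;
    bool operator==(const Person &other) const {
        return name == other.name && id == other.id;
    }
};

namespace std {
    template <>
    struct hash< Person > {
        size_t operator()(const Person &p) const {
            SpookyHash h;
            h.Init(0x9e3779b9u, 0x85ebca6bu);          // fixed seeds; random seeds only make DoS harder
            h.Update(p.name.data(), p.name.size());    // feed the characters, not the string object
            h.Update(&p.id, sizeof p.id);              // int32_t: no padding, one representation per value
            std::uint64_t h1, h2;
            h.Final(&h1, &h2);
            return static_cast<size_t>(h1);
        }
    };
}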
Answer 4:
Your operation is required to:

- Return a value of type size_t.
- Be consistent with the == operator.
- Have a low probability of a hash collision for unequal values.
There is no explicit requirement that the hash values are evenly distributed over the range of size_t integers. cppreference.com notes that

"some implementations [of the standard library] use trivial (identity) hash functions which map an integer to itself"

Avoiding hash collisions coupled with that weakness means that a specialization of std::hash for your types should never simply use a (fast) bitwise XOR (^) to combine the sub-hashes of your data members. Consider this example:
#include <cstdint>
#include <functional>

struct Point {
    uint8_t x;
    uint8_t y;
};

namespace std {
    template<>
    struct hash< Point > {
        size_t operator()(const Point &p) const {
            // XOR of the two member hashes -- the weak approach under discussion
            return hash< uint8_t >()(p.x) ^ hash< uint8_t >()(p.y);
        }
    };
}
The hashes of p.x will be in the range [0,255], and so will the hashes of p.y. Therefore the hashes of a Point will also be in the range [0,255], with 256 (=2^8) possible values. There are 256*256 (=2^16) unique Point objects (a std::size_t will typically support 2^32 or 2^64 values). So the probability of a hash collision for a good hashing function should be approximately 2^(-16). Our function gives a probability of a hash collision of just under 2^(-8). This is terrible: our hash provides only 8 bits of information, but a good hash should provide 16 bits of information.
If the hashing functions of your data members provide only hash values in the low part of the std::size_t range, you must "shift" the bits of the component hashes before combining them, so they each contribute independent bits of information. Doing a left-wise shift looks simple:

return (hash< uint8_t >()(p.x) << 8) ^ hash< uint8_t >()(p.y);

but that will discard information (due to overflow) if the implementation of hash< uint8_t > (in this case) attempts to spread the hash-code values over the std::size_t range.
Accumulating component hash-code values using a multiply-by-prime-and-add method, as typically done in Java, probably works better in general:
namespace std {
    template<>
    struct hash< Point > {
        size_t operator()(const Point &p) const {
            // Multiply-by-prime-and-add accumulation of the member hashes
            const size_t prime = 257;
            size_t h {hash< uint8_t >()(p.x)};
            h = h * prime + hash< uint8_t >()(p.y);
            return h;
        }
    };
}
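With a specialization like that in place, the type can be used directly as a key without passing a hasher as the third template parameter (a small usage sketch; Point also needs operator==):

#include <unordered_set>

inline bool operator==(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

int main() {
    std::unordered_set<Point> points;   // picks up std::hash<Point> automatically
    points.insert(Point{1, 2});
    points.insert(Point{1, 2});         // duplicate: the set keeps a single element
    return points.size() == 1 ? 0 : 1;
}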
Source: https://stackoverflow.com/questions/24361884/how-to-specialize-stdhasht-for-user-defined-types