Question
The Question
What is a good specialization of std::hash for use in the third template parameter of std::unordered_map or std::unordered_set for a user defined type for which all member data types already have a good specialization of std::hash?
For this question, I define "good" as simple to implement and understand, reasonably efficient, and unlikely to produce hash table collisions. The definition of good does not include any statements about security.
The State of What is Google'able
At the moment, two StackOverflow questions are the first hits for a Google search of "std hash specialization".
The first, How to specialize std::hash::operator() for user-defined type in unordered containers?, addresses whether it is legal to open the std namespace and add template specializations.
The second, How to specialize std::hash for type from other library, essentially addresses that same question.
This leaves the current question. Given that implementations of the C++ Standard Library define hash functions for primitive types and types in the Standard Library, what is a simple and effective way of specializing std::hash for user defined types? Is there a good way to combine hash functions provided by a Standard Library implementation?
(Edit thanks to dyp.) Another question on StackOverflow addresses how to combine a pair of hash functions.
The other Google results are of no more help.
This Dr. Dobbs article states that XOR of two satisfactory hashes will produce a new satisfactory hash.
This article seems to speak from knowledge and implies many things but is light on details. It contradicts the Dr. Dobbs article in a brief remark in its first example, saying that using XOR to combine hash functions makes for a weak resulting hash function.
Because XOR applied to any two equal values results in 0, I can see why XOR by itself is weak.
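A one-line illustration of that failure mode (the value 42 is arbitrary; any pair of equal member hashes cancels the same way):

#include <cstddef>
#include <functional>

// Any object whose two combined members are equal hashes to 0 under plain XOR:
std::size_t xorOfEqual(int v) {
    return std::hash<int>()(v) ^ std::hash<int>()(v);   // always 0
}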
The Meta Question
A well reasoned answer explaining why this question is invalid and cannot be answered generally would also be welcome.
Answer 1:
One easy way is to use the boost::hash library and extend it for your type. It has a nice extension function hash_combine (std::hash lacks that) that allows easy composition of hashes of individual data members of your structures.

In other words:

- Overload boost::hash_value for your own type.
- Specialize std::hash for your own type and implement it using boost::hash_value.

This way you get the best of the std and boost worlds: both std::hash<> and boost::hash<> work for your type.
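A minimal sketch of those two steps, using a hypothetical Employee type (boost::hash_combine comes from <boost/functional/hash.hpp>):

#include <cstddef>
#include <functional>
#include <string>
#include <boost/functional/hash.hpp>   // boost::hash, boost::hash_combine

// Hypothetical user-defined type, used only for illustration.
namespace myapp {
    struct Employee {
        std::string name;
        int id;
    };

    // Step 1: overload hash_value in the type's namespace so that
    // boost::hash<Employee> picks it up via argument-dependent lookup.
    std::size_t hash_value(Employee const &e) {
        std::size_t seed = 0;
        boost::hash_combine(seed, e.name);
        boost::hash_combine(seed, e.id);
        return seed;
    }
}

// Step 2: specialize std::hash in terms of the same function.
namespace std {
    template <>
    struct hash< myapp::Employee > {
        size_t operator()(myapp::Employee const &e) const {
            return myapp::hash_value(e);
        }
    };
}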
A better way is to use the proposed new hashing infrastructure in N3980 Types Don't Know #. This infrastructure makes hash_combine unnecessary.
Answer 2:
First, the Dr. Dobbs article which says that XOR of two satisfactory hashes will produce a satisfactory hash is simply wrong. This is a good recipe for poor hashes. In general, to create a good hash, you start by decomposing your object into subobjects, for each of which there exists a good hash, and combining the hashes. One simple way of doing this is something like:
#include <cstddef>
#include <functional>

class HashAccumulator
{
    std::size_t myValue;
public:
    HashAccumulator() : myValue( 2166136261U ) {}

    template <typename T>
    HashAccumulator& operator+=( T const& nextValue )
    {
        // Multiply-and-add: fold the hash of the next member into the running value.
        myValue = 127U * myValue + std::hash<T>()( nextValue );
        return *this;
    }

    template <typename T>
    HashAccumulator operator+( T const& nextValue ) const
    {
        HashAccumulator results( *this );
        results += nextValue;
        return results;
    }

    std::size_t value() const { return myValue; }
};
(This has been designed so that you can use std::accumulate if you have a sequence of values.)
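For instance, with the value() accessor above, hashing a whole sequence might look like this (a small sketch):

#include <numeric>
#include <vector>

std::size_t hashSequence( std::vector<int> const& values )
{
    // operator+ lets std::accumulate fold each element into a fresh accumulator.
    return std::accumulate( values.begin(), values.end(), HashAccumulator() ).value();
}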
Of course, this supposes that all of the subtypes have good implementations of std::hash. For the basic types and strings, this is a given; for your own types, just apply the above rule recursively, specializing std::hash to use the HashAccumulator on its subtypes. For a standard container of a basic type, it's a bit trickier, because you're not (formally, at least) allowed to specialize a standard template on a type from the standard library; you'll probably have to create a class which uses HashAccumulator directly, and explicitly specify that class if you need a hash of such a container.
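For example, here is a minimal sketch (with a hypothetical Point3 type) of specializing std::hash in terms of the HashAccumulator above:

struct Point3 { int x; int y; int z; };

namespace std {
    template <>
    struct hash< Point3 >
    {
        size_t operator()( Point3 const& p ) const
        {
            HashAccumulator acc;
            acc += p.x;     // each member already has a good std::hash
            acc += p.y;
            acc += p.z;
            return acc.value();
        }
    };
}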
Answer 3:
Until we get a library in the standard to help with this:
- Download a modern hasher, like SpookyHash: http://burtleburtle.net/bob/hash/spooky.html.
- In the definition of std::hash<YourType>, create a SpookyHash instance, and Init it. Note that picking a random number at either process startup or std::hash construction, and using that as the initialization, will make it slightly harder to DoS your program, but doesn't fix the problem.
- Take each field in your structure that contributes to operator== ("salient field"), and feed it into SpookyHash::Update.
  - Beware types like double: they have 2 representations as char[] that compare ==: -0.0 and 0.0. Also beware types that have padding. On most machines, int doesn't, but it's hard to tell if a struct will. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#is_contiguously_hashable discusses this.
  - If you have sub-structures, you'll get a faster, higher-quality hash value from recursively feeding their fields into the same SpookyHash instance. However, this requires adding a method to those structures or manually extracting the salient fields; if you can't do this, it's acceptable to just feed their std::hash<> value into the top-level SpookyHash instance.
- Return the output of SpookyHash::Final from std::hash<YourType> (see the sketch after this list).
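A minimal sketch of those steps for a hypothetical Person type, assuming the SpookyV2.h header and the Init/Update/Final member functions as distributed at the link above:

#include <cstdint>
#include <functional>
#include <string>
#include "SpookyV2.h"          // assumed header name from the SpookyHash download

struct Person {
    std::string name;          // salient fields: both participate in operator==
    std::int32_t id;
    bool operator==(const Person &other) const {
        return name == other.name && id == other.id;
    }
};

namespace std {
    template <>
    struct hash< Person > {
        size_t operator()(const Person &p) const {
            SpookyHash h;
            h.Init(0x9e3779b9u, 0x85ebca6bu);          // fixed seeds; random seeds only make DoS harder
            h.Update(p.name.data(), p.name.size());    // feed the characters, not the string object
            h.Update(&p.id, sizeof p.id);              // int32_t: no padding, one representation per value
            std::uint64_t h1, h2;
            h.Final(&h1, &h2);
            return static_cast<size_t>(h1);
        }
    };
}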
Answer 4:
Your operation is required to:

- Return a value of type size_t.
- Be consistent with the == operator.
- Have a low probability of a hash collision for unequal values.
There is no explicit requirement that the hash values are evenly distributed over the range of size_t integers. cppreference.com notes that

"some implementations [of the standard library] use trivial (identity) hash functions which map an integer to itself"

Avoiding hash collisions coupled with that weakness means that a specialization of std::hash for your types should never simply use a (fast) bitwise XOR (^) to combine the sub-hashes of your data members. Consider this example:
#include <cstdint>
#include <functional>

struct Point {
    uint8_t x;
    uint8_t y;
};

namespace std {
    template<>
    struct hash< Point > {
        size_t operator()(const Point &p) const {
            // XOR of the two member hashes -- the weak approach under discussion
            return hash< uint8_t >()(p.x) ^ hash< uint8_t >()(p.y);
        }
    };
}
The hashes of p.x will be in the range [0,255], and so will the hashes of p.y. Therefore the hashes of a Point will also be in the range [0,255], with 256 (=2^8) possible values. There are 256*256 (=2^16) unique Point objects (a std::size_t will typically support 2^32 or 2^64 values). So the probability of a hash collision for a good hashing function should be approximately 2^(-16). Our function gives a probability of a hash collision of just under 2^(-8). This is terrible: our hash provides only 8 bits of information, but a good hash should provide 16 bits of information.
If the hashing functions of your data members provide only hash values in the low part of the std::size_t range, you must "shift" the bits of the component hashes before combining them, so they each contribute independent bits of information. Doing a left-wise shift looks simple:

return (hash< uint8_t >()(p.x) << 8) ^ hash< uint8_t >()(p.y);

but that will discard information (due to overflow) if the implementation of hash< uint8_t > (in this case) attempts to spread the hash-code values over the std::size_t range.
Accumulating component hash-code values using a multiply-by-prime-and-add method, as typically done in Java, probably works better in general:
namespace std {
    template<>
    struct hash< Point > {
        size_t operator()(const Point &p) const {
            // Multiply-by-prime-and-add accumulation of the member hashes
            const size_t prime = 257;
            size_t h {hash< uint8_t >()(p.x)};
            h = h * prime + hash< uint8_t >()(p.y);
            return h;
        }
    };
}
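With a specialization like that in place, the type can be used directly as a key without passing a hasher as the third template parameter (a small usage sketch; Point also needs operator==):

#include <unordered_set>

inline bool operator==(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

int main() {
    std::unordered_set<Point> points;   // picks up std::hash<Point> automatically
    points.insert(Point{1, 2});
    points.insert(Point{1, 2});         // duplicate: the set keeps a single element
    return points.size() == 1 ? 0 : 1;
}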
Source: https://stackoverflow.com/questions/24361884/how-to-specialize-stdhasht-for-user-defined-types