Question
In Python 2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. Mockup code looks like this:
    class PersistentDict(object):
        def __setitem__(self, key, value):
            bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
            self._store_to_bucket(bucket_index, key, value)

        def __getitem__(self, key):
            bucket_index = (hash(key) & 0xffffffff) % self.bucket_count
            return self._fetch_from_bucket(bucket_index)[key]
In Python 3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:
- Must be stable across interpreter invocations
- May require parameters supplied at execution time, e.g. setting a salt in the call
- Must support arbitrary objects (anything supported by dict/set)
I've already tried hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but good enough), both of which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.
[1] Using hash() to identify buckets is either:
- Not reliable across interpreter invocations, if salts are random
- Prevents applications from using the random salting feature, if salts are fixed
- Unusable if two PersistentDicts were created with different salts
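The effect described in [1] can be observed directly. The following is a minimal sketch, assuming the salt is controlled through the PYTHONHASHSEED environment variable; the helper name hash_in_fresh_interpreter is hypothetical and only serves the illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(value, seed):
    """Return hash(value) as computed by a new interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash({!r}))".format(value)],
        env=env,
    )
    return int(out)


# Two interpreters started with the same fixed seed should agree on a str hash,
# while a different seed should yield a different value.
same_seed = [hash_in_fresh_interpreter("foobar", 1) for _ in range(2)]
other_seed = hash_in_fresh_interpreter("foobar", 2)
print(same_seed, other_seed)

# Small ints are not salted, so their hash does not depend on the seed.
print(hash_in_fresh_interpreter(12345, 1), hash_in_fresh_interpreter(12345, 2))
```

Bucket indices derived from a salted hash are therefore only reproducible if every process that touches the store runs with the same seed.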
Answer 1:
I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:
    import zlib
    # Assumption: datetime_type refers to datetime.datetime, as the docstring suggests.
    from datetime import datetime as datetime_type

    def hashkey(obj, salt=0):
        """
        Create a key suitable for use in hashmaps

        :param obj: object for which to create a key
        :type obj: str, bytes, :py:class:`datetime.datetime`, object
        :param salt: an optional salt to add to the key value
        :type salt: int
        :return: numeric key to `obj`
        :rtype: int
        """
        if obj is None:
            return 0
        if isinstance(obj, str):
            return zlib.adler32(obj.encode(), salt) & 0xffffffff
        elif isinstance(obj, bytes):
            return zlib.adler32(obj, salt) & 0xffffffff
        elif isinstance(obj, datetime_type):
            return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
        # Everything else falls back to the built-in hash (salt is not applied here).
        return hash(obj) & 0xffffffff
With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead; for bytes it takes 0.8 usec and for str 0.7 usec.
Overhead is roughly as follows:
- 0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
- 0.2 usec to 0.5 usec for selecting the hash function via isinstance
- 0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~0.75 usec (adler and crc are +/- 4 usec)
- 0.15 usec for obj.encode() of str objects ("foobar")
- 1.5 usec for str(obj).encode() of datetime.datetime objects
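Figures like these depend heavily on the machine and the Python version. A rough sketch of how such per-call overheads could be measured with timeit follows; the helper usec_per_call and the sample values are assumptions for illustration, not the setup used for the numbers above.

```python
import datetime
import timeit
import zlib

# Sample inputs (arbitrary values chosen for illustration).
obj = object()
text = "foobar"
data = b"foobar"
now = datetime.datetime(2016, 6, 24, 12, 0, 0)


def pyhash(obj):
    """Thin wrapper, used to estimate the cost of one extra function call."""
    return hash(obj)


def usec_per_call(stmt, number=100000):
    """Best-of-three average time per execution of stmt, in microseconds."""
    best = min(timeit.repeat(stmt, repeat=3, number=number, globals=globals()))
    return best / number * 1e6


timings = {
    "hash(obj)": usec_per_call("hash(obj)"),
    "pyhash(obj)": usec_per_call("pyhash(obj)"),
    "zlib.adler32(data)": usec_per_call("zlib.adler32(data)"),
    "zlib.crc32(data)": usec_per_call("zlib.crc32(data)"),
    "text.encode()": usec_per_call("text.encode()"),
    "str(now).encode()": usec_per_call("str(now).encode()"),
}

for label, usec in timings.items():
    print("{:<20} {:.3f} usec".format(label, usec))
```

Taking the minimum of several repeats reduces noise from other processes; the absolute values will differ from those quoted above.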
Most of the optimization comes from the ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:
    def hashkey_c(obj, salt=0):
        if obj.__class__ in hashkey_c.types:
            if obj is None:
                return 0
            if obj.__class__ is str:
                return zlib.adler32(obj.encode(), salt) & 0xffffffff
            elif obj.__class__ is bytes:
                return zlib.adler32(obj, salt) & 0xffffffff
            elif obj.__class__ is datetime_type:
                return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
        return hash(obj) & 0xffffffff
    hashkey_c.types = {str, bytes, datetime_type, type(None)}
Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map each type to its hash function performs comparably, provided the membership check on the dict keys (i.e. the types) is done against a separate, explicit collection (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).
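One possible reading of the dict-based variant is sketched below. The names hashkey_d, DISPATCH and DISPATCH_TYPES are hypothetical, and the separate frozenset of types stands in for the "explicit" key check mentioned above.

```python
import datetime
import zlib

MASK = 0xffffffff


def _key_none(obj, salt):
    return 0


def _key_str(obj, salt):
    return zlib.adler32(obj.encode(), salt) & MASK


def _key_bytes(obj, salt):
    return zlib.adler32(obj, salt) & MASK


def _key_datetime(obj, salt):
    return zlib.adler32(str(obj).encode(), salt) & MASK


# Hypothetical dispatch table: exact type -> key function.
DISPATCH = {
    type(None): _key_none,
    str: _key_str,
    bytes: _key_bytes,
    datetime.datetime: _key_datetime,
}
# Separate collection of the handled types, checked before the dict lookup.
DISPATCH_TYPES = frozenset(DISPATCH)


def hashkey_d(obj, salt=0):
    cls = obj.__class__
    if cls in DISPATCH_TYPES:
        return DISPATCH[cls](obj, salt)
    # Plain objects, ints, etc. take the cheap path through the built-in hash.
    return hash(obj) & MASK


print(hashkey_d("foobar"), hashkey_d(b"foobar"), hashkey_d(42))
```

As with hashkey_c, only exact types are matched, so subclasses of str, bytes or datetime fall through to the built-in hash.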
Some additional notes:
- hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
- It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')
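The container limitation could be worked around by recursing into the container instead of hashing it as a whole. The sketch below is not part of the approach above; hashkey_nested is a hypothetical name, and it handles only exact tuple and frozenset types by chaining the element keys through zlib.adler32.

```python
import zlib

MASK = 0xffffffff


def hashkey_nested(obj, salt=0):
    """Hypothetical hashkey variant that recurses into tuples and frozensets."""
    if obj is None:
        return 0
    cls = obj.__class__
    if cls is str:
        return zlib.adler32(obj.encode(), salt) & MASK
    if cls is bytes:
        return zlib.adler32(obj, salt) & MASK
    if cls is tuple:
        # Feed each element key into a running checksum, preserving order.
        key = zlib.adler32(b"tuple", salt)
        for item in obj:
            key = zlib.adler32(hashkey_nested(item, salt).to_bytes(4, "big"), key)
        return key & MASK
    if cls is frozenset:
        # Sort the element keys first so that iteration order does not matter.
        key = zlib.adler32(b"frozenset", salt)
        for item_key in sorted(hashkey_nested(item, salt) for item in obj):
            key = zlib.adler32(item_key.to_bytes(4, "big"), key)
        return key & MASK
    # Still relies on the built-in hash, so objects using the default
    # __hash__ implementation remain unstable across interpreter starts.
    return hash(obj) & MASK


print(hashkey_nested((1, 2, "three")))
```

The extra recursion adds further per-call overhead, so this only pays off where such container keys actually occur.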
Source: https://stackoverflow.com/questions/38009699/alternative-to-python-hash-function-for-arbitrary-objects