What is the quickest way to hash a large arbitrary object?

问题

I am writing a method to generate cache keys for caching function results, the key is based on a combination of function name and hash value of parameters.

Currently I am using hashlib to hash the serialized version of parameters, however the operation is very expensive to serialize large objects, so what's the alternative?

#get the cache key for storage
def cache_get_key(*args):
    import hashlib
    serialise = []
    for arg in args:
        serialise.append(str(arg))
    key = hashlib.md5("".join(serialise)).hexdigest()
    return key

UPDATE: I tried using hash(str(args)), but if args have relatively large data in it, still takes long time to compute the hash value. Any better way to do it?

Actually str(args) with large data takes forever...

回答1:

Have you tried just using the hash function? It works perfectly well on tuples.

回答2:

Assuming you made the object, and it is composed of smaller components (it is not a binary blob), you can precompute the hash when you build the object by using the hashes of its subcomponents.

For example, rather than serialize(repr(arg)), do arg.precomputedHash if isinstance(arg, ...) else serialize(repr(arg))

If you neither make your own objects nor use hashable objects, you can perhaps keep a memoization table of objectreferences -> hashes, assuming you don't mutate the objects. Worst case, you can use a functional language which allows for memoization, since all objects in such a language are probably immutable and hence hashable.

回答3:

def cache_get_key(*args):
    return hash(str(args))

or (if you really want to use the hashlib library)

def cache_get_key(*args):
    return hashlib.md5(str(args)).hexdigest()

I wouldn't bother rewriting code to make arrays into strings. Use the inbuilt one.

alternative solution

below is the solution @8bitwide suggested. No hashing required at all with this solution!

def foo(x, y):
    return x+y+1

result1 = foo(1,1)
result2 = foo(2,3)

results = {}
results[foo] = {}
results[foo][ [1,1] ] = result1
results[foo][ [2,3] ] = result2

回答4:

I've seen people feed an arbitrary python object to random.seed(), and then use the first value back from random.random() as the "hash" value. It doesn't give a terrific distribution of values (can be skewed), but it seems to work for arbitrary objects.

If you don't need cryptographic-strength hashes, I came up with a pair of hash functions for a list of integers that I use in a bloom filter. They appear below. The bloom filter actually uses linear combinations of these two hash functions to obtain an arbitrarily large number of hash functions, but they should work fine in other contexts that just need a bit of scattering with a decent distribution. They're inspired by Knuth's writing on Linear Congruential Random Number Generation. They take a list of integers as input, which I believe could just be the ord()'s of your serialized characters.

MERSENNES1 = [ 2 ** x - 1 for x in [ 17, 31, 127 ] ]
MERSENNES2 = [ 2 ** x - 1 for x in [ 19, 67, 257 ] ]


def simple_hash(int_list, prime1, prime2, prime3):
    '''Compute a hash value from a list of integers and 3 primes'''
    result = 0
    for integer in int_list:
        result += ((result + integer + prime1) * prime2) % prime3
    return result


def hash1(int_list):
    '''Basic hash function #1'''
    return simple_hash(int_list, MERSENNES1[0], MERSENNES1[1], MERSENNES1[2])


def hash2(int_list):
    '''Basic hash function #2'''
    return simple_hash(int_list, MERSENNES2[0], MERSENNES2[1], MERSENNES2[2])

来源：https://stackoverflow.com/questions/10097064/what-is-the-quickest-way-to-hash-a-large-arbitrary-object

标签

python

caching

hash