Fast, large-width, non-cryptographic string hashing in Python

伪装坚强ぢ 2020-12-08 02:40

I need a high-performance string hashing function in Python that produces integers with at least 34 bits of output (64 bits would make sense, but

5 answers
  • 2020-12-08 03:15

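    The question says: "Use the built-in hash() function. This function, at least on the machine I'm developing for (with Python 2.7, and a 64-bit CPU) produces an integer that fits within 32 bits - not large enough for my purposes."

    That's not true in general. In Python 2.7 the built-in hash returns a C long, so it is a 64-bit value on 64-bit systems where long is 64 bits wide (Linux, macOS). It is only 32 bits wide on platforms whose long is 32 bits, such as 64-bit Windows.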

    This is the Python str hashing function from Objects/stringobject.c (Python version 2.7):

    static long
    string_hash(PyStringObject *a)
    {
        register Py_ssize_t len;
        register unsigned char *p;
        register long x;      /* Notice the 64-bit hash, at least on a 64-bit system */
    
        if (a->ob_shash != -1)
            return a->ob_shash;
        len = Py_SIZE(a);
        p = (unsigned char *) a->ob_sval;
        x = *p << 7;
        while (--len >= 0)
            x = (1000003*x) ^ *p++;
        x ^= Py_SIZE(a);
        if (x == -1)
            x = -2;
        a->ob_shash = x;
        return x;
    }
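    To see how the result width depends on the size of a C long, here is a rough Python sketch of the routine above. The function name py27_string_hash and the bits parameter are made up for illustration, and the sketch assumes the C arithmetic wraps in two's complement and that hash randomization is off (the Python 2 default).

```python
def py27_string_hash(data, bits=64):
    """Emulate string_hash() above for a C long that is `bits` wide."""
    mask = (1 << bits) - 1
    if not data:
        # In C, *p reads the terminating NUL byte, so x stays 0.
        return 0
    x = (data[0] << 7) & mask
    for byte in data:
        # Multiply-then-XOR step, truncated to the width of a C long.
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    # Reinterpret the unsigned bit pattern as a signed long.
    if x >= 1 << (bits - 1):
        x -= 1 << bits
    # -1 is reserved as an error marker in CPython.
    if x == -1:
        x = -2
    return x


print(py27_string_hash(b"a", bits=64))
print(py27_string_hash(b"a", bits=32))
```

    With bits=64 the value needs more than 32 bits even for a one-character string; with bits=32 the same input wraps into the 32-bit range, which would explain the result reported in the question.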
    
  • 2020-12-08 03:15

    "strings": I'm presuming you wish to hash Python 2.x str objects and/or Python 3.x bytes and/or bytearray objects.

    This may violate your first constraint, but: consider using something like

    (zlib.adler32(strg, perturber) << N) ^ hash(strg)
    

    to get a (32+N)-bit hash.
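    A minimal sketch of that expression as a function, assuming bytes input. The name wide_hash and the defaults are hypothetical; the built-in hash is masked to 32 bits here so the result really fits in 32+N bits on 64-bit builds too.

```python
import zlib


def wide_hash(data, perturber=1, n=32):
    """Combine adler32 and the built-in hash into a (32 + n)-bit integer."""
    # adler32 takes an optional starting value; mask for Python 2 compatibility.
    high = zlib.adler32(data, perturber) & 0xFFFFFFFF
    # Keep only the low 32 bits of the built-in hash (an assumption of this sketch).
    low = hash(data) & 0xFFFFFFFF
    return (high << n) ^ low


value = wide_hash(b"samplebias")
print(value.bit_length() <= 64)
```

    Because it relies on the built-in hash, the result is only as stable across interpreter runs as hash() itself.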

  • 2020-12-08 03:26

    BE CAREFUL WITH THE BUILT-IN HASH FUNCTION!

    Since Python 3.3, hash randomization is on by default: the hashes of str and bytes objects are salted with a seed chosen when the interpreter starts, so they change from run to run unless the PYTHONHASHSEED environment variable pins the seed. Native numeric types are not affected.

    $ python3 -c 'print(hash("Hello!"), hash(3.14))'
    -1756730906053498061 322818021289917443
    $ python3 -c 'print(hash("Hello!"), hash(3.14))'
    -4556027264747844925 322818021289917443
    $ python3 -c 'print(hash("Hello!"), hash(3.14))'
    -4403217265550417031 322818021289917443
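    If the values need to be reproducible between runs, the seed can be pinned with the PYTHONHASHSEED environment variable. The sketch below starts fresh interpreters to show the effect; the helper name hash_in_fresh_interpreter is made up for this example.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, seed=None):
    """Return hash(text) as computed by a newly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        # A fixed seed makes str hashes repeatable; 0 disables randomization.
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out)


pinned_a = hash_in_fresh_interpreter("Hello!", seed=0)
pinned_b = hash_in_fresh_interpreter("Hello!", seed=0)
print(pinned_a == pinned_b)
```

    Pinning the seed gives up the protection randomization was added for, so it is better suited to tests and debugging than to anything that hashes untrusted input.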
    
  • 2020-12-08 03:29

    Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. It should be possible to port this to Python, either in pure Python or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)
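    Throwing away bits is just a mask on the integer value. Here is a sketch of the idea; since MurmurHash3 is not in the standard library, hashlib.blake2b with a 16-byte digest stands in as the 128-bit source (it is NOT MurmurHash3), and both function names are invented for the example.

```python
import hashlib


def hash128_stand_in(data):
    """Stand-in 128-bit hash (blake2b, 16-byte digest); not MurmurHash3."""
    digest = hashlib.blake2b(data, digest_size=16).digest()
    return int.from_bytes(digest, "little")


def truncate_hash(value, bits):
    """Keep only the low `bits` bits of a wider hash value."""
    return value & ((1 << bits) - 1)


full = hash128_stand_in(b"samplebias")
h64 = truncate_hash(full, 64)   # 64-bit result
h34 = truncate_hash(full, 34)   # the 34-bit minimum from the question
print(h64.bit_length() <= 64, h34.bit_length() <= 34)
```

    The same mask works on the output of any wide hash that is well mixed, which is why a single 128-bit function can cover every smaller width.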

    If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.

    Update: I wrote a quick Python wrapper for the Murmur3 hash function. The GitHub project is here, and you can find it on the Python Package Index as well; it just needs a C++ compiler to build, and Boost is not required.

    Usage example and timing comparison:

    import murmur3
    import timeit
    
    # without seed
    print murmur3.murmur3_x86_64('samplebias')
    # with seed value
    print murmur3.murmur3_x86_64('samplebias', 123)
    
    # timing comparison with str __hash__
    t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
    print 'murmur3:', t.timeit()
    
    t = timeit.Timer("str.__hash__('hello')")
    print 'str.__hash__:', t.timeit()
    

    Output:

    15662901497824584782
    7997834649920664675
    murmur3: 0.264422178268
    str.__hash__: 0.219163894653
    
  • 2020-12-08 03:29

    If you can use Python 3.2, the hash result on 64-bit Windows is now a 64-bit value.
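    On Python 3.2 and later you can check the width directly instead of guessing from the platform, since sys.hash_info reports it. A small sketch:

```python
import sys

# Width in bits of the values returned by hash() on this build.
hash_width = sys.hash_info.width

# Pointer-sized integer width, for comparison.
pointer_width = sys.maxsize.bit_length() + 1

print("hash width:", hash_width)
print("pointer width:", pointer_width)
```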
