Short Python alphanumeric hash with minimal collisions

隐瞒了意图╮ 2020-12-16 09:32

I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32 characters).

What are some alternative

5 answers
  • 2020-12-16 10:07

    You can use something like base 32 notation. It is more compact than decimal notation, case-insensitive, and collision-free. Just encode a plain old sequence number to generate a short hash-like code; since every sequence number is unique, so is every code.

    If the key is not for human consumption, you can use base 64 notation, which is case sensitive but a little more compact.

    See http://code.google.com/p/py-cupom/ for an example.
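
    A minimal sketch of the idea, not taken from py-cupom: the function names `encode_id` and `decode_id` are hypothetical, and the alphabet (digits plus uppercase letters without I, L, O, U) is just one possible choice of 32 symbols.

```python
# Hypothetical helpers for illustration; the alphabet is an assumption,
# any 32 distinct characters would work.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_id(n: int) -> str:
    """Encode a non-negative sequence number as a base-32 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 32)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_id(code: str) -> int:
    """Decode a base-32 string back to its sequence number (case-insensitive)."""
    n = 0
    for ch in code.upper():
        n = n * 32 + ALPHABET.index(ch)  # raises ValueError on an unknown character
    return n
```

    Because the mapping is reversible, two different sequence numbers can never produce the same code.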

  • 2020-12-16 10:08

    Below is a solution that uses alphanumeric characters plus a few punctuation characters. It returns very short strings (around 8 characters).

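    import binascii, struct

    def myhash(s):
        # hash() is 64-bit on most platforms, so mask it to 32 bits before packing.
        # Note: Python 3 salts str hashes per process unless PYTHONHASHSEED is set,
        # so the result is not stable across runs.
        packed = struct.pack('I', hash(s) & 0xFFFFFFFF)
        return binascii.b2a_base64(packed).decode('ascii').strip()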
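
    One caveat for primary keys: in Python 3 the built-in hash() of a string is randomized per process by default, so a key derived from it will generally differ between runs. A possible variant, swapping in zlib.crc32 as a stable 32-bit checksum (this is a different technique from the code above, and `stable_short_hash` is a hypothetical name):

```python
import base64
import struct
import zlib


def stable_short_hash(s: str) -> str:
    """Return an 8-character URL-safe base64 string derived from CRC-32 of s."""
    checksum = zlib.crc32(s.encode("utf-8"))  # unsigned 32-bit integer in Python 3
    packed = struct.pack(">I", checksum)      # 4 bytes, big-endian
    return base64.urlsafe_b64encode(packed).decode("ascii")
```

    With only 32 bits, collisions become likely once the table holds on the order of tens of thousands of keys, so this only suits small tables.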
    
  • 2020-12-16 10:19

    Hashids is a library (with Python support) that creates hashes that you can encode/decode very easily.

    http://hashids.org/python/

  • 2020-12-16 10:21

    Why don't you just truncate SHA1 or MD5? You'll have more collisions than if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.

    import base64
    import hashlib
    hasher = hashlib.sha1(b"The quick brown fox")  # hashlib requires bytes in Python 3
    base64.urlsafe_b64encode(hasher.digest()[:10])
    

    You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.

    EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which use - and _ rather than + and /.
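
    To make the tradeoff adjustable, the truncation length can be a parameter. A minimal sketch, assuming UTF-8 text input; the name `short_key` and the choice to strip the `=` padding are illustrative, not part of the answer above:

```python
import base64
import hashlib


def short_key(text: str, nbytes: int = 10) -> str:
    """Return a URL-safe key built from the first nbytes of the SHA-1 digest."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()  # 20 bytes
    if not 1 <= nbytes <= len(digest):
        raise ValueError("nbytes must be between 1 and 20")
    encoded = base64.urlsafe_b64encode(digest[:nbytes]).decode("ascii")
    return encoded.rstrip("=")  # padding carries no information for a fixed nbytes
```

    With the default of 10 bytes (80 bits) the key is 14 characters long; each byte removed shortens the key but raises the chance of a collision.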

  • 2020-12-16 10:31

    The smallest builtin hash I am aware of is md5

    >>> import hashlib, base64
    >>> d = hashlib.md5(b"hello worlds").digest()
    >>> d = base64.b64encode(d)
    >>> print(d)
    b'S27ylES0wiLdFAGdUpFgCQ=='
    

    A low collision rate and a short hash are somewhat mutually exclusive because of the birthday paradox.
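
    To put rough numbers on that, the usual birthday approximation p ≈ 1 - exp(-n(n-1) / (2·2^b)) estimates the chance of at least one collision among n keys of b bits each. A small sketch, assuming the hash output is uniformly distributed (`collision_probability` is a hypothetical helper name):

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision among num_keys hashes."""
    space = 2.0 ** hash_bits
    exponent = num_keys * (num_keys - 1) / (2.0 * space)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate even for tiny x
```

    Under this approximation a 32-bit hash reaches about a 50% collision chance at roughly 77,000 keys, while a full 128-bit MD5 digest stays negligible for any realistic table size.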

    To make it URL-safe, use the corresponding function from the base64 module.

    >>> import base64, hashlib
    >>> base64.urlsafe_b64encode(hashlib.md5(b"hello world").digest())
    b'XrY7u-Ae7tCTyyK7j1rNww=='
    

    However, there should be no problem storing the 16-byte md5 digest in the database in binary form.

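    >>> import hashlib
    >>> from urllib.parse import quote_plus  # Python 3; in Python 2 use urllib.quote_plus
    >>> md5bytes = hashlib.md5(b"hello world").digest()
    >>> len(md5bytes)
    16
    >>> quote_plus(md5bytes)
    '%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3'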
    

    Python 2

    >>> base64.urlsafe_b64encode(md5bytes)
    'XrY7u-Ae7tCTyyK7j1rNww=='
    

    Python 3

    >>> base64.urlsafe_b64encode(md5bytes).decode('ascii')
    'XrY7u-Ae7tCTyyK7j1rNww=='
    

    You can choose either quote_plus or urlsafe_b64encode for your URL, then decode with the corresponding function, unquote_plus or urlsafe_b64decode, before you look the key up in the database.
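
    The full round trip might look like the sketch below, which stores the raw digest as a binary key, exposes it as a URL-safe token, and decodes the token again before the lookup. It uses an in-memory SQLite database purely for illustration; the table name `items` and the helper names are assumptions.

```python
import base64
import hashlib
import sqlite3


def url_token(digest: bytes) -> str:
    """Encode a binary digest for use in a URL."""
    return base64.urlsafe_b64encode(digest).decode("ascii")


def digest_from_token(token: str) -> bytes:
    """Recover the binary digest from a URL token."""
    return base64.urlsafe_b64decode(token.encode("ascii"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id BLOB PRIMARY KEY, body TEXT)")

# Store the 16-byte MD5 digest itself as the primary key.
body = "hello world"
key = hashlib.md5(body.encode("utf-8")).digest()
conn.execute("INSERT INTO items (id, body) VALUES (?, ?)", (key, body))

# Hand the token out in a URL, then decode it before querying.
token = url_token(key)
row = conn.execute(
    "SELECT body FROM items WHERE id = ?", (digest_from_token(token),)
).fetchone()
```

    Keeping the key binary in the database and encoding only at the URL boundary keeps the stored key at 16 bytes instead of 24 characters.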
