I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32 characters).
What are some alternatives?
You can use something like base 32 notation. It is more compact than decimal notation, case insensitive and collision-free. Just encode a plain old sequence number to generate a short hash-like code.
If the key is not for human consumption, you can use base 64 notation, which is case sensitive but a little more compact.
See http://code.google.com/p/py-cupom/ for an example.
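As a rough sketch of the sequence-number idea (not taken from py-cupom): the alphabet below is a Crockford-style one that drops I, L, O and U, and the names `ALPHABET`, `encode_base32` and `decode_base32` are made up for illustration.

```python
# Illustrative 32-symbol alphabet (assumption): digits plus upper-case
# letters, omitting I, L, O and U to avoid look-alike characters.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_base32(n):
    """Encode a non-negative integer (e.g. a sequence number) in base 32."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 32)
        digits.append(ALPHABET[rem])
    # Digits were collected least-significant first, so reverse them.
    return "".join(reversed(digits))


def decode_base32(code):
    """Inverse of encode_base32; upper-cases input so codes are case insensitive."""
    n = 0
    for ch in code.upper():
        n = n * 32 + ALPHABET.index(ch)
    return n
```

Because every sequence number maps to exactly one code, two different rows can never get the same key, which is where the collision-free property comes from.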
Below is a solution that uses alphanumeric characters plus a few punctuation characters. It returns very short strings (around 8 characters).
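import binascii, struct
def myhash(s):
    # Mask to 32 bits: hash() can exceed a C int on 64-bit builds.
    # Note: str hashes differ between runs unless PYTHONHASHSEED is fixed.
    return binascii.b2a_base64(struct.pack('I', hash(s) & 0xFFFFFFFF)).strip()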
Hashids is a library (with Python support) that creates hashes that you can encode/decode very easily.
http://hashids.org/python/
Why don't you just truncate SHA1 or MD5? You'll have more collisions than if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.
import base64
import hashlib
hasher = hashlib.sha1(b"The quick brown fox")  # bytes input works on Python 2 and 3
base64.urlsafe_b64encode(hasher.digest()[:10])
You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.
EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which use - and _ rather than + and /.
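To make the truncation tradeoff a bit more concrete, here is a minimal sketch that wraps the idea in a helper. The name `short_key` and the `nbytes` parameter are hypothetical, and dropping the `=` padding is an extra choice of mine, not something the answer above requires.

```python
import base64
import hashlib


def short_key(data, nbytes=10):
    """Return a URL-safe key from the first nbytes of SHA-1(data).

    data must be bytes. Fewer bytes give a shorter key but more collisions.
    """
    digest = hashlib.sha1(data).digest()[:nbytes]
    # Strip '=' padding: for a fixed nbytes it carries no information.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

With this sketch, 6 bytes (48 bits) come out as 8 characters, 10 bytes (80 bits) as 14 characters, and the full 20-byte digest as 27 characters.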
The smallest built-in hash I am aware of is MD5:
>>> import hashlib, base64
>>> d=hashlib.md5(b"hello worlds").digest(); d=base64.b64encode(d);
>>> print(d)
b'S27ylES0wiLdFAGdUpFgCQ=='
Low collision rates and short hashes are somewhat mutually exclusive due to the birthday paradox.
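To put rough numbers on that, the usual birthday approximation is p ≈ 1 - exp(-n(n-1) / (2 · 2^bits)) for n keys drawn from a hash with the given number of bits. The helper below is only a sketch of that formula; the name `collision_probability` is made up, and it assumes the hash output is uniformly distributed.

```python
import math


def collision_probability(n_keys, bits):
    """Approximate chance that at least two of n_keys uniformly random
    hashes of the given bit length are equal (birthday approximation)."""
    exponent = -n_keys * (n_keys - 1) / (2.0 * 2 ** bits)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)
```

For example, a 32-bit hash reaches roughly a 50% chance of a collision at about 77,000 keys, while a full 128-bit MD5 digest stays negligible even for millions of keys.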
To make it URL-safe, use the urlsafe_b64encode function from the base64 module:
>>> import base64, hashlib
>>> base64.urlsafe_b64encode(hashlib.md5(b"hello world").digest())
'XrY7u-Ae7tCTyyK7j1rNww=='
However, there should be no problem storing the 16-byte MD5 digest in the database in binary form.
>>> md5bytes=hashlib.md5(b"hello world").digest()
>>> len(md5bytes)
16
>>> import urllib  # Python 2; in Python 3 use urllib.parse.quote_plus instead
>>> urllib.quote_plus(md5bytes)
'%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3'
Python 2
>>> base64.urlsafe_b64encode(md5bytes)
'XrY7u-Ae7tCTyyK7j1rNww=='
Python 3
>>> base64.urlsafe_b64encode(md5bytes).decode('ascii')
'XrY7u-Ae7tCTyyK7j1rNww=='
You can choose either quote_plus or urlsafe_b64encode for your URL, then decode with the corresponding function, unquote_plus or urlsafe_b64decode, before you look them up in the database.
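A minimal round-trip sketch for the base64 variant, written for Python 3. The helper names `digest_to_token` and `token_to_digest` are hypothetical, and the database lookup itself is left out.

```python
import base64
import hashlib


def digest_to_token(digest):
    """Encode raw digest bytes as URL-safe text for use in a URL."""
    return base64.urlsafe_b64encode(digest).decode("ascii")


def token_to_digest(token):
    """Recover the raw digest bytes from a URL token before the lookup."""
    return base64.urlsafe_b64decode(token.encode("ascii"))


md5bytes = hashlib.md5(b"hello world").digest()
token = digest_to_token(md5bytes)       # goes into the URL
recovered = token_to_digest(token)      # 16 raw bytes, usable as the binary key
```

The database then only ever sees the 16-byte binary value, and the text form exists only in the URL.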