Question
How do we shrink/encode a 20-letter string into 6 letters? I found a few algorithms that address data compression, like RLE, arithmetic coding, and universal codes, but none of them guarantees 6 letters.
The original string can contain the characters A-Z (upper case), 0-9, and a dash.
Answer 1:
If your goal is to losslessly compress or hash a random input string of 20 characters (each character being one of A-Z, 0-9, or -) into an output string of 6 characters, it is theoretically impossible.
In information theory, given a discrete random variable X with possible values {x_1, ..., x_n}, the Shannon entropy H(X) is defined as:

H(X) = - Σ_{i=1..n} p(x_i) · log2 p(x_i)

where p(x_i) is the probability of X = x_i. In your case, the input is a string of 20 characters, each drawn from 37 possible characters, so X ranges over n = 37^20 possible values. Supposing the 37 characters are equally likely (i.e., the input string is random), then p(x_i) = 1/37^20, and the Shannon entropy of the input is:

H(X) = - Σ_{i=1..37^20} (1/37^20) · log2 (1/37^20) = 20 · log2 37 ≈ 104.2 bits

A char on a common computer holds 8 bits, so 6 chars can hold 48 bits. There is no way to fit about 104 bits of information into 6 chars; you need at least ⌈104.2 / 8⌉ = 14 chars instead.
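The arithmetic above can be checked with a short Python sketch (the 37-symbol alphabet and the 20-character length are taken from the question):

```python
import math

ALPHABET_SIZE = 37   # A-Z, 0-9, and a dash
STRING_LENGTH = 20

# Entropy of a uniformly random 20-char string over a 37-symbol alphabet:
# H = 20 * log2(37)
entropy_bits = STRING_LENGTH * math.log2(ALPHABET_SIZE)

# Minimum number of 8-bit chars needed to store that much information.
min_chars = math.ceil(entropy_bits / 8)

print(f"entropy: {entropy_bits:.1f} bits")   # ~104.2 bits
print(f"minimum 8-bit chars: {min_chars}")   # 14
```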
If you do allow loss and want to hash the 20 chars into 6 chars, then you are trying to map 37^20 values onto 128^6 keys. It can be done, but you will get plenty of hash collisions.
In your case, supposing you hash with perfect uniformity (otherwise it would be worse), each input value would share its hash key with an average of about 5.26 × 10^18 other input values. By a birthday attack, we could expect to find a collision within roughly √(128^6) ≈ 2 million trials. That could be done in seconds on a common laptop, so I don't think this would be a safe hash.
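As a rough check of those collision numbers, here is a sketch using the standard birthday-bound approximation (the 128^6 key space follows the answer's assumption that each output char can take 128 values):

```python
import math

NUM_INPUTS = 37 ** 20   # possible 20-char input strings
NUM_KEYS = 128 ** 6     # possible 6-char hash outputs (128 values per char)

# With a perfectly uniform hash, each key is shared by this many inputs on average.
inputs_per_key = NUM_INPUTS / NUM_KEYS
print(f"inputs per key: {inputs_per_key:.3e}")  # ~5.26e+18

# Birthday bound: ~sqrt(pi/2 * N) random trials give a ~50% chance of a collision.
expected_trials = math.sqrt(math.pi / 2 * NUM_KEYS)
print(f"trials for ~50% collision chance: {expected_trials:,.0f}")  # ~2.6 million
```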
However, if you insist on doing that, you might want to read about hash function algorithms. There are plenty of them to choose from. Good luck!
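For illustration, here is a minimal sketch of such a lossy 6-character hash using Python's standard `hashlib`, truncating a SHA-256 digest and re-encoding it into the question's 37-symbol alphabet. It is deterministic but, as the entropy argument above shows, nowhere near collision-safe:

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"  # the question's 37 symbols

def hash_to_6_chars(s: str) -> str:
    """Lossy hash: digest the input, then keep 6 base-37 digits of it.

    37^6 distinct outputs carry only ~31 bits, so collisions among
    37^20 possible inputs are unavoidable.
    """
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    n = int.from_bytes(digest, "big")
    out = []
    for _ in range(6):
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(out)

print(hash_to_6_chars("ABCDEFGHIJ0123456789"))  # always the same 6 chars for this input
```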
Source: https://stackoverflow.com/questions/20765078/shrink-string-encoding-algorithm