I am looking for a way to identify (i.e. encode and decode) a set of Java strings with one token. The identification should not involve DB persistence. So far I
Rot13 obfuscates but does not shorten. Zip shortens (usually) but does not survive the URL round trip. Encryption will not shorten, and may lengthen. Hashing shortens but is one-way. You do not have an easy problem. Base32 is case insensitive, but takes more space than Base64, which isn't. I suspect that you are going to have to drop or modify your requirements. Which requirements are most important and which least important?
I have spent some time on this and I have a good solution for you.
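Treat each input character as a 6-bit value (a base64-style index into the alphabet 0-9a-zA-Z), then re-encode the resulting bit stream with a custom base32 alphabet that uses 0-9a-v. Essentially, you lay out the bits 6 at a time, then encode them 5 at a time. The extra space is modest: the token is about 20% longer than the input (6 bits in per character, 5 bits out per character). For example, the 18-character ABCXYZdefxyz123789 encodes as the 22-character i9crnsuj9ov1h8o4433i14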
Here's an implementation that works, including some test code that proves it is case-insensitive:
    import org.junit.Assert;

    public class TokenCodec {
        // Each char's position in this string is its 6-bit value. Only 62 of the
        // 64 possible values are used, so you can add 2 more chars if you want to.
        static String chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

        private static String decodeToken(String encoded) {
            // Lay out the bits 5 at a time (token chars are 0-9a-v, i.e. the first 32)
            StringBuilder sb = new StringBuilder();
            for (char c : encoded.toLowerCase().toCharArray())
                sb.append(asBits(indexOf(c, 32), 5));
            // Drop the padding bits that generateToken added
            sb.setLength(sb.length() - (sb.length() % 6));
            // Consume it 6 bits at a time
            int length = sb.length();
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < length; i += 6)
                result.append(chars.charAt(Integer.parseInt(sb.substring(i, i + 6), 2)));
            return result.toString();
        }

        private static String generateToken(String x) {
            // Lay out the bits 6 at a time
            StringBuilder sb = new StringBuilder();
            for (char c : x.toCharArray())
                sb.append(asBits(indexOf(c, chars.length()), 6));
            // Round up to a 5 bit multiple
            sb.append("00000".substring(0, (5 - sb.length() % 5) % 5));
            // Consume it 5 bits at a time
            int length = sb.length();
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < length; i += 5)
                result.append(chars.charAt(Integer.parseInt(sb.substring(i, i + 5), 2)));
            return result.toString();
        }

        // Position of c in the alphabet; rejects chars outside the first `limit` entries
        private static int indexOf(char c, int limit) {
            int index = chars.indexOf(c);
            if (index < 0 || index >= limit)
                throw new IllegalArgumentException("Unsupported character: " + c);
            return index;
        }

        // Binary representation of index, left-padded with zeros to `width` digits
        private static String asBits(int index, int width) {
            String bits = "000000" + Integer.toBinaryString(index);
            return bits.substring(bits.length() - width);
        }

        public static void main(String[] args) {
            String input = "ABCXYZdefxyz123789";
            String token = generateToken(input);
            System.out.println(input + " ==> " + token);
            Assert.assertEquals("mixed", input, decodeToken(token));
            Assert.assertEquals("lower", input, decodeToken(token.toLowerCase()));
            Assert.assertEquals("upper", input, decodeToken(token.toUpperCase()));
            System.out.println("pass");
        }
    }
What is the structure of the text (i.e. the set of strings)? You could use your knowledge of it to encode it in a shorter form. E.g. if you have a large decimal number such as "1234567890", you could translate it into a base-36 number, which will be shorter.
Otherwise it looks like you are trying to invent a universal archiver.
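The base-36 suggestion above can be sketched in a few lines. This is only an illustration, assuming the input really is a string of decimal digits; the class and method names (Base36Demo, shorten, expand) are made up for the example. Note that leading zeros are not preserved ("007" comes back as "7"), so real data may need a length prefix or similar.

```java
import java.math.BigInteger;

public class Base36Demo {
    // Re-express a string of decimal digits in base 36 (digits 0-9a-z).
    static String shorten(String decimalDigits) {
        return new BigInteger(decimalDigits, 10).toString(36);
    }

    // Reverse step: parse the base-36 text and print it in decimal again.
    // Parsing accepts both upper and lower case letters.
    static String expand(String base36) {
        return new BigInteger(base36, 36).toString(10);
    }

    public static void main(String[] args) {
        String token = shorten("1234567890");
        System.out.println(token);                       // kf12oi
        System.out.println(expand(token.toUpperCase())); // 1234567890
    }
}
```

Here 10 digits shrink to 6 characters, and the token survives a change of case, which matches the case-insensitivity requirement discussed in the other answer.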
If you don't care about length, then yes, processing the text with an alphabet-based encoder (such as Base32) is the only choice.
Also, if the text is large enough, maybe you could save some space by gzipping it.
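A rough sketch of the gzip idea, under some assumptions: the compressed bytes are made URL-safe with the JDK's Base64 URL encoder (so the token is case-sensitive, unlike a Base32 token), and `readAllBytes` requires Java 9 or later. The class and method names are hypothetical. Gzip adds a fixed header and trailer, so short inputs usually come out longer, not shorter; it only pays off for larger or repetitive text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTokenDemo {
    // Gzip the UTF-8 bytes of the text, then encode them with the URL-safe Base64 alphabet.
    static String compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getUrlEncoder().withoutPadding().encodeToString(buffer.toByteArray());
    }

    // Reverse step: Base64-decode the token, then gunzip back to the original text.
    static String decompress(String token) {
        byte[] compressed = Base64.getUrlDecoder().decode(token);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) sb.append("alpha,beta,gamma;");
        String text = sb.toString();
        String token = compress(text);
        System.out.println(text.length() + " chars ==> " + token.length() + " chars");
        System.out.println(decompress(token).equals(text) ? "pass" : "fail");
    }
}
```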