What is the most efficient way to encode an arbitrary GUID into readable ASCII (33-127)?

后端 未结 8 643
自闭症患者
自闭症患者 2020-11-30 22:08

The standard string representation of GUID takes about 36 characters. Which is very nice, but also really wasteful. I am wondering, how to encode it in the shortest possible

相关标签:
8条回答
  • 2020-11-30 22:20

    Assuming that all of your GUIDs are being generated by the same algorithm, you can save 4 bits by not encoding the algorithm nibble, before applying any other encoding :-|

    0 讨论(0)
  • 2020-11-30 22:21

    You have 95 characters available -- so, more than 6 bits, but not quite as many as 7 (about 6.57 actually). You could use 128/log2(95) = about 19.48 characters, to encode into 20 characters. If saving 2 characters in the encoded form is worth the loss of readability to you, something like (pseudocode):

    char encoded[21];
    long long guid;    // 128 bits number
    
    for(int i=0; i<20; ++i) {
      encoded[i] = chr(guid % 95 + 33);
      guid /= 95;
    }
    encoded[20] = chr(0);
    

    which is basically the generic "encode a number in some base" code, except that there's no need to reverse the "digits" since the order's arbitrary anyway (and little-endian is more direct and natural). To get back the guid from the encoded string is, in a very similar way, the polynomial computation in base 95 (after subtracting 33 from each digit of course):

    guid = 0;
    
    for(int i=0; i<20; ++i) {
      guid *= 95;
      guid += ord(encoded[i]) - 33;
    }
    

    essentially using Horner's approach to polynomial evaluation.

    0 讨论(0)
  • 2020-11-30 22:26

    This is an old question, but I had to solve it in order for a system I was working on to be backward compatible.

    The exact requirement was for a client-generated identifier that would be written to the database and stored in a 20-character unique column. It never got shown to the user and was not indexed in any way.

    Since I couldn't eliminate the requirement, I really wanted to use a Guid (which is statistically unique) and if I could encode it losslessly into 20 characters, then it would be a good solution given the constraints.

    Ascii-85 allows you to encode 4 bytes of binary data into 5 bytes of Ascii data. So a 16 byte guid will just fit into 20 Ascii characters using this encoding scheme. A Guid can have 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.

    When writing to the database I had some concerns about truncation so no whitespace characters are included in the encoding. I also had issues with collation, which I addressed by excluding lowercase characters from the encoding. Also, it would only ever be passed through a paramaterized command, so any special SQL characters would be escaped automatically.

    I've included the C# code to perform Ascii-85 encoding and decoding in case it helps anyone out there. Obviously, depending on your usage you might need to choose a different character set as my constraints made me choose some unusual characters like 'ß' and 'Ø' - but that's the easy part:

    /// <summary>
    /// This code implements an encoding scheme that uses 85 printable ascii characters 
    /// to encode the same volume of information as contained in a Guid.
    /// 
    /// Ascii-85 can represent 4 binary bytes as 5 Ascii bytes. So a 16 byte Guid can be 
    /// represented in 20 Ascii bytes. A Guid can have 
    /// 3.1962657931507848761677563491821e+38 discrete values whereas 20 characters of 
    /// Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.
    /// 
    /// Lower-case characters are not included in this encoding to avoid collation 
    /// issues. 
    /// This is a departure from standard Ascii-85 which does include lower case 
    /// characters.
    /// In addition, no whitespace characters are included as these may be truncated in 
    /// the database depending on the storage mechanism - ie VARCHAR vs CHAR.
    /// </summary>
    internal static class Ascii85
    {
        /// <summary>
        /// 85 printable ascii characters with no lower case ones, so database 
        /// collation can't bite us. No ' ' character either so database can't 
        /// truncate it!
        /// Unfortunately, these limitation mean resorting to some strange 
        /// characters like 'Æ' but we won't ever have to type these, so it's ok.
        /// </summary>
        private static readonly char[] kEncodeMap = new[]
        { 
            '0','1','2','3','4','5','6','7','8','9',  // 10
            'A','B','C','D','E','F','G','H','I','J',  // 20
            'K','L','M','N','O','P','Q','R','S','T',  // 30
            'U','V','W','X','Y','Z','|','}','~','{',  // 40
            '!','"','#','$','%','&','\'','(',')','`', // 50
            '*','+',',','-','.','/','[','\\',']','^', // 60
            ':',';','<','=','>','?','@','_','¼','½',  // 70
            '¾','ß','Ç','Ð','€','«','»','¿','•','Ø',  // 80
            '£','†','‡','§','¥'                       // 85
        };
    
        /// <summary>
        /// A reverse mapping of the <see cref="kEncodeMap"/> array for decoding 
        /// purposes.
        /// </summary>
        private static readonly IDictionary<char, byte> kDecodeMap;
    
        /// <summary>
        /// Initialises the <see cref="kDecodeMap"/>.
        /// </summary>
        static Ascii85()
        {
            kDecodeMap = new Dictionary<char, byte>();
    
            for (byte i = 0; i < kEncodeMap.Length; i++)
            {
                kDecodeMap.Add(kEncodeMap[i], i);
            }
        }
    
        /// <summary>
        /// Decodes an Ascii-85 encoded Guid.
        /// </summary>
        /// <param name="ascii85Encoding">The Guid encoded using Ascii-85.</param>
        /// <returns>A Guid decoded from the parameter.</returns>
        public static Guid Decode(string ascii85Encoding)
        { 
            // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
            // Since a Guid is 16 bytes long, the Ascii-85 encoding should be 20
            // characters long.
            if(ascii85Encoding.Length != 20)
            {
                throw new ArgumentException(
                    "An encoded Guid should be 20 characters long.", 
                    "ascii85Encoding");
            }
    
            // We only support upper case characters.
            ascii85Encoding = ascii85Encoding.ToUpper();
    
            // Split the string in half and decode each substring separately.
            var higher = ascii85Encoding.Substring(0, 10).AsciiDecode();
            var lower = ascii85Encoding.Substring(10, 10).AsciiDecode();
    
            // Convert the decoded substrings into an array of 16-bytes.
            var byteArray = new[]
            {
                (byte)((higher & 0xFF00000000000000) >> 56),        
                (byte)((higher & 0x00FF000000000000) >> 48),        
                (byte)((higher & 0x0000FF0000000000) >> 40),        
                (byte)((higher & 0x000000FF00000000) >> 32),        
                (byte)((higher & 0x00000000FF000000) >> 24),        
                (byte)((higher & 0x0000000000FF0000) >> 16),        
                (byte)((higher & 0x000000000000FF00) >> 8),         
                (byte)((higher & 0x00000000000000FF)),  
                (byte)((lower  & 0xFF00000000000000) >> 56),        
                (byte)((lower  & 0x00FF000000000000) >> 48),        
                (byte)((lower  & 0x0000FF0000000000) >> 40),        
                (byte)((lower  & 0x000000FF00000000) >> 32),        
                (byte)((lower  & 0x00000000FF000000) >> 24),        
                (byte)((lower  & 0x0000000000FF0000) >> 16),        
                (byte)((lower  & 0x000000000000FF00) >> 8),         
                (byte)((lower  & 0x00000000000000FF)),  
            };
    
            return new Guid(byteArray);
        }
    
        /// <summary>
        /// Encodes binary data into a plaintext Ascii-85 format string.
        /// </summary>
        /// <param name="guid">The Guid to encode.</param>
        /// <returns>Ascii-85 encoded string</returns>
        public static string Encode(Guid guid)
        {
            // Convert the 128-bit Guid into two 64-bit parts.
            var byteArray = guid.ToByteArray();
            var higher = 
                ((UInt64)byteArray[0] << 56) | ((UInt64)byteArray[1] << 48) | 
                ((UInt64)byteArray[2] << 40) | ((UInt64)byteArray[3] << 32) |
                ((UInt64)byteArray[4] << 24) | ((UInt64)byteArray[5] << 16) | 
                ((UInt64)byteArray[6] << 8)  | byteArray[7];
    
            var lower = 
                ((UInt64)byteArray[ 8] << 56) | ((UInt64)byteArray[ 9] << 48) | 
                ((UInt64)byteArray[10] << 40) | ((UInt64)byteArray[11] << 32) |
                ((UInt64)byteArray[12] << 24) | ((UInt64)byteArray[13] << 16) | 
                ((UInt64)byteArray[14] << 8)  | byteArray[15];
    
            var encodedStringBuilder = new StringBuilder();
    
            // Encode each part into an ascii-85 encoded string.
            encodedStringBuilder.AsciiEncode(higher);
            encodedStringBuilder.AsciiEncode(lower);
    
            return encodedStringBuilder.ToString();
        }
    
        /// <summary>
        /// Encodes the given integer using Ascii-85.
        /// </summary>
        /// <param name="encodedStringBuilder">The <see cref="StringBuilder"/> to 
        /// append the results to.</param>
        /// <param name="part">The integer to encode.</param>
        private static void AsciiEncode(
            this StringBuilder encodedStringBuilder, UInt64 part)
        {
            // Nb, the most significant digits in our encoded character will 
            // be the right-most characters.
            var charCount = (UInt32)kEncodeMap.Length;
    
            // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
            // Since a UInt64 is 8 bytes long, the Ascii-85 encoding should be 
            // 10 characters long.
            for (var i = 0; i < 10; i++)
            {
                // Get the remainder when dividing by the base.
                var remainder = part % charCount;
    
                // Divide by the base.
                part /= charCount;
    
                // Add the appropriate character for the current value (0-84).
                encodedStringBuilder.Append(kEncodeMap[remainder]);
            }
        }
    
        /// <summary>
        /// Decodes the given string from Ascii-85 to an integer.
        /// </summary>
        /// <param name="ascii85EncodedString">Decodes a 10 character Ascii-85 
        /// encoded string.</param>
        /// <returns>The integer representation of the parameter.</returns>
        private static UInt64 AsciiDecode(this string ascii85EncodedString)
        {
            if (ascii85EncodedString.Length != 10)
            {
                throw new ArgumentException(
                    "An Ascii-85 encoded Uint64 should be 10 characters long.", 
                    "ascii85EncodedString");
            }
    
            // Nb, the most significant digits in our encoded character 
            // will be the right-most characters.
            var charCount = (UInt32)kEncodeMap.Length;
            UInt64 result = 0;
    
            // Starting with the right-most (most-significant) character, 
            // iterate through the encoded string and decode.
            for (var i = ascii85EncodedString.Length - 1; i >= 0; i--)
            {
                // Multiply the current decoded value by the base.
                result *= charCount;
    
                // Add the integer value for that encoded character.
                result += kDecodeMap[ascii85EncodedString[i]];
            }
    
            return result;
        }
    }
    

    Also, here are the unit tests. They aren't as thorough as I'd like, and I don't like the non-determinism of where Guid.NewGuid() is used, but they should get you started:

    /// <summary>
    /// Tests to verify that the Ascii-85 encoding is functioning as expected.
    /// </summary>
    [TestClass]
    [UsedImplicitly]
    public class Ascii85Tests
    {
        [TestMethod]
        [Description("Ensure that the Ascii-85 encoding is correct.")]
        [UsedImplicitly]
        public void CanEncodeAndDecodeAGuidUsingAscii85()
        {
            var guidStrings = new[]
            {
                "00000000-0000-0000-0000-000000000000",
                "00000000-0000-0000-0000-0000000000FF",
                "00000000-0000-0000-0000-00000000FF00",
                "00000000-0000-0000-0000-000000FF0000",
                "00000000-0000-0000-0000-0000FF000000",
                "00000000-0000-0000-0000-00FF00000000",
                "00000000-0000-0000-0000-FF0000000000",
                "00000000-0000-0000-00FF-000000000000",
                "00000000-0000-0000-FF00-000000000000",
                "00000000-0000-00FF-0000-000000000000",
                "00000000-0000-FF00-0000-000000000000",
                "00000000-00FF-0000-0000-000000000000",
                "00000000-FF00-0000-0000-000000000000",
                "000000FF-0000-0000-0000-000000000000",
                "0000FF00-0000-0000-0000-000000000000",
                "00FF0000-0000-0000-0000-000000000000",
                "FF000000-0000-0000-0000-000000000000",
                "FF000000-0000-0000-0000-00000000FFFF",
                "00000000-0000-0000-0000-0000FFFF0000",
                "00000000-0000-0000-0000-FFFF00000000",
                "00000000-0000-0000-FFFF-000000000000",
                "00000000-0000-FFFF-0000-000000000000",
                "00000000-FFFF-0000-0000-000000000000",
                "0000FFFF-0000-0000-0000-000000000000",
                "FFFF0000-0000-0000-0000-000000000000",
                "00000000-0000-0000-0000-0000FFFFFFFF",
                "00000000-0000-0000-FFFF-FFFF00000000",
                "00000000-FFFF-FFFF-0000-000000000000",
                "FFFFFFFF-0000-0000-0000-000000000000",
                "00000000-0000-0000-FFFF-FFFFFFFFFFFF",
                "FFFFFFFF-FFFF-FFFF-0000-000000000000",
                "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF",
                "1000000F-100F-100F-100F-10000000000F"
            };
    
            foreach (var guidString in guidStrings)
            {
                var guid = new Guid(guidString);
                var encoded = Ascii85.Encode(guid);
    
                Assert.AreEqual(
                    20, 
                    encoded.Length, 
                    "A guid encoding should not exceed 20 characters.");
    
                var decoded = Ascii85.Decode(encoded);
    
                Assert.AreEqual(
                    guid, 
                    decoded, 
                    "The guids are different after being encoded and decoded.");
            }
        }
    
        [TestMethod]
        [Description(
            "The Ascii-85 encoding is not susceptible to changes in character case.")]
        [UsedImplicitly]
        public void Ascii85IsCaseInsensitive()
        {
            const int kCount = 50;
    
            for (var i = 0; i < kCount; i++)
            {
                var guid = Guid.NewGuid();
    
                // The encoding should be all upper case. A reliance 
                // on mixed case will make the generated string 
                // vulnerable to sql collation.
                var encoded = Ascii85.Encode(guid);
    
                Assert.AreEqual(
                    encoded, 
                    encoded.ToUpper(), 
                    "The Ascii-85 encoding should produce only uppercase characters.");
            }
        }
    }
    

    I hope this saves somebody some trouble.

    Also, if you find any bugs then let me know ;-)

    0 讨论(0)
  • 2020-11-30 22:33

    Using the full range from 33 (what's wrong wirh space, incidentally?) to 127 gives you 95 possible characters. Expressing the 2^128 possible values of guid in base 95 will use 20 characters. This (modulo things like dropping nybbles that will be constant) is the best you can do. Save yourself the trouble - use base 64.

    0 讨论(0)
  • 2020-11-30 22:34

    Simply go Base64.

    0 讨论(0)
  • 2020-11-30 22:35

    Use Base 85. See section 4.1. Why 85? of A Compact Representation of IPv6 Addresses

    An IPv6 address, like a GUID is made up of eight 16-bit pieces.

    0 讨论(0)
提交回复
热议问题