Compress 21 Alphanumeric Characters into 16 Bytes

耶瑟儿~ 2020-12-16 15:07

I'm trying to take 21 bytes of data which uniquely identifies a trade and store it in a 16 byte char array. I'm having trouble coming up with the right algor

8 Answers
  • 2020-12-16 15:15

    You can do this in about 15 bytes (14 bytes and 6 bits).

    Each character of trade_num_ is ASCII, so storing it in 7 bits instead of 8 saves 1 bit per character, or 18 bits in total.

    • That frees 2 bytes and 2 bits; you need to free 5 bytes.

    Now take the number. Each of its 3 characters is one of ten values (0 to 9), so 4 bits per digit are enough. The number then takes 12 bits (1 byte and 4 bits) instead of 24, which saves half of its space.

    • Now you have 3 bytes and 6 bits free; you need to free 5 bytes.

    If trade_num_ only uses qwertyuioplkjhgfdsazxcvbnmQWERTYUIOPLKJHGFDSAZXCVBNM1234567890[] (64 characters), you can store each character in 6 bits. That saves another 18 bits (2 bytes and 2 bits).

    • Now you have 6 bytes free, so the string fits in 15 bytes + null termination = 16 bytes.

    And if you store the number as a 10-bit integer instead, everything fits into 14 bytes and 6 bits (18*6 + 10 = 118 bits).
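    The 6-bit variant above can be sketched roughly as follows. This is only an illustration: it assumes trade_num_ really is limited to the 64 characters listed in the answer, and the names ALPHABET, pack and unpack are made up for this example.

```python
# Hypothetical 64-symbol alphabet (the set listed in the answer above).
# Whether real trade numbers stay inside it is an assumption.
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789[]"

def pack(trade_num, broker):
    """Pack 18 six-bit symbols and a 10-bit number (0..999) into 15 bytes."""
    assert len(trade_num) == 18 and 0 <= broker <= 999
    bits = 0
    for ch in trade_num:
        bits = (bits << 6) | ALPHABET.index(ch)   # 6 bits per character
    bits = (bits << 10) | broker                  # 10 bits for the number
    # 18*6 + 10 = 118 bits, which rounds up to 15 bytes
    return bits.to_bytes(15, "big")

def unpack(data):
    """Reverse of pack: returns (trade_num, broker)."""
    bits = int.from_bytes(data, "big")
    broker = bits & 0x3FF                         # low 10 bits
    bits >>= 10
    chars = []
    for _ in range(18):
        chars.append(ALPHABET[bits & 0x3F])       # last character comes out first
        bits >>= 6
    return "".join(reversed(chars)), broker

packed = pack("abcdefghijklmnopqr", 123)
print(len(packed), unpack(packed))
```

    The packing goes through a single Python integer for brevity; in C the same layout would be built with shifts and masks on the output bytes.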

  • 2020-12-16 15:18

    That makes (18*7+10)=136 bits, or 17 bytes. You wrote that trade_num is alphanumeric? If that means the usual [a-zA-Z0-9_] set of characters, then you'd need only 6 bits per character, so the whole thing takes (18*6+10)=118 bits, which rounds up to 15 bytes.

    Assuming 8 bit = 1 byte

    Or, coming from another direction: you have 128 bits for storage and need ~10 bits for the number part, so 118 are left for the trade_num. With 18 characters that is 118/18 = 6.555 bits per character, which means you only have space to encode 2^6.555 ≈ 94 different characters, unless there is a hidden structure in trade_num that we could exploit to save more bits.

  • 2020-12-16 15:21

    This is something that should work, assuming you need only characters from allowedchars, and there is at most 94 characters there. This is python, but it is written trying not to use fancy shortcuts--so that you'll be able to translate it to your destination language easier. It assumes however that the number variable may contain integers up to 2**128--in C++ you should use some kind of big number class.

    # Python 3
    allowedchars = ' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}'
    alphabase = len(allowedchars)

    def compress(code):
        alphanumeric = code[0:18]
        number = int(code[18:21])

        for character in alphanumeric:
            # find returns the index of character in allowedchars
            # (-1 if the character is not allowed, which would corrupt the result)
            number = alphabase*number + allowedchars.find(character)

        # emit 16 bytes, least significant byte first
        compressed = bytearray()
        for i in range(16):
            compressed.append(number % 256)
            number = number // 256

        return bytes(compressed)

    def decompress(compressed):
        number = 0

        # rebuild the integer, starting from the most significant byte
        for byte in reversed(compressed):
            number = 256*number + byte

        alphanumeric = ''
        for i in range(18):
            alphanumeric = allowedchars[number % alphabase] + alphanumeric
            number = number // alphabase

        # make a string padded with zeros
        number = '%03d' % number

        return alphanumeric + number
    
  • 2020-12-16 15:22

    Key questions are:

    There appears to be some contradiction in your post about whether the trade number is 16 or 18 characters. You need to clear that up. You say the total is 21 consisting of 16+3. :-(

    You say the trade num characters are in the range 0x00-0x7f. Can they really be any character in that range, including tab, new line, control-C, etc? Or are they limited to printable characters, or maybe even to alphanumerics?

    Does the output 16 bytes have to be printable characters, or is it basically a binary number?

    EDIT, after updates to original post:

    In that case, if the output can be any character in the character set, it's possible. If it can only be printable characters, it's not.

    Demonstration of the mathematical possibility is straightforward enough. There are 94 possible values for each of 18 characters, and 10 possible values for each of 3. Total number of possible combinations = 94^18 * 10^3 ~= 3.28e38. This requires 128 bits: 2^127 ~= 1.70e38, which is too small, while 2^128 ~= 3.40e38, which is big enough. 128 bits is 16 bytes, so it will just barely fit if we can use every possible bit combination.
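    The count can be checked with exact integer arithmetic. A minimal sketch, assuming the 94-character alphabet and 3 decimal digits described above (the variable names are only for illustration):

```python
# Python integers have arbitrary precision, so this is exact.
combinations = 94**18 * 10**3        # 18 characters with 94 values, 3 decimal digits

largest_id = combinations - 1        # IDs are numbered 0 .. combinations - 1
bits_needed = largest_id.bit_length()

print(bits_needed)                           # bits needed for the largest ID
print(2**127 < combinations <= 2**128)       # too big for 127 bits, fits in 128
```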

    Given the tight fit, I think the most practical way to generate the value is to think of it as a double-long number, and then run the input through an algorithm to generate a unique integer for every possible input.

    Conceptually, then, let's imagine we had a "huge integer" data type that is 16 bytes long. The algorithm would be something like this:

    huge out = 0;
    for (int p=0;p<18;++p)
    {
      out=out*94+tradenum[p]-32;
    }
    for (int p=0;p<3;++p)
    {
      out=out*10+broker[p]-'0';
    }
    
    // Convert output to char[16], most significant byte first
    unsigned char out16[16];
    for (int p=15;p>=0;--p)
    {
      out16[p]=out&0xff;
      out=out>>8;
    }
    
    // Conceptual only: real C would copy out16 into a caller-supplied buffer
    return out16;
    

    Of course we don't have a "huge" data type in C. Are you using pure C or C++? Isn't there some kind of big number class in C++? Sorry, I haven't done C++ in a while. If not, we could easily create a little library to implement a huge.
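    As a rough idea of what such a little "huge" library needs, the only operation the algorithm uses is "multiply by a small number and add a small number", which can be done directly on a 16-byte array with a carry. The sketch below is in Python for brevity; mul_add and encode are hypothetical names, and the loop translates almost line by line to C with an unsigned int accumulator.

```python
def mul_add(acc, factor, addend):
    """In place: acc = acc * factor + addend.

    acc is a list of 16 byte values, most significant byte first.
    factor and addend must be small (below 256 here). Returns the final
    carry, which is nonzero only if the result overflowed 128 bits.
    """
    carry = addend
    for i in range(len(acc) - 1, -1, -1):    # start at the least significant byte
        value = acc[i] * factor + carry
        acc[i] = value & 0xFF
        carry = value >> 8
    return carry

def encode(tradenum, broker):
    """tradenum: 18 characters in 0x20..0x7d, broker: 3 decimal digits."""
    out = [0] * 16
    for ch in tradenum:
        overflow = mul_add(out, 94, ord(ch) - 32)
        assert overflow == 0
    for ch in broker:
        overflow = mul_add(out, 10, ord(ch) - ord("0"))
        assert overflow == 0
    return bytes(out)

print(encode("abcdefghijklmnopqr", "123").hex())
```

    Decoding would need the matching "divide by a small number and return the remainder" helper, which walks the array in the opposite direction.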

  • 2020-12-16 15:25

    There are 95 characters between the space (0x20) and tilde (0x7e), inclusive. (The 94 in other answers suffers from an off-by-one error.)

    Hence the number of distinct IDs is 95^18 × 1000 = 3.97×10^38.

    But that compressed structure can only hold (2^8)^16 = 3.40×10^38 distinct values.

    Therefore it is impossible to represent all IDs by that structure, unless:

    • There is 1 unused character in ≥15 digits of trade_num_, or
    • There are ≥14 unused characters in 1 digit of trade_num_, or
    • There are only ≤856 brokers, or
    • You're using a PDP-10, which has a 9-bit char.
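    The thresholds in this list can be reproduced with exact integer arithmetic. A small sketch, assuming 18 character positions with 95 candidates each and 1000 brokers as the starting point (CAPACITY, fits, k, m and brokers are names chosen for this example):

```python
CAPACITY = 2**128                    # distinct values of a 16-byte structure

def fits(count):
    return count <= CAPACITY

full_space_fits = fits(95**18 * 1000)    # all 95 characters, all 1000 brokers

# Smallest number of positions that must drop one character (94 instead of 95)
k = next(k for k in range(19) if fits(94**k * 95**(18 - k) * 1000))

# Smallest number of characters a single position must drop
m = next(m for m in range(95) if fits((95 - m) * 95**17 * 1000))

# Largest number of brokers when every position keeps all 95 characters
brokers = CAPACITY // 95**18

print(full_space_fits, k, m, brokers)
```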
  • 2020-12-16 15:28

    If you have 18 characters in the range 0 - 127 and a number in the range 0 - 999 and compact this as much as possible then it will require 17 bytes.

    >>> import math
    >>> math.log(128**18 * 1000, 256)
    16.995723035582763
    

    You may be able to take advantage of the fact that some characters are most likely not used. In particular it is unlikely that there are any characters below value 32, and 127 is also probably not used. If you can find one more unused character, you can first convert the characters into base 94 and then pack them into the bytes as closely as possible.

    >>> math.log(94**18 * 1000, 256)
    15.993547951857446
    

    This just fits into 16 bytes!


    Example code

    Here is some example code written in Python (but written in a very imperative style so that it can easily be understood by non-Python programmers). I'm assuming that there are no tildes (~) in the input. If there are you should substitute them with another character before encoding the string.

    def encodeChar(c):
        # map ' ' (0x20) .. '}' (0x7d) to 0 .. 93
        return ord(c) - 32
    
    def encode(s, n):
        t = 0
        for c in s:
            t = t * 94 + encodeChar(c)
        t = t * 1000 + n
    
        # split into 16 bytes, least significant byte first
        r = []
        for i in range(16):
            r.append(t % 256)
            t //= 256    # integer division (Python 3)
    
        return r
    
    print(encode('                  ', 0))    # smallest possible value
    print(encode('abcdefghijklmnopqr', 123))
    print(encode('}}}}}}}}}}}}}}}}}}', 999))  # largest possible value
    

    Output:

    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]
    [ 59, 118, 192, 166, 108,  50, 131, 135, 174,  93,  87, 215, 177,  56, 170, 172]
    [255, 255, 159, 243, 182, 100,  36, 102, 214, 109, 171,  77, 211, 183,   0, 247]
    

    This algorithm uses Python's ability to handle very large numbers. To convert this code to C++ you could use a big integer library.

    You will of course need an equivalent decoding function, the principle is the same - the operations are performed in reverse order.
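    A possible decoder, sketched under the same assumptions as the encoder (characters from space to '}', a number from 0 to 999, bytes stored least significant first). The encoder is repeated in compact form only so that the example runs on its own; decode is a name chosen for this sketch.

```python
def encode(s, n):
    """Same scheme as above: base-94 characters first, then the number."""
    t = 0
    for c in s:
        t = t * 94 + (ord(c) - 32)
    t = t * 1000 + n
    return [(t >> (8 * i)) & 0xFF for i in range(16)]   # least significant byte first

def decode(r):
    """Undo encode: returns (s, n)."""
    t = 0
    for byte in reversed(r):      # rebuild the integer, most significant byte first
        t = t * 256 + byte
    n = t % 1000                  # the number was multiplied in last, so it comes out first
    t //= 1000
    chars = []
    for _ in range(18):           # characters come out in reverse order
        chars.append(chr(t % 94 + 32))
        t //= 94
    return "".join(reversed(chars)), n

print(decode(encode('abcdefghijklmnopqr', 123)))
```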
