I'm currently writing an application for Windows Mobile which needs to be able to pick up key-value pairs (configuration settings) from 1D barcodes. The fewer barcodes that need to be scanned, the better. Sample input:
---------------------------
| Key | Value             |
---------------------------
| 12  | Söme UTF-8 Strîng |
| 9   | & another string  |
---------------------------
I thought of the following algorithm:
1. Concatenate the key-value pairs and encode the values with Base64
So we would get something like 12=U8O2bWUgVVRGLTggU3Ryw65uZw==&9=JiBhbm90aGVyIHN0cmluZw==
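For illustration, a minimal Python sketch of this step (the function name and the use of a dict are just for illustration):

```python
import base64

def encode_settings(settings: dict[int, str]) -> str:
    """Join key=base64(value) pairs with '&'; values are UTF-8 encoded first."""
    parts = []
    for key, value in settings.items():
        b64 = base64.b64encode(value.encode("utf-8")).decode("ascii")
        parts.append(f"{key}={b64}")
    return "&".join(parts)

print(encode_settings({12: "Söme UTF-8 Strîng", 9: "& another string"}))
# -> 12=U8O2bWUgVVRGLTggU3Ryw65uZw==&9=JiBhbm90aGVyIHN0cmluZw==
```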
2. Use Huffman coding to compress the data
I'd use a fixed Huffman tree for this, based on the following frequency assumptions:
-------------------------------------------
| Entries                      | Priority |
-------------------------------------------
| =, &                         | High     |
| 0-9                          | Medium   |
| 6-bit Base64 words (w/o 0-9) | Low      |
-------------------------------------------
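As a sketch, a fixed Huffman code could be built from assumed weights for these three classes; the concrete weights below are placeholders, not measured frequencies:

```python
import heapq
import itertools

def build_huffman_code(weights: dict[str, int]) -> dict[str, str]:
    """Return a prefix-free bit string per symbol, built with a min-heap."""
    counter = itertools.count()  # tie-breaker so heap tuples stay comparable
    heap = [(w, next(counter), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(counter), merged))
    return heap[0][2]

BASE64_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "abcdefghijklmnopqrstuvwxyz" "0123456789+/")

# Assumed relative weights: separators high, digits medium, everything else low.
weights = {"=": 40, "&": 40}
weights.update({d: 10 for d in "0123456789"})
weights.update({c: 1 for c in BASE64_ALPHABET if not c.isdigit()})

code = build_huffman_code(weights)
compressed = "".join(code[ch] for ch in "12=U8O2bWUg")  # a bit string like '0110...'
```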
3. Generate Code 128B barcodes from the encoded data
Apply Base96 encoding to the bit stream generated by the Huffman algorithm to get ASCII chars which can be used within a Code 128B barcode. Split the resulting string into multiple barcodes as required.
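A rough sketch of the Base96 step, treating the Huffman bit string as one big integer and mapping base-96 digits onto the printable ASCII range of Code 128B; framing details such as length prefixes are omitted:

```python
def bits_to_code128b(bits: str) -> str:
    # A guard bit is prefixed so leading zero bits survive the integer round trip.
    value = int("1" + bits, 2)
    chars = []
    while value:
        value, digit = divmod(value, 96)
        chars.append(chr(32 + digit))        # ASCII 32..127, usable in Code 128B
    return "".join(reversed(chars))

def code128b_to_bits(text: str) -> str:
    value = 0
    for ch in text:
        value = value * 96 + (ord(ch) - 32)
    return bin(value)[3:]                    # strip '0b' and the guard bit

payload = bits_to_code128b("010011101")
assert code128b_to_bits(payload) == "010011101"
```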
Coding these steps won't be a problem for me, but I would like some feedback on the efficiency and the design of the algorithm.
Questions
- Am I losing some potential for better compression/shorter strings somewhere?
- Is there a better way to compress the random UTF-8 encoded data?
- Should I embed a dynamic Huffman table into the encoded data?
- How can I take the compression of Code 128B into account (a 0 requires less space than a &)?
One simple method would be to map all 64 Base64 characters directly to Code 128 codewords. This would leave 30-40 Code 128 codewords unused. In the remaining slots, define some two-character sequences as single codewords: == =& 0= 1= 2= 3= 4= 5= 6= 7= 8= 9= &0 &1 &2 &3 &4 &5 &6 &7 &8 &9, plus special codes such as (repeat last character)=, =(double next character) and &(double next character).
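A sketch of the greedy digram substitution this suggests; the numeric codeword values and the omission of the escape codes are simplifications, not actual Code 128 assignments:

```python
# Two-character sequences mapped to spare codewords (here: ints >= 64).
DIGRAMS = ["==", "=&"] + [d + "=" for d in "0123456789"] + \
          ["&" + d for d in "0123456789"]
DIGRAM_CODES = {pair: 64 + i for i, pair in enumerate(DIGRAMS)}

def substitute_digrams(text: str) -> list[int]:
    """Replace known digrams by one codeword, everything else by its own code."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAM_CODES:
            out.append(DIGRAM_CODES[pair])
            i += 2
        else:
            out.append(ord(text[i]))         # placeholder for the 1:1 mapping
            i += 1
    return out

print(substitute_digrams("12=U8==&9=Ji"))
```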
After a lot of playing and fiddling around, we finally chose this approach:
1. Encode settings into a byte stream
Field values are serialized into a byte stream, with a header for each field. The header consumes one byte and contains the field's ID as well as some flags that help reduce the amount of data to transport. Depending on the type of the field (e.g. a string, a number or an IP address), the value is encoded efficiently into the byte stream. For example, an IP address is encoded in 4 bytes, whereas a boolean flag is encoded directly into the field header. This way we're able to encode even SSL certificates into the stream, if required. As the typical barcode formats are not able to transport arbitrary byte values, we need to encode the byte stream in the next step.
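A minimal sketch of such a field serializer; the bit layout (5-bit field ID, 3 flag bits) and the type handling are assumptions, not the exact format we used:

```python
import struct

def encode_field(field_id: int, value) -> bytes:
    """One-byte header (field ID + flags), then a type-dependent value encoding."""
    flags = 0
    if isinstance(value, bool):
        flags = 0b001 | (0b010 if value else 0)       # boolean lives in the header
        return bytes([(field_id << 3) | flags])
    header = bytes([(field_id << 3) | flags])
    if isinstance(value, int):                        # e.g. a numeric setting
        return header + struct.pack(">I", value)
    if isinstance(value, str) and value.count(".") == 3:   # naive IPv4 detection
        return header + bytes(int(p) for p in value.split("."))
    data = value.encode("utf-8")                      # generic string field
    return header + bytes([len(data)]) + data

stream = b"".join([
    encode_field(12, "Söme UTF-8 Strîng"),
    encode_field(9, True),
    encode_field(3, "192.168.0.1"),
])
```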
2. Convert to barcode format
The resulting byte array is now treated as one big integer and converted into the target barcode format using a base encoding and a charset (see this question). This way we use the barcode format efficiently to transport our data (in contrast to Base64 or other encodings). From the resulting string we can chunk off single barcodes and add some additional header information to them (e.g. how many barcodes have to be scanned? Is the data encrypted? ...).
When the barcodes are scanned on a mobile device, the encoded string can be restored and converted back into the same big integer. This integer can then be treated as a byte array that can be parsed once the field serialization format is known.
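A sketch of this big-integer round trip; CHARSET here is a stand-in for whatever the target barcode symbology actually supports:

```python
CHARSET = "".join(chr(c) for c in range(32, 128))     # e.g. the Code 128B range
BASE = len(CHARSET)

def bytes_to_barcode_text(data: bytes) -> str:
    # A leading 0x01 guard byte preserves leading zero bytes through the integer.
    value = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while value:
        value, d = divmod(value, BASE)
        digits.append(CHARSET[d])
    return "".join(reversed(digits))

def barcode_text_to_bytes(text: str) -> bytes:
    value = 0
    for ch in text:
        value = value * BASE + CHARSET.index(ch)
    raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return raw[1:]                                    # drop the guard byte

payload = bytes_to_barcode_text(b"\x00\x2a\xffhello")
assert barcode_text_to_bytes(payload) == b"\x00\x2a\xffhello"
```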
This approach turned out to be very efficient and fast (we had some concerns regarding the BigInteger implementation on CF).
While some barcode formats have a fixed set of characters they can represent and use the same amount of space to hold each character, others either use multiple character sets or use variable amounts of space to hold each character. For example, "classic" Code 39 defines 43 characters, each represented by one of 43 symbols, and simply can't represent any other characters, but there's another Code 39 variant which represents 39 common characters using one symbol and other characters using a two-character sequence.
Suppose, for example, one wanted to store a bunch of binary data in a Code 39 barcode. If one converted the data to Base64 format, the four characters associated with three octets of raw data would take an average of about 5.69 symbols to store [about 27 of the 64 characters used in Base64 take two symbols to store in Code 39]. If one instead chose 32 characters that can be represented by one symbol each, one could store 24 (or 25) bits using five symbols holding five bits each [a consistent 1.67 symbols per octet, versus an average of 1.89 and a worst case of 2.67]. If one were using "classic" Code 39 (which can represent 43 characters using one symbol each), one could even store four octets in six symbols [an average of 1.5 symbols per octet].
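The symbol counts quoted above can be reproduced with a quick calculation:

```python
# Symbols needed to store binary data in Code 39 under the three schemes above.
two_symbol_b64_chars = 27                     # Base64 chars needing 2 Code 39 symbols
avg_per_b64_char = 1 + two_symbol_b64_chars / 64
print(4 * avg_per_b64_char)                   # ~5.69 symbols per 3 octets via Base64
print(4 * avg_per_b64_char / 3)               # ~1.89 symbols per octet on average
print(4 * 2 / 3)                              # 2.67 worst case (all four chars doubled)
print(5 / 3)                                  # 1.67 with a 32-char, 5-bits-per-symbol set
print(6 / 4)                                  # 1.50 with classic Code 39: 4 octets in 6 symbols
```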
Different barcode formats are "optimized" for different character sets; some like Code 128 have multiple character sets, and may be used efficiently with data that uses the full range of one character set, while avoiding using characters outside it. I don't know of any particular recommended approaches for reformatting data so as to optimize the use of a particular symbology's character sets, but examining the encoding used by a symbology and your particular requirements should help you figure out what encoding will work best for your application.
Source: https://stackoverflow.com/questions/15269408/efficient-compression-and-representation-of-key-value-pairs-to-be-read-from-1d-b