Binary Data in JSON String. Something better than Base64

一向 2020-11-21 23:03

The JSON format natively doesn't support binary data. The binary data has to be escaped so that it can be placed into a string element (i.e. zero or more Unicode chars in double quotes using backslash escapes) in JSON.

15 answers
  • 2020-11-21 23:50

    The problem with UTF-8 is that it is not the most space-efficient encoding. Also, some random binary byte sequences are invalid UTF-8, so you can't just interpret a random binary byte sequence as UTF-8 data. The benefit of this constraint on the UTF-8 encoding is that it makes it robust and makes it possible to locate the start and end of multi-byte chars no matter which byte we start looking at.

    As a consequence, while encoding a byte value in the range [0..127] needs only one byte in UTF-8, encoding a byte value in the range [128..255] requires two bytes! Worse than that, in JSON the control chars, " and \ are not allowed to appear in a string, so the binary data would require some transformation to be properly encoded.

    Let's see. If we assume uniformly distributed random byte values in our binary data, then, on average, half of the bytes would be encoded in one byte and the other half in two bytes. The UTF-8 encoded binary data would have 150% of the initial size.

    Base64 encoding grows only to 133% of the initial size. So Base64 encoding is more efficient.
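    Here is a minimal sketch of both figures (plain Python standard library, with an arbitrary 100 kB random payload): the byte-to-code-point embedding lands near 150%, and base64 near 133%. Note that this measures the raw UTF-8 size, before any JSON escaping.

        import base64, os

        data = os.urandom(100_000)  # uniformly distributed random bytes

        # Embed each byte value as the Unicode code point U+0000..U+00FF and UTF-8 encode it.
        as_codepoints = data.decode('latin-1').encode('utf-8')
        as_base64 = base64.b64encode(data)

        print(f"UTF-8 code points: {100 * len(as_codepoints) / len(data):.0f}% of original")  # ~150%
        print(f"base64:            {100 * len(as_base64) / len(data):.0f}% of original")      # ~133%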

    What about using another base encoding? In UTF-8, encoding the 128 ASCII values is the most space efficient. In 8 bits you can store 7 bits. So if we cut the binary data into 7-bit chunks to store them in each byte of a UTF-8 encoded string, the encoded data would grow only to 114% of the initial size. Better than Base64. Unfortunately we can't use this easy trick because JSON doesn't allow some ASCII chars. The 33 control characters of ASCII ([0..31] and 127) plus the " and \ must be excluded. This leaves us only 128 - 35 = 93 chars.

    So in theory we could define a Base93 encoding which would grow the encoded size to 8/log2(93) = 8*log10(2)/log10(93) = 122%. But a Base93 encoding would not be as convenient as a Base64 encoding. Base64 requires cutting the input byte sequence into 6-bit chunks, for which simple bitwise operations work well. Besides, 133% is not much more than 122%.
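    For a quick check of these numbers, a short Python snippet (picking Base64, Base85 and Base93 as examples) evaluates the same 8/log2(N) formula:

        import math

        # Theoretical encoded size of an N-symbol encoding relative to the raw binary,
        # using the 8 / log2(N) formula above.
        for n in (64, 85, 93):
            print(f"Base{n}: {100 * 8 / math.log2(n):.0f}% of original size")
        # Base64: 133%, Base85: 125%, Base93: 122%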

    This is why I came independently to the common conclusion that Base64 is indeed the best choice to encode binary data in JSON. My answer presents a justification for it. I agree it isn't very attractive from the performance point of view, but consider also the benefit of using JSON, with its human-readable string representation, easy to manipulate in all programming languages.

    If performance is critical, then a pure binary encoding should be considered as a replacement for JSON. But with JSON, my conclusion is that Base64 is the best.

  • 2020-11-21 23:51

    There are 94 Unicode characters which can be represented as one byte according to the JSON spec (if your JSON is transmitted as UTF-8). With that in mind, I think the best you can do space-wise is base85 which represents four bytes as five characters. However, this is only a 7% improvement over base64, it's more expensive to compute, and implementations are less common than for base64 so it's probably not a win.

    You could also simply map every input byte to the corresponding character in U+0000-U+00FF, then do the minimum encoding required by the JSON standard to pass those characters; the advantage here is that the required decoding is nil beyond builtin functions, but the space efficiency is bad -- a 105% expansion (if all input bytes are equally likely) vs. 25% for base85 or 33% for base64.
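    As a rough sketch of this comparison (Python standard library only; base64.b85encode is one particular base85 variant, and the exact figures depend on the input distribution):

        import base64, json, os

        data = os.urandom(100_000)  # uniformly distributed random bytes

        b64 = base64.b64encode(data)
        b85 = base64.b85encode(data)
        # Map every byte to U+0000..U+00FF and let the JSON encoder escape what it must.
        mapped = json.dumps(data.decode('latin-1'), ensure_ascii=False).encode('utf-8')

        for name, enc in (("base64", b64), ("base85", b85), ("byte-to-codepoint", mapped)):
            print(f"{name}: {100 * (len(enc) - len(data)) / len(data):.0f}% expansion")
        # roughly 33%, 25% and 105% for this input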

    Final verdict: base64 wins, in my opinion, on the grounds that it's common, easy, and not bad enough to warrant replacement.

    See also: Base91 and Base122

  • 2020-11-21 23:54

    While it is true that base64 has a ~33% expansion rate, it is not necessarily true that processing overhead is significantly more than this: it really depends on the JSON library/toolkit you are using. Encoding and decoding are simple, straightforward operations, and they can even be optimized with respect to character encoding (as JSON only supports UTF-8/16/32) -- base64 characters are always single-byte for JSON String entries. For example, on the Java platform there are libraries that can do the job rather efficiently, so that the overhead is mostly due to the expanded size.
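    As an illustration of how simple the round trip is (a Python sketch with a made-up "data" field name; any JSON library with a base64 codec works the same way):

        import base64, json

        payload = b"\x00\x01binary\xff\xfe"  # arbitrary binary blob

        # Encode on the way into the JSON document ...
        doc = json.dumps({"name": "blob.bin", "data": base64.b64encode(payload).decode('ascii')})

        # ... and decode on the way out.
        restored = base64.b64decode(json.loads(doc)["data"])
        assert restored == payload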

    I agree with two earlier answers:

    • base64 is a simple, commonly used standard, so you are unlikely to find something better specifically for use with JSON (base85 is used by PostScript, etc., but the benefits are at best marginal when you think about it)
    • compression before encoding (and decompression after decoding) may make a lot of sense, depending on the data you use (see the sketch below)
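    A minimal sketch of that last point (Python standard library, with a deliberately repetitive, hence compressible, payload; random data would not shrink):

        import base64, gzip, json

        payload = b"some log line that repeats a lot\n" * 2000  # compressible sample data

        plain = base64.b64encode(payload)
        packed = base64.b64encode(gzip.compress(payload))

        print(len(payload), len(plain), len(packed))  # packed is far smaller here
        doc = json.dumps({"data": packed.decode('ascii')})  # ship the compressed+encoded field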
  • 2020-11-21 23:54

    Smile format

    It's very fast to encode and decode, and it's compact.

    Speed comparison (Java-based but meaningful nevertheless): https://github.com/eishay/jvm-serializers/wiki/

    Also, it's an extension to JSON that allows you to skip base64 encoding for byte arrays.

    Smile-encoded data can be gzipped when space is critical.

  • 2020-11-21 23:57

    My current solution is XHR2 using an ArrayBuffer. The ArrayBuffer, as a binary sequence, contains multipart content: video, audio, graphics, text and so on, with multiple content types, all in one response.

    In modern browsers there are DataView, StringView and Blob for the different components. See also http://rolfrost.de/video.html for more details.

  • 2020-11-21 23:58

    If you are dealing with bandwidth problems, try compressing the data on the client side first, then base64 it.

    A nice example of such magic is at http://jszip.stuartk.co.uk/ and more discussion on this topic is at JavaScript implementation of Gzip.
