What happens under the hood when bytes are converted to a String in Java?

半阙折子戏 2021-01-17 17:57

I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.         


        
4 Answers
  • 2021-01-17 18:36

    There is a line in the documentation of the constructor:

    This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.

    This is the culprit here: the byte -3 (0xFD) cannot appear in valid UTF-8, so the decoder substitutes the replacement string for it. By the way, if you are really interested, you can always download the sources for rt.jar and step through the constructor in a debugger.
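To make the documented behaviour concrete, here is a minimal sketch that contrasts the lenient `String` constructor with a strict `CharsetDecoder`. The class and method names are made up for illustration, and it assumes `StandardCharsets.UTF_8` in place of the `Charsets.` constant that is cut off in the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK or of the question's code.
class LenientVsStrictDecode {

    // Lenient path: new String(...) silently substitutes U+FFFD for malformed input,
    // so the bytes that come back can differ from the bytes that went in.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict path: a decoder configured with REPORT throws instead of substituting.
    static boolean isValidUtf8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        System.out.println(Arrays.toString(roundTrip(bytes))); // [1, 2, -17, -65, -67]
        System.out.println(isValidUtf8(bytes));                // false
    }
}
```

If the bytes are arbitrary binary data rather than text, a strict check like `isValidUtf8` is one way to find out before the replacement quietly changes them.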

  • 2021-01-17 18:54

    The encoded values you are getting, [-17, -65, -67], correspond to the Unicode code point U+FFFD. If you look up that code point, the Unicode specification tells you that U+FFFD is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode." And as others have pointed out, -3 without any follow-up code units is broken UTF-8, so this replacement character is appropriate.
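The correspondence between U+FFFD and those three bytes can be checked directly. This is a small sketch with a hypothetical helper, `utf8Of`, that encodes a single code point as UTF-8:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class ReplacementCharBytes {

    // Encodes one Unicode code point as UTF-8 and returns the (signed) Java bytes.
    static byte[] utf8Of(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+FFFD REPLACEMENT CHARACTER
        System.out.println(Arrays.toString(utf8Of(0xFFFD))); // [-17, -65, -67]
    }
}
```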

  • 2021-01-17 18:58

    In Java, byte is signed, so byte values above 127 (0x80 to 0xFF) show up as negative numbers. The ones you used (-3 = 0xFD, -32 = 0xE0) do not form valid UTF-8 sequences, so each is converted to the Unicode code point U+FFFD REPLACEMENT CHARACTER, which is then converted back to UTF-8 as 0xEF = -17, 0xBF = -65, 0xBD = -67.

    You cannot expect that random byte values are correctly interpreted as UTF-8 text.
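The mapping between Java's signed bytes and the hex values quoted above is just a mask with `0xFF`. A minimal sketch, with hypothetical helper names:

```java
// Hypothetical demo class for illustration only.
class SignedByteView {

    // Reinterprets a signed Java byte as its unsigned value in the range 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Formats a byte the way the answer writes it, e.g. -3 becomes "0xFD".
    static String toHex(byte b) {
        return String.format("0x%02X", b & 0xFF);
    }

    public static void main(String[] args) {
        byte[] samples = {-3, -32, -17, -65, -67};
        for (byte b : samples) {
            // Prints one line per byte, e.g. "-3 = 0xFD"
            System.out.println(b + " = " + toHex(b));
        }
    }
}
```

`Byte.toUnsignedInt(b)` does the same conversion as `toUnsigned` on Java 8 and later.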

  • 2021-01-17 19:03

    Not all sequences of bytes are valid in UTF-8.

    UTF-8 is a smart scheme with a variable number of bytes per code point: the bit pattern of each byte tells you whether it starts a sequence or continues one, and a start byte also tells you how many other bytes follow for the same code point.

    Refer to this table:

    [Image: table of UTF-8 byte patterns by sequence length]

    Now let's see how it applies to your {1, 2, -3}:

    Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

    Byte -3 (hex 0xFD, binary 11111101) has the form of the start byte of a 6-byte sequence (which is actually illegal in the current UTF-8 standard), but your byte array does not contain the continuation bytes such a sequence would need.

    Your UTF-8 is invalid. The Java UTF-8 decoder replaces this invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, code point U+FFFD is hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), represented in Java as -17, -65, -67.
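The byte-by-byte reading above can be sketched as a small classifier that looks only at the leading bits of each byte. This is an illustration of the bit patterns, not how the JDK decoder is implemented: the class and method names are hypothetical, and it deliberately does not check for overlong forms or other validity rules.

```java
// Hypothetical demo class for illustration only.
class Utf8LeadByte {

    // Returns the sequence length implied by a byte's leading bits:
    //   1..4  start byte of a sequence of that length
    //   5, 6  start-byte forms from the original UTF-8 design, illegal in the current standard
    //   0     continuation byte (10xxxxxx), never valid at the start of a sequence
    //  -1     0xFE or 0xFF, which match no pattern at all
    static int sequenceLength(byte b) {
        int u = b & 0xFF;                  // view the signed byte as 0..255
        if ((u & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((u & 0xC0) == 0x80) return 0;  // 10xxxxxx
        if ((u & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((u & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((u & 0xF8) == 0xF0) return 4;  // 11110xxx
        if ((u & 0xFC) == 0xF8) return 5;  // 111110xx
        if ((u & 0xFE) == 0xFC) return 6;  // 1111110x
        return -1;
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        for (byte b : bytes) {
            System.out.println(b + " -> " + sequenceLength(b));
        }
        // 1 and 2 report length 1; -3 reports length 6, but no continuation bytes follow it.
    }
}
```

Applied to the replacement bytes, -17 (0xEF) reports length 3 and -65 and -67 report 0, which matches a well-formed 3-byte sequence.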
