What happens under the hood when bytes are converted to a String in Java?

半阙折子戏 2021-01-17 17:57

I have a problem when trying to convert bytes to String in Java, with code like:

byte[] bytes = {1, 2, -3};

byte[] transferred = new String(bytes, Charsets.         


        
4 Answers
  •  不思量自难忘°
    2021-01-17 19:03

    Not all sequences of bytes are valid in UTF-8.

    UTF-8 is a variable-length encoding: each code point takes one or more bytes, and the bit pattern of the first byte indicates how many continuation bytes follow for the same code point.

    Refer to this table:

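    First byte   Total bytes   Following bytes
    0xxxxxxx     1             none
    110xxxxx     2             1 x 10xxxxxx
    1110xxxx     3             2 x 10xxxxxx
    11110xxx     4             3 x 10xxxxxx
    111110xx     5             4 x 10xxxxxx (original design only, not allowed today)
    1111110x     6             5 x 10xxxxxx (original design only, not allowed today)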

    Now let's see how it applies to your {1, 2, -3}:

    Bytes 1 (hex 0x01, binary 00000001) and 2 (hex 0x02, binary 00000010) stand alone, no problem.

    Byte -3 (hex 0xFD, binary 11111101) has the form of the start byte of a 6-byte sequence (a form the current UTF-8 standard no longer allows), and your byte array does not contain the five continuation bytes that would have to follow it.
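To make the byte-by-byte reading above concrete, here is a minimal sketch that classifies a byte by its leading bits. The class and method names (`Utf8LeadByte`, `sequenceLength`) are made up for illustration, and the 5- and 6-byte cases follow the original UTF-8 design rather than what a current decoder accepts.

```java
public class Utf8LeadByte {
    // Hypothetical helper: sequence length implied by a first byte under the
    // original (up to 6-byte) UTF-8 design. Returns 0 for a continuation byte
    // and -1 for 0xFE/0xFF, which never appear in UTF-8.
    public static int sequenceLength(byte b) {
        int u = b & 0xFF;          // view the signed Java byte as 0..255
        if (u < 0x80) return 1;    // 0xxxxxxx: stands alone
        if (u < 0xC0) return 0;    // 10xxxxxx: continuation byte
        if (u < 0xE0) return 2;    // 110xxxxx
        if (u < 0xF0) return 3;    // 1110xxxx
        if (u < 0xF8) return 4;    // 11110xxx
        if (u < 0xFC) return 5;    // 111110xx (original design only)
        if (u < 0xFE) return 6;    // 1111110x (original design only)
        return -1;
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        for (byte b : bytes) {
            int u = b & 0xFF;
            // Left-pad the binary form to 8 digits
            String bits = String.format("%8s", Integer.toBinaryString(u)).replace(' ', '0');
            System.out.println(b + " = 0x" + String.format("%02X", u)
                    + " = " + bits + ", implied length " + sequenceLength(b));
        }
    }
}
```

For the array in the question, the first two bytes report length 1 and the last one reports length 6, which is why a decoder cannot make sense of it on its own.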

    So your byte array is not valid UTF-8. The Java UTF-8 decoder replaces the invalid byte -3 with the Unicode code point U+FFFD REPLACEMENT CHARACTER. In UTF-8, U+FFFD is encoded as hex 0xEF 0xBF 0xBD (binary 11101111 10111111 10111101), which Java's signed bytes show as -17, -65, -67.
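The replacement behaviour can be checked with a short round trip. This is a sketch under two assumptions: the question's code is cut off, so it uses the JDK's `StandardCharsets.UTF_8` rather than whatever `Charsets` constant the asker had, and it assumes the string was encoded back to bytes with the same charset. The names `Utf8RoundTrip`, `roundTrip` and `isValidUtf8` are illustrative only. The second method shows one way to detect invalid input instead of silently replacing it, by configuring a `CharsetDecoder` to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Lenient path: new String(...) substitutes U+FFFD for malformed input,
    // so the bytes that come back can differ from the bytes that went in.
    public static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict path: a decoder set to REPORT throws on malformed input
    // instead of substituting a replacement character.
    public static boolean isValidUtf8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, -3};
        System.out.println(Arrays.toString(roundTrip(bytes))); // [1, 2, -17, -65, -67]
        System.out.println(isValidUtf8(bytes));                // false
    }
}
```

If the bytes are arbitrary binary data rather than text, they should not go through a String at all; a charset such as ISO-8859-1, which maps every byte value to a character, would round-trip unchanged, but that is a workaround rather than a recommendation from the answer above.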
