Java 8 UTF-8 encoding issue (Java bug?)

猫巷女王i 2020-11-29 08:15

Creating a String from the same bytes with UTF-8 encoding gives different results on Java 7 and Java 8.

Run this code:

public static void encodingIssue() throws IOException {
    byte         


        
3 Answers
  • 2020-11-29 08:57

    “Modified UTF-8” encodes each surrogate of a pair (or even an unpaired char in that range) as an individual three-byte character. A decoder that claims to implement standard UTF-8 but accepts such “Modified UTF-8” sequences is in error. This seems to have been fixed in Java 8.

    You can reliably read such data using a method that is specified to use “Modified UTF-8”:

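    // Needs java.nio.ByteBuffer, java.io.ByteArrayInputStream, java.io.DataInputStream.
    // "array" is the byte array from the question.
    // readUTF() expects a two-byte big-endian length prefix before the data.
    ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
    bb.putShort((short) array.length).put(array);
    ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
    DataInputStream dis = new DataInputStream(bis);
    String str = dis.readUTF(); // specified to decode "Modified UTF-8"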
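
    The encoding side shows the same difference. The following is a minimal sketch, not part of the original answer: it encodes one supplementary code point (U+233B4, the example code point from RFC 3629) once with standard UTF-8 and once with DataOutputStream.writeUTF, which is specified to write “Modified UTF-8”. The class name and the hex helper are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Encoding {
    // Hypothetical helper: hex-dumps bytes[from..end], space separated.
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Encodes with writeUTF; the first two bytes of the result are the length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // One code point, stored in a Java String as a surrogate pair.
        String s = new String(Character.toChars(0x233B4));

        // Standard UTF-8: a single four-byte sequence.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8), 0)); // F0 A3 8E B4

        // Modified UTF-8: each surrogate becomes its own three-byte sequence.
        System.out.println(hex(modifiedUtf8(s), 2)); // ED A1 8C ED BE B4
    }
}
```

    The six-byte form is exactly the kind of sequence a strict UTF-8 decoder has to reject.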
    
  • 2020-11-29 09:04

    The value received in Java 1.6/1.7 is U+DEDC (a low surrogate).

    From RFC 3629:

    The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

    ...text elided...

    Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.

    Java 8 decodes these bytes to U+FFFD (REPLACEMENT CHARACTER) instead. So the old behavior looks like a bug that was fixed in Java 8.
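
    A rough sketch of how to observe this on Java 8 or later, not taken from the question: the input bytes ED BB 9C are an assumption (they are the three-byte form of the lone surrogate U+DEDC), and the class and method names are hypothetical. The String constructor silently substitutes U+FFFD, while a CharsetDecoder configured with CodingErrorAction.REPORT rejects the input outright.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Decoding {
    // Assumed input: ED BB 9C, the three-byte form of the lone surrogate U+DEDC.
    static final byte[] LONE_SURROGATE = {(byte) 0xED, (byte) 0xBB, (byte) 0x9C};

    // Returns true only if the bytes are well-formed standard UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The String constructor replaces malformed input instead of failing.
        String lenient = new String(LONE_SURROGATE, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(lenient.charAt(0))); // fffd on Java 8+

        // A reporting decoder refuses the same bytes.
        System.out.println(isValidUtf8(LONE_SURROGATE)); // false on Java 8+
    }
}
```

    Only the first decoded char is printed here, because the number of replacement characters produced for one malformed sequence is not something this sketch relies on.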

  • 2020-11-29 09:06

    That is a surrogate, right? I'm not a Unicode expert, but I don't think it has any meaning by itself. Java 8 changed to support Unicode 6.2, so maybe it's stricter about this. 65533 is 0xFFFD, the standard replacement character, which means "not representable". Is there a real case where you need to interpret this as a string? It seems like Unicode is saying that it doesn't make sense as a character anymore.
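
    Both values can be checked directly with the Character API. This is only a small sketch with a hypothetical class name; 0xDEDC is the value reported for Java 1.6/1.7 and 0xFFFD is the replacement character.

```java
class SurrogateCheck {
    // True if the char is one half of a surrogate pair and so not a character on its own.
    static boolean isLoneSurrogateCandidate(char c) {
        return Character.isSurrogate(c);
    }

    public static void main(String[] args) {
        char old = (char) 0xDEDC;         // value seen on Java 1.6/1.7
        char replacement = (char) 0xFFFD; // value seen on Java 8

        System.out.println(Character.isLowSurrogate(old));          // true
        System.out.println(isLoneSurrogateCandidate(replacement));  // false
        System.out.println((int) replacement);                      // 65533
    }
}
```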
