Converting from Windows 1252 to UTF8 in Java: null characters with CharsetDecoder/Encoder

执念已碎 2021-02-20 05:53

I know it's a very general question, but this is driving me mad.

I used this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);        
          


        
2 Answers
  • 2021-02-20 06:25

    I am not sure how you get a sequence of null characters. Try this:

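    String inputEncoding = "windows-1252";
    String outputEncoding = "UTF-8";
    Charset charsetInput = Charset.forName(inputEncoding);
    Charset charsetOutput = Charset.forName(outputEncoding);
    // The decoder was missing from the original snippet; it must match the input encoding
    CharsetDecoder decoder = charsetInput.newDecoder();
    CharsetEncoder encoder = charsetOutput.newEncoder();
    
    // Convert the byte array from inputEncoding into Java's internal UTF-16 (UCS2) chars.
    // getBytes(charsetInput) makes the sample bytes really be windows-1252,
    // instead of depending on the platform default charset.
    byte[] bufferToConvert = "Hello World! £€".getBytes(charsetInput);
    CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));
    
    // Convert the internal representation into outputEncoding
    ByteBuffer bbuf = encoder.encode(cbuf);
    // Read only bytes 0..limit()-1; the backing array may be longer and zero-padded
    System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));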
    

    prints

    Hello World! £€
    
  • 2021-02-20 06:30

    Your problem is that ByteBuffer.array() returns a direct reference to the array that backs the ByteBuffer, not a copy of the backing array's valid range. You have to respect bbuf.limit() (as Peter did in his answer) and use only the array content from index 0 to bbuf.limit()-1.
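
One way to avoid the stray null bytes is to copy only the valid range out of the encoded buffer instead of calling array() directly. The following is a minimal sketch, not the asker's original code: the class name Cp1252ToUtf8 and the helpers validBytes and convert are hypothetical names chosen for illustration, and the sample text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252ToUtf8 {
    // Hypothetical helper: copy only the bytes between position and limit.
    static byte[] validBytes(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the caller's position untouched
        return out;
    }

    // Hypothetical helper: decode with one charset, re-encode with another.
    static byte[] convert(byte[] input, Charset from, Charset to) {
        try {
            CharBuffer chars = from.newDecoder().decode(ByteBuffer.wrap(input));
            ByteBuffer encoded = to.newEncoder().encode(chars);
            return validBytes(encoded); // never encoded.array(): it may be padded
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid " + from, e);
        }
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String text = "Hello World! \u00A3\u20AC"; // pound sign and euro sign
        byte[] input = text.getBytes(cp1252);
        byte[] utf8 = convert(input, cp1252, StandardCharsets.UTF_8);
        System.out.println(input.length + " windows-1252 bytes, " + utf8.length + " UTF-8 bytes");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```

The returned array has exactly as many entries as the encoder wrote, so it can be passed to code that does not know about limit().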

    The reason for the extra 0 values in the backing array is a slight flaw in how the CharsetEncoder creates the resulting ByteBuffer. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems simple and correct (2 bytes/char). Based on this fixed value, the CharsetEncoder initially allocates a ByteBuffer of "string length * average bytes per character" bytes, e.g. 20 bytes for a 10-character string.

    The UCS2 CharsetEncoder, however, starts its output with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are required to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.
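
The arithmetic above can be checked with a small experiment. This is only a sketch under two assumptions: Java's "UTF-16" charset stands in for what the answer calls UCS2, and the class name EncoderAllocationDemo and the method measure are hypothetical. The initial estimate and the number of bytes used follow from the encoding itself; the length of the backing array depends on the encoder's internal growth strategy, so the code prints it rather than relying on a particular value.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderAllocationDemo {
    // Returns { initial size estimate, bytes actually used, backing array length }.
    static int[] measure(String text) {
        CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder();
        // The estimate described in the answer: length * average bytes per character
        int estimate = (int) (text.length() * encoder.averageBytesPerChar());
        try {
            ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(text));
            return new int[] { estimate, bbuf.limit(), bbuf.array().length };
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = measure("0123456789"); // 10 characters
        System.out.println("initial estimate:     " + r[0]);
        System.out.println("bytes used (limit):   " + r[1]); // 2-byte BOM + 2 bytes per char
        System.out.println("backing array length: " + r[2]); // the answer predicts 2*20+1
    }
}
```

Any gap between the last two numbers is exactly the run of zero bytes that appears when array() is used without limit().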
