Java UTF-8 encoding produces incorrect output

Question

In Java, I've been trying to write a String to a file using UTF-8 encoding which will later be read by another program written in a different programming language. While doing so I noticed that the bytes created when encoding a String into a byte array didn't seem to have the correct byte values.

I narrowed the problem down to the symbol "£", which seems to produce incorrect bytes when encoded to UTF-8:

byte[] byteArray = "£".getBytes(Charset.forName("UTF-8"));

// Print the bytes of the UTF-8 encoded string
// Mask each byte with 0xFF so it prints as an unsigned value (0-255)
for (byte signedByte : byteArray) {
  System.out.print((signedByte & 0xFF) + " ");
}

This outputs 6 bytes with the decimal values 239 190 130 239 189 163. In hex this is: ef be 82 ef bd a3

However, http://www.utf8-chartable.de/ says that the UTF-8 encoding of "£" in hex is c2 a3, so the output should be: 194 163

Other strings seem to produce correct bytes when encoded as UTF-8, so I'm wondering why Java is producing these 6 bytes for "£", and how I should go about properly converting my Strings to byte arrays using UTF-8 encoding.

I have also tried:

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8");
out.write("£");
out.close();

but this produced the same 6 bytes.

Answer 1:

I suspect the problem is that you're using a string literal in your Java code using an editor which writes it out in one encoding - but then you're compiling without specifying the same encoding. In other words, I suspect that your "£" string is not actually a single pound sign at all.
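
To illustrate how such a mismatch could produce exactly the six bytes from the question, here is a minimal sketch. It assumes the source file was saved as UTF-8 but read by the compiler as Shift_JIS. That charset is only a guess that happens to fit the reported bytes, not something the question confirms, and the class and method names (MojibakeDemo, misread, toHex) are made up for this example. It also assumes your JDK ships the Shift_JIS charset, which standard distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong charset,
    // which is roughly what a compiler does when it misreads a UTF-8 source file.
    static String misread(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrongCharset);
    }

    // Format bytes as space-separated, lower-case, two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the compiler's encoding was Shift_JIS.
        Charset assumed = Charset.forName("Shift_JIS");
        String garbled = misread("\u00a3", assumed);

        // The pound sign has turned into two different characters.
        System.out.println(garbled.length()); // 2
        // Encoding those two characters as UTF-8 gives the six bytes from the question.
        System.out.println(toHex(garbled.getBytes(StandardCharsets.UTF_8))); // ef be 82 ef bd a3
    }
}
```

The UTF-8 bytes c2 a3 are each a valid single-byte character in Shift_JIS, so the literal silently becomes a two-character string, and every later UTF-8 encoding step faithfully encodes the wrong characters.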

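This should be easy to validate by printing the UTF-16 code units of the literal. A genuine pound sign prints a single value, 163; anything else means the compiler read the literal incorrectly. For example: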

char[] chars = "£".toCharArray();
for (char c : chars) {
    System.out.println((int) c);
}

To take this out of the equation, you can specify the string using a pure-ASCII representation using a Unicode escape sequence:

String pound = "\u00a3";
// Now encode as before

I'm sure you'll then get the right bytes. For example:

import java.nio.charset.Charset;

class Test {
    public static void main(String[] args) throws Exception {
        String pound = "\u00a3";
        byte[] bytes = pound.getBytes(Charset.forName("UTF-8"));
        for (byte b : bytes) {
            System.out.println(b & 0xff); // 194, 163
        }
    }
}
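
The same approach should carry over to the file-writing case from the question. The following is a sketch only: the class name WritePound, the helper writeUtf8, and the temporary file used as the output location are all made up for illustration. It uses StandardCharsets.UTF_8 instead of the charset name string, which avoids a lookup by name.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WritePound {
    // Write the text to the target file as UTF-8; try-with-resources closes the writer.
    static void writeUtf8(Path target, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; the question's outputFile would go here.
        Path target = Files.createTempFile("pound", ".txt");

        // The Unicode escape keeps the source pure ASCII, so the compiler's
        // source encoding cannot affect the literal.
        writeUtf8(target, "\u00a3");

        // Read the file back and print each byte as an unsigned value.
        for (byte b : Files.readAllBytes(target)) {
            System.out.print((b & 0xFF) + " "); // 194 163
        }
        System.out.println();
    }
}
```

If you would rather keep the literal "£" in the source, the alternative is to tell the compiler which encoding the file uses, for example with javac -encoding UTF-8, assuming your editor really does save the file as UTF-8.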

Source: https://stackoverflow.com/questions/22120246/java-utf-8-encoding-produces-incorrect-output

Tags

java

string

encoding

utf-8