Question
In Java, I've been trying to write a String to a file using UTF-8 encoding; the file will later be read by a program written in a different language. While doing so, I noticed that the bytes produced when encoding a String into a byte array didn't have the correct values.
I narrowed the problem down to the symbol "£", which seems to produce incorrect bytes when encoded as UTF-8:
byte[] byteArray = "£".getBytes(Charset.forName("UTF-8"));
// Print out the Byte Array of the UTF-8 converted string
// Upcast byte values to print the bytes as unsigned
for (byte signedByte : byteArray) {
    System.out.print((signedByte & 0xFF) + " ");
}
This outputs 6 bytes with the decimal values 239 190 130 239 189 163; in hex, that is ef be 82 ef bd a3.
However, http://www.utf8-chartable.de/ says that "£" in hex is c2 a3, so the output should be 194 163.
Other strings seem to produce correct bytes when encoded as UTF-8, so I'm wondering why Java produces these 6 bytes for "£", and how I should properly convert my Strings to byte arrays using UTF-8 encoding.
I have also tried:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8");
out.write("£");
out.close();
but this produced the same 6 bytes.
Answer 1:
I suspect the problem is that the string literal in your Java code is saved by your editor in one encoding, but you're then compiling without specifying that same encoding. In other words, I suspect that your "£" string is not actually a single pound sign at all.
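The suspected mismatch can be reproduced without touching the compiler. The sketch below takes the UTF-8 bytes of a pound sign, decodes them with a different charset (as a compiler using the wrong source encoding would), and re-encodes the result as UTF-8. Shift_JIS is an assumption here: the question does not say which platform encoding was in use, but Shift_JIS happens to yield exactly the six bytes reported. The class and method names are hypothetical, and Charset.forName("Shift_JIS") assumes a JDK that ships the extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates source bytes saved as UTF-8 being read with the wrong charset,
    // then the resulting (wrong) string being encoded as UTF-8 at runtime.
    static int[] misreadThenEncode(String literal, Charset wrongCharset) {
        byte[] sourceBytes = literal.getBytes(StandardCharsets.UTF_8); // what the editor saved
        String misread = new String(sourceBytes, wrongCharset);        // what the compiler saw
        byte[] encoded = misread.getBytes(StandardCharsets.UTF_8);     // what getBytes produced
        int[] unsigned = new int[encoded.length];
        for (int i = 0; i < encoded.length; i++) {
            unsigned[i] = encoded[i] & 0xFF; // mask so values print as 0..255
        }
        return unsigned;
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS stands in for the unknown compile-time encoding.
        Charset assumed = Charset.forName("Shift_JIS");
        for (int b : misreadThenEncode("\u00a3", assumed)) {
            System.out.print(b + " "); // 239 190 130 239 189 163
        }
        System.out.println();
    }
}
```

If this is indeed the cause, telling the compiler the source encoding explicitly (for example, javac -encoding UTF-8) should also make the literal compile to a single pound sign.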
This should be easy to validate. For example:
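char[] chars = "£".toCharArray();
// A correctly compiled literal prints a single value, 163 (U+00A3);
// more than one value means the literal was decoded with the wrong encoding.
for (char c : chars) {
    System.out.println((int) c);
}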
To take the source encoding out of the equation, you can specify the string in pure ASCII using a Unicode escape sequence:
String pound = "\u00a3";
// Now encode as before
I'm sure you'll then get the right bytes. For example:
import java.nio.charset.Charset;

class Test {
    public static void main(String[] args) throws Exception {
        String pound = "\u00a3";
        byte[] bytes = pound.getBytes(Charset.forName("UTF-8"));
        for (byte b : bytes) {
            System.out.println(b & 0xff); // 194, 163
        }
    }
}
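Since the original goal was to write the string to a file, the same check can be applied to the file's contents. This is a minimal sketch, assuming Java 7 or later for java.nio.file and StandardCharsets; the class name, method name, and temporary file are hypothetical and only there for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Writes the text to the file as UTF-8 and returns the raw bytes read back.
    static byte[] writeAndReadBack(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical temporary file, used only for this demonstration.
        Path file = Files.createTempFile("pound", ".txt");
        try {
            // The Unicode escape keeps the literal independent of the source encoding.
            for (byte b : writeAndReadBack(file, "\u00a3")) {
                System.out.print((b & 0xFF) + " "); // 194 163
            }
            System.out.println();
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Using the StandardCharsets.UTF_8 constant instead of a charset name avoids a lookup by string and the checked UnsupportedEncodingException that the OutputStreamWriter(OutputStream, String) constructor declares.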
Source: https://stackoverflow.com/questions/22120246/java-utf-8-encoding-produces-incorrect-output