Why is conversion from UTF-8 to ISO-8859-1 not the same in Windows and Linux?

Posted by 回眸只為那壹抹淺笑 on 2019-12-02 04:43:16

Question


I have the following code in a jar file to convert from UTF-8 to ISO-8859-1. When I run the jar on Windows I get one result, and on CentOS I get another. Does anyone know why?

public static void main(String[] args) {

  try {

    String x = "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »";

    Charset utf8charset = Charset.forName("UTF-8");
    Charset iso88591charset = Charset.forName("ISO-8859-1");

    ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes());
    CharBuffer data = utf8charset.decode(inputBuffer);

    ByteBuffer outputBuffer = iso88591charset.encode(data);
    byte[] outputData = outputBuffer.array();

    String z = new String(outputData);

    System.out.println(z);
  }
  catch(Exception e) {
    System.out.println(e.getMessage());
  }
}

In Windows, java -jar test.jar > test.txt creates a file containing: Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »

but in CentOS I get: �?, ä, �?, é, �?, ö, �?, ü, �?, «, »


Answer 1:


These two lines

ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes());

String z = new String(outputData);

both use the platform default encoding, so their results differ from one system to another.
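To see what the platform default means in practice, here is a minimal sketch that prints the default charset and compares the bytes produced with and without an explicit charset. The class name and the hex helper are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Hypothetical helper: renders bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Ä" written as an escape so the source file encoding does not matter.
        String s = "\u00C4";

        // This line varies by platform and JVM settings.
        System.out.println("default charset:      " + Charset.defaultCharset());
        // getBytes() with no argument follows the default charset.
        System.out.println("getBytes():           " + hex(s.getBytes()));
        // These two are the same everywhere.
        System.out.println("getBytes(UTF_8):      " + hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 84
        System.out.println("getBytes(ISO_8859_1): " + hex(s.getBytes(StandardCharsets.ISO_8859_1))); // C4
    }
}
```

Running this on both machines should show at a glance whether their default charsets differ.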


This runs as expected on both Windows and Linux because it avoids platform-specific conversions:

String x = "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »";

Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");

ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes(utf8charset));
CharBuffer data = utf8charset.decode(inputBuffer);

ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();

String z = new String(outputData, iso88591charset);

System.out.println(z);

prints

Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »
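One caveat: System.out.println also encodes its output with a platform-dependent charset, so what ends up in a redirected file can still vary. A minimal sketch of pinning the output encoding, assuming UTF-8 is what the file should contain (the class and helper names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {

    // Hypothetical helper: returns the bytes that printing `text` produces
    // when the PrintStream is created with the named charset.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String z = "\u00C4, \u00E4"; // "Ä, ä"

        // Wrap System.out so that redirected output is always written as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(z);

        // The same text occupies a different number of bytes per charset.
        System.out.println(printedBytes(z, "UTF-8").length);      // 6
        System.out.println(printedBytes(z, "ISO-8859-1").length); // 4
    }
}
```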



Answer 2:


You should first and foremost get the string into the correct internal representation in Java before even thinking about output. That is, the following should hold:

z.equals("Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »") == true

The above can be verified without any output encoding issues, because it simply prints true or false.
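When the comparison prints false, it helps to see what the string actually contains, again without relying on any output encoding. A minimal sketch that dumps the code points as plain ASCII (the class and method names are made up for illustration):

```java
class CodePointDump {

    // Hypothetical helper: lists every code point of s in U+XXXX notation.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The intended text "Ä, ä".
        System.out.println(codePoints("\u00C4, \u00E4"));
        // prints: U+00C4 U+002C U+0020 U+00E4

        // The UTF-8 bytes of "Ä" mis-read as Windows-1252.
        System.out.println(codePoints("\u00C3\u201E"));
        // prints: U+00C3 U+201E
    }
}
```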

In Windows you already achieved this with

ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes());
CharBuffer data = utf8charset.decode(inputBuffer);

This works because all you need to go from the mis-decoded text "Ã„, Ã¤, Ã‰, Ã©, Ã–, Ã¶, Ãœ, Ã¼, ÃŸ, Â«, Â»" to "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »" is:

// An explicit Windows-1252 charset works on CentOS too
ByteBuffer inputBuffer = ByteBuffer.wrap(x.getBytes(windows1252));
CharBuffer data = utf8charset.decode(inputBuffer);

After this you do something with ISO-8859-1, which is futile: barely half the characters in your original string can be represented in ISO-8859-1, and in any case you are already done at this point. You can delete all the code after utf8charset.decode(inputBuffer).
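The claim about ISO-8859-1 can be checked with CharsetEncoder.canEncode. The sketch below assumes the input is UTF-8 text that was mis-read as Windows-1252, so that "Ä" arrived as "Ã" followed by "„"; the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class Latin1Coverage {

    // Hypothetical helper: returns the characters of s that ISO-8859-1 cannot encode.
    static String unencodable(String s) {
        CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder();
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (!encoder.canEncode(c)) sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: the UTF-8 bytes of "Ä, ä" mis-read as Windows-1252.
        String misdecoded = "\u00C3\u201E, \u00C3\u00A4";
        // U+201E has no ISO-8859-1 encoding, so encoding replaces it with '?'.
        System.out.println(unencodable(misdecoded).length()); // 1

        // The intended text fits in ISO-8859-1 completely.
        System.out.println(unencodable("\u00C4, \u00E4").length()); // 0
    }
}
```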

So now your code could look like:

String x = "Ã„, Ã¤, Ã‰, Ã©, Ã–, Ã¶, Ãœ, Ã¼, ÃŸ, Â«, Â»"; // UTF-8 bytes that were mis-read as Windows-1252

Charset windows1252 = Charset.forName("Windows-1252");
Charset utf8charset = Charset.forName("UTF-8");

byte[] bytes = x.getBytes(windows1252);
String z = new String(bytes, utf8charset);

// Check that the strings are internally equal, so you know at least that
// the code is working. (Still wondering why you didn't just have this
// literal to begin with.)
System.out.println(z.equals("Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »"));
System.out.println(z);



Answer 3:


Three possibilities spring to mind:

  • The encoding you're actually using for your source code may differ by platform
  • The encoding the compiler expects by default may differ by platform (you can specify it on the command line)
  • The platform default encoding used when you call x.getBytes() may differ by platform
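The first two possibilities can be ruled out by writing the literal with Unicode escapes, which read the same under any source encoding (the alternative is to pass the real encoding to the compiler, for example javac -encoding UTF-8). A minimal sketch, with a hypothetical class name:

```java
class EscapedLiteral {

    // The question's text "Ä, ä, É, é, Ö, ö, Ü, ü, ß, «, »" written with Unicode
    // escapes, so neither the editor's nor the compiler's encoding can alter it.
    static final String TEXT =
        "\u00C4, \u00E4, \u00C9, \u00E9, \u00D6, \u00F6, \u00DC, \u00FC, \u00DF, \u00AB, \u00BB";

    public static void main(String[] args) {
        // 11 characters plus 10 ", " separators.
        System.out.println(TEXT.length()); // 31
        // The first character is U+00C4 regardless of platform.
        System.out.println(Integer.toHexString(TEXT.charAt(0))); // c4
    }
}
```

If the program behaves differently with the escaped literal than with the original one, the source or compiler encoding is involved.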

It's not clear in what way you're trying to convert from UTF-8 to ISO-8859-1 - because your original data is actually just a String. You're treating the results of calling x.getBytes() as if it were UTF-8-encoded data, but it's just whatever the platform default is...

Likewise when you write:

String z = new String(outputData);

... that's using the platform default encoding. Don't do that.

You don't need the byte buffer stuff at all: just encode using text.getBytes(encoding) and decode using new String(data, encoding).
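As a minimal sketch of that advice, assuming the input really is a byte array holding UTF-8 data (the class and method names are made up for illustration, and StandardCharsets needs Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

class Transcode {

    // Hypothetical helper: re-encodes UTF-8 bytes as ISO-8859-1 bytes.
    static byte[] utf8ToLatin1(byte[] utf8Bytes) {
        // Decode with an explicit charset...
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        // ...and encode with an explicit charset. No platform default is involved.
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Ää" encoded as UTF-8: two bytes per character.
        byte[] utf8 = {(byte) 0xC3, (byte) 0x84, (byte) 0xC3, (byte) 0xA4};

        byte[] latin1 = utf8ToLatin1(utf8);
        System.out.println(latin1.length); // 2

        // Decoding the result as ISO-8859-1 gives back the same text.
        String roundTrip = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(roundTrip.equals("\u00C4\u00E4")); // true
    }
}
```

Characters that ISO-8859-1 cannot represent are replaced with '?' by getBytes, so this conversion loses data for text outside Latin-1.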



Source: https://stackoverflow.com/questions/13824859/why-is-conversion-from-utf-8-to-iso-8859-1-not-the-same-in-windows-and-linux
