Japanese Character Encoding in Java

Submitted by 北城以北 on 2021-02-07 14:14:23

Question


Here's my problem. I'm using Java Apache POI to read an Excel (.xls or .xlsx) file and display its contents. The spreadsheet contains some Japanese characters, and all of them come out as "???" in my output. I tried Shift-JIS, UTF-8 and many other encodings, but none of them works. Here's my encoding code:

public String encoding(String str) throws UnsupportedEncodingException{
  String Encoding = "Shift_JIS";
  return this.changeCharset(str, Encoding);
}
public String changeCharset(String str, String newCharset) throws UnsupportedEncodingException {
  if (str != null) {
    byte[] bs = str.getBytes();
    return new String(bs, newCharset);
  }
  return null;
}

I pass every string I read to encoding(str), but when I print the return value I still get something like "???" (see below) instead of Japanese characters (Hiragana, Katakana or Kanji).

title-jp=???

Can anyone help me with this? Thank you so much.


Answer 1:


Your changeCharset method seems strange. String objects in Java are best thought of as not having a specific character set. They use Unicode and so can represent all characters, not only one regional subset. Your method says: turn the string into bytes using my system's default character set (whatever that may be), and then try to interpret those bytes using some other character set (specified in newCharset), which probably won't work. If the default character set cannot represent Japanese, each character is already replaced by "?" during the first step, and no later decoding can bring it back. If you convert to bytes in one encoding, you should read those bytes back with the same encoding.
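To make the failure concrete, here is a minimal sketch of the mismatch described above. It assumes a JDK that ships the Shift_JIS charset, and it uses ISO-8859-1 as a stand-in for a system default that cannot represent Japanese; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Encode the string with one charset, then decode the bytes with another.
    static String reinterpret(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // the three kanji for "Japanese language"
        Charset shiftJis = Charset.forName("Shift_JIS"); // assumed to be available

        // Assumption: ISO-8859-1 plays the role of the platform default here.
        // It has no Japanese characters, so each one is encoded as the byte for '?'.
        String lossy = reinterpret(original, StandardCharsets.ISO_8859_1, shiftJis);

        // Same charset on both sides: the text survives the round trip.
        String intact = reinterpret(original, shiftJis, shiftJis);

        System.out.println("mismatched result is ???: " + lossy.equals("???"));
        System.out.println("matched round trip equal: " + intact.equals(original));
    }
}
```

The information is lost in getBytes, not in the String constructor, which is why trying different values for newCharset makes no difference.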

Update:

To convert a String to Shift-JIS (a regional encoding commonly used in Japan) you can say:

byte[] jis = str.getBytes("Shift_JIS");

If you write those bytes into a file, and then open the file in Notepad on a Windows computer where the regional settings are all Japan-centric, Notepad will display it in Japanese (having nothing else to go on, it will assume the text is in the system's local encoding).

However, you could equally well save it as UTF-8 (prefixed with the 3-byte UTF-8 byte order mark, EF BB BF) and Notepad will also display it as Japanese. Shift-JIS is only one way of representing Japanese text as bytes.
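As a small illustration of that last point, the sketch below encodes the same one-character string both ways and prints the resulting bytes in hex. It assumes the Shift_JIS charset is available in the running JDK; the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TwoEncodings {
    // Render a byte array as uppercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u3042"; // HIRAGANA LETTER A
        Charset shiftJis = Charset.forName("Shift_JIS"); // assumed to be available

        byte[] jis = text.getBytes(shiftJis);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("Shift_JIS: " + toHex(jis));  // 82A0
        System.out.println("UTF-8:     " + toHex(utf8)); // E38182

        // Decoding each array with the charset that produced it gives back the same String.
        System.out.println(new String(jis, shiftJis).equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

The bytes differ, but both decode to the same String as long as the reader uses the charset the writer used.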




Answer 2:


I suspect you shouldn't be doing this in the first place. If it really is Apache POI's fault, then you'll need to get the original raw bytes from the data, not just use the system default encoding.

On the other hand, I think it's entirely likely that Apache POI has managed to do the right thing, and it's just an output problem. I suggest you dump the original string you've got (removing your encoding method entirely) in terms of its Unicode code points, e.g.

 // Print each UTF-16 code unit of the string in hex
 for (int i = 0; i < text.length(); i++) {
     System.out.println("U+" + Integer.toHexString(text.charAt(i)));
 }

Then check those Unicode values against the ones at the Unicode web site.
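If the code points turn out to be correct and only the display is wrong, one option is to write through a PrintStream with an explicit encoding instead of relying on the platform default. This is only a sketch: it assumes that whatever reads the output (terminal, IDE console or file viewer) interprets it as UTF-8, and the class and helper names are hypothetical.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Output {
    // Wrap any output stream so that text written to it is encoded as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Katakana for "title", written as escapes so the source file encoding does not matter.
        out.println("title-jp=\u30BF\u30A4\u30C8\u30EB");
    }
}
```

If the console still shows "?" or garbage with this in place, the console's own encoding or font is the next thing to check rather than the Java code.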



Source: https://stackoverflow.com/questions/7698794/japanese-character-encoding-in-java
