Java; Trying to convert a String which contains ISO-8859-1 encoding to UTF-8 but file is UTF-8

Posted by 不羁的心 on 2019-12-06 08:18:29

In simple words, if the text was labelled charset=iso-8859-1 and decoded that way, but the bytes are really UTF-8, re-encode the String as ISO-8859-1 and decode the resulting bytes as UTF-8:

 String response = new String(input.getBytes("ISO-8859-1"), "UTF-8");
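
As a rough sketch of what that one-liner does, assuming the String really was produced by decoding UTF-8 bytes as ISO-8859-1 by mistake. The class and method names below are made up for illustration, and `StandardCharsets` is used instead of charset names so no checked exception is involved.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Undo a mistaken ISO-8859-1 decode: get the original bytes back,
    // then decode them as the UTF-8 they really were.
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C1 (Á) encoded as UTF-8 is the two bytes C3 81.
        byte[] utf8 = "\u00C1".getBytes(StandardCharsets.UTF_8);
        // Decoding those bytes as ISO-8859-1 yields two chars: U+00C3 and U+0081.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());                  // 2
        System.out.println(repair(garbled).equals("\u00C1"));  // true
    }
}
```

The round trip is lossless only because ISO-8859-1 maps every byte value to exactly one char. Applying `repair` to a String that was decoded correctly in the first place will damage its non-ASCII characters.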

I think the fundamental problem here is your expectations.

If I understand you correctly, you expect to be able to change Á to à by changing character encodings. That cannot happen. Those are different characters, i.e. different code points: Á is Unicode code point U+00C1 (C1 in ISO-8859-1) and à is U+00C3 (C3 in ISO-8859-1).

So when you transcode an Á from ISO-8859-1 to Unicode to UTF-8, you should get exactly the same character, Á. If you don't, the translation is broken.
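
To make that concrete, here is a small sketch that transcodes the single ISO-8859-1 byte C1 to UTF-8. The bytes change, the character does not. `TranscodeDemo`, `latin1ToUtf8` and `hex` are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Transcode ISO-8859-1 bytes to UTF-8 bytes by going through a String.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated upper-case hex, e.g. "C3 81".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xC1 };      // Á in ISO-8859-1
        byte[] utf8 = latin1ToUtf8(latin1);
        System.out.println(hex(latin1));      // C1
        System.out.println(hex(utf8));        // C3 81
        // Decoding the UTF-8 bytes gives back the same code point, U+00C1.
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.codePointAt(0) == 0x00C1); // true
    }
}
```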

You also expect MÉXICO to translate to MÉXICO ... which seems totally bizarre to me. Perhaps there's a problem in your transcription of the characters into the Question ...

Now if the lexicography rules for your language / region say that Á and à are actually equivalent, then it would be reasonable to "normalize" to a preferred form. However, it is not the role of character encoding / decoding to do such locale-related translations. You need to code it yourself ... or find some other library that does it.


Messing around at the byte level (encoding with one charset and decoding with a different one) is not going to "fix" this. If anything it is going to make things worse. Your messing around is generating byte sequences that can't be mapped to the target encoding scheme ... and hence the question marks.
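
A short sketch of where such replacement characters can come from, using only `String.getBytes(Charset)` and `new String(byte[], Charset)`, both of which silently substitute a replacement for anything they cannot map. The class name is hypothetical, and whether a console shows U+FFFD as a question mark depends on the console.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encoding: characters the target charset cannot represent become
    // the charset's replacement byte, which for US-ASCII is '?'.
    static byte[] encodeAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    // Decoding: malformed byte sequences become U+FFFD, the Unicode
    // replacement character.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = encodeAscii("\u00C1");          // Á has no ASCII form
        System.out.println((char) ascii[0]);           // ?

        // The lone byte C1 (Á in ISO-8859-1) is not valid UTF-8.
        String decoded = decodeUtf8(new byte[] { (byte) 0xC1 });
        System.out.println(decoded.equals("\uFFFD"));  // true
    }
}
```

In both cases the original information is gone after the conversion, so no later re-encoding can bring the Á back.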

I finally got it to show the way I specified in the question; I was just using the wrong charset.

intento2 = new String(input.getBytes(Charset.forName("UTF-8")), Charset.forName("Windows-1252"));

This displayed it the way I needed it.
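
For reference, a sketch of what that conversion produces, assuming the input is the correctly decoded text MÉXICO. It deliberately reads UTF-8 bytes with the wrong charset, so each non-ASCII character turns into two Windows-1252 characters. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8AsWindows1252 {
    // Encode as UTF-8, then (deliberately) decode the bytes as Windows-1252.
    static String misdecode(String input) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String result = misdecode("M\u00C9XICO");   // MÉXICO
        // É is C3 89 in UTF-8; Windows-1252 reads C3 as U+00C3 and 89 as U+2030.
        for (int i = 0; i < result.length(); i++) {
            System.out.printf("U+%04X ", (int) result.charAt(i));
        }
        System.out.println();
        System.out.println(result.length());        // 7
    }
}
```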

When loading any data from binary representation, you must know what encoding is used for that representation in order to interpret or decode it. If you assume the wrong encoding, then you will probably get garbage -- something that does not make sense.

In order to construct a String from binary data, you have to specify the encoding of the source data. Otherwise you may get garbage -- the constructed String may not contain the characters represented in the source data.

More specifically for your case, if you try to load UTF-8 data using the ISO-8859-1 encoding, you may get garbage. I say "may" because these two encodings overlap: the low 128 code points (0x00 to 0x7F, the ASCII range) are encoded identically in both. If only those code points are used, the decoding will appear to "work", but since nothing guarantees that the data stays within that range, it should not be relied on.
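
A quick way to see that overlap, as a sketch: decode the same bytes with both charsets and compare. `OverlapDemo` and `decodesIdentically` are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

class OverlapDemo {
    // True when the bytes decode to the same String under UTF-8 and ISO-8859-1.
    static boolean decodesIdentically(byte[] bytes) {
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        return asUtf8.equals(asLatin1);
    }

    public static void main(String[] args) {
        // ASCII-only text: every byte is below 0x80, so both decoders agree.
        byte[] plain = "MEXICO".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodesIdentically(plain));     // true

        // One accented letter is enough to make the results differ.
        byte[] accented = "M\u00C9XICO".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodesIdentically(accented));  // false
    }
}
```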

If you are telling Eclipse to decode your source files using UTF-8, then you should only edit those source files using an editor capable of and configured for editing using UTF-8 encoding.

One more point: The internal representation of String data in Java is UTF-16. Therefore, it is incorrect to say that you have Strings which "contain ISO-8859-1 encoding". If you have a String, you always have UTF-16 data. Whether that data is correct depends on how you constructed the String, as discussed above.
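
The following sketch illustrates the distinction: a String is a sequence of UTF-16 code units, and a byte encoding only comes into play when you ask for bytes. The class and helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringIsUtf16 {
    // Number of bytes the String occupies when encoded with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u00C1"; // Á

        // Inside the String there is one UTF-16 code unit, 0x00C1.
        System.out.println(s.length());                                  // 1
        System.out.println(Integer.toHexString(s.charAt(0)));            // c1

        // The same String yields different bytes depending on the target charset.
        System.out.println(byteLength(s, StandardCharsets.ISO_8859_1));  // 1
        System.out.println(byteLength(s, StandardCharsets.UTF_8));       // 2
        System.out.println(byteLength(s, StandardCharsets.UTF_16BE));    // 2
    }
}
```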
