Converting UTF-8 to ISO-8859-1 in Java

Submitted by 放荡痞女 on 2019-11-28 18:55:50

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them as numeric character references, as shown here:

public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C) is encoded as &#x201c; (Integer.toHexString produces lowercase hex digits, which HTML treats the same as uppercase). A couple of other arbitrary code points are likewise encoded.

Care needs to be taken with this approach. If your text also needs to be escaped for HTML, do that before running the above code; otherwise the ampersands of the generated entities end up being escaped.
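The ordering issue can be illustrated with a minimal sketch. Both helpers here are assumptions made for illustration: escapeHtml is a hypothetical, deliberately incomplete HTML escaper, and escapeNonLatin is a compact stand-in for the HtmlEncoder method above.

```java
final class EscapeOrderDemo {
    private EscapeOrderDemo() {}

    // Hypothetical minimal HTML escaper: handles only &, <, > and ".
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Compact stand-in for HtmlEncoder.escapeNonLatin: code points outside
    // Basic Latin (U+0000..U+007F) become hexadecimal character references.
    static String escapeNonLatin(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Tom & \u201CJerry\u201D";
        // Correct order: HTML-escape first, then add numeric references.
        System.out.println(escapeNonLatin(escapeHtml(text)));
        // Wrong order: the ampersands of the references get escaped as well.
        System.out.println(escapeHtml(escapeNonLatin(text)));
    }
}
```

In the first call the references stay intact; in the second, each one starts with &amp; and the browser would display the reference text literally instead of the character.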

Depending on your default encoding, the following lines could cause a problem:

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

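In Java, String and char always hold UTF-16 code units. Other encodings come into play only when you convert characters to bytes or bytes back to characters. new String(latin1) decodes with the platform default charset, so if your default encoding is UTF-8, the latin1 buffer is treated as UTF-8. Some Latin-1 byte sequences are not valid UTF-8, and those bytes are decoded to the replacement character, which usually shows up as ?.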
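The mismatch can be reproduced in a short sketch. The class name CharsetMismatchDemo and the sample string are illustrative assumptions, and UTF-8 is passed explicitly to stand in for a UTF-8 platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    private CharsetMismatchDemo() {}

    // Encodes text as ISO-8859-1, then decodes those bytes with decodeWith.
    static String roundTrip(String text, Charset decodeWith) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, decodeWith);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // U+00E9 is the single byte 0xE9 in ISO-8859-1
        // A lone 0xE9 is not a valid UTF-8 sequence, so decoding substitutes U+FFFD.
        String wrong = roundTrip(text, StandardCharsets.UTF_8);
        System.out.println(wrong.equals(text));           // false
        System.out.println(wrong.indexOf('\uFFFD') >= 0); // true
    }
}
```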

robinst

With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):

import java.util.PrimitiveIterator;

public final class HtmlEncoder {
    private HtmlEncoder() {
    }

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
                                                          T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}

When you instantiate your String object, you need to indicate which encoding to use.

So replace:

return new String(latin1);

with

return new String(latin1, "ISO-8859-1");
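As a sketch of the same fix, the StandardCharsets constants (available since Java 7) avoid the checked UnsupportedEncodingException that the string-named overloads declare. The class and method names here are hypothetical. Note that getBytes replaces characters that ISO-8859-1 cannot represent with ?, which is why text containing such characters still needs the escaping step from the HtmlEncoder answers.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    private Latin1RoundTrip() {}

    // Encodes to ISO-8859-1 bytes and decodes them with the same charset.
    static String viaLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Characters inside ISO-8859-1 survive the round trip unchanged.
        System.out.println(viaLatin1("caf\u00E9").equals("caf\u00E9")); // true
        // U+201C is not in ISO-8859-1, so getBytes substitutes '?'.
        System.out.println(viaLatin1("\u201Cquote").equals("?quote")); // true
    }
}
```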