Java DOM transforming and parsing arbitrary strings with invalid XML characters?

南笙 2021-01-19 06:02

First of all, I want to mention that this is not a duplicate of "How to parse invalid (bad / not well-formed) XML?" because I don't have a given invalid (or not well-formed) X

3 Answers
  •  生来不讨喜
    2021-01-19 06:36

    I think the simplest solution is to use XML 1.1 (supported by org.w3c.dom) by starting the document with this XML declaration:

    <?xml version="1.1" encoding="UTF-8" standalone="yes"?>

    According to Wikipedia, the only invalid characters in XML 1.1 are U+0000, the surrogates, U+FFFE and U+FFFF.
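To illustrate what XML 1.1 buys you, here is a minimal sketch that parses a control character written as a character reference, which an XML 1.0 parser would reject. It assumes the JDK's default JAXP parser, which handles XML 1.1; `Xml11ParseSketch` and `rootText` are names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class Xml11ParseSketch {
    // The XML 1.1 declaration from the answer above.
    static final String XML11_DECL =
            "<?xml version=\"1.1\" encoding=\"UTF-8\" standalone=\"yes\"?>";

    private Xml11ParseSketch() {}

    // Wraps an already-escaped body in a <root> element, parses it as XML 1.1
    // and returns the text content of the root element.
    static String rootText(String escapedBody) {
        String xml = XML11_DECL + "<root>" + escapedBody + "</root>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Parser configuration, I/O and well-formedness errors all end up here.
            throw new IllegalStateException("XML 1.1 parse failed", e);
        }
    }
}
```

Calling `Xml11ParseSketch.rootText("a&#1;b")` should give back a three-character string with U+0001 in the middle. Without the 1.1 declaration, the same input should be rejected as not well-formed.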

    This code snippet ensures you always get a correct XML 1.1 string by omitting the illegal characters (which might not be what you are looking for if you need exactly the same string back):

    public static String escape(String orig) {
        StringBuilder builder = new StringBuilder();

        for (char c : orig.toCharArray()) {
            if (c == 0x0 || c == 0xfffe || c == 0xffff || (c >= 0xd800 && c <= 0xdfff)) {
                // Not allowed in XML 1.1 at all: drop it.
                continue;
            } else if (c == '\'') {
                builder.append("&apos;");
            } else if (c == '"') {
                builder.append("&quot;");
            } else if (c == '&') {
                builder.append("&amp;");
            } else if (c == '<') {
                builder.append("&lt;");
            } else if (c == '>') {
                builder.append("&gt;");
            } else if (c <= 0x1f) {
                // Control characters are written as numeric character references.
                builder.append("&#" + ((int) c) + ";");
            } else {
                builder.append(c);
            }
        }

        return builder.toString();
    }
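Two caveats with a char-by-char loop like the one above: it drops every surrogate, including well-formed pairs that encode valid characters above U+FFFF (emoji, for example), and it leaves U+007F to U+009F as literal characters, although XML 1.1 only permits most of them as character references. A sketch of a code-point-based variant that addresses both, assuming you want to keep supplementary characters; `Xml11Escaper` is a made-up name:

```java
// Hypothetical variant of escape(), iterating over code points instead of chars.
final class Xml11Escaper {
    private Xml11Escaper() {}

    static String escape(String orig) {
        StringBuilder out = new StringBuilder(orig.length() + 16);
        for (int i = 0; i < orig.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as its own value.
            int cp = orig.codePointAt(i);
            i += Character.charCount(cp);

            if (cp == 0x0 || cp == 0xFFFE || cp == 0xFFFF
                    || (cp >= 0xD800 && cp <= 0xDFFF)) {
                continue; // not allowed in XML 1.1 (here: only unpaired surrogates)
            }
            switch (cp) {
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                default:
                    if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)) {
                        // C0 and C1 controls go out as numeric character references.
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }
}
```

Like the original, this silently removes the characters XML 1.1 cannot represent, so it is still not a lossless round trip for arbitrary strings.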
    
