Java DOM transforming and parsing arbitrary strings with invalid XML characters?

南笙 2021-01-19 06:02

First of all I want to mention that this is not a duplicate of How to parse invalid (bad / not well-formed) XML? because I don't have a given invalid (or not well-formed) X

3 Answers
  • 2021-01-19 06:36

    I think the simplest solution is to use XML 1.1 (supported by org.w3c.dom) by starting the document with this XML declaration:

    <?xml version="1.1" encoding="UTF-8" standalone="yes"?>

    According to Wikipedia, the only invalid characters in XML 1.1 are U+0000, surrogates, U+FFFE and U+FFFF.

    This code snippet ensures you always get a correct XML 1.1 string by omitting illegal characters (this might not be what you are looking for if you need the exact same string back):

    public static String escape(String orig) {
        StringBuilder builder = new StringBuilder();
    
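        // Iterate by code point so that valid surrogate pairs (supplementary characters) are kept.
        for (int i = 0, c = 0; i < orig.length(); i += Character.charCount(c)) {
            c = orig.codePointAt(i);
            if (c == 0x0 || c == 0xfffe || c == 0xffff || (c >= 0xd800 && c <= 0xdfff)) {
                // Not allowed in XML 1.1, not even as a character reference; unpaired surrogates end up here.
                continue;
            } else if (c == '\'') {
                builder.append("&apos;");
            } else if (c == '"') {
                builder.append("&quot;");
            } else if (c == '&') {
                builder.append("&amp;");
            } else if (c == '<') {
                builder.append("&lt;");
            } else if (c == '>') {
                builder.append("&gt;");
            } else if (c <= 0x1f || (c >= 0x7f && c <= 0x9f)) {
                // Control characters are written as numeric character references.
                builder.append("&#").append(c).append(';');
            } else {
                builder.appendCodePoint(c);
            }
        }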
    
        return builder.toString();
    }
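
    To illustrate the reading side of the XML 1.1 approach, here is a minimal sketch. The class and method names (Xml11Demo, parseText) are made up for this example, and it assumes the parser behind DocumentBuilderFactory supports XML 1.1, as the one bundled with the JDK does. It parses a document that carries the XML 1.1 declaration and contains U+0001 as a character reference:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Xml11Demo {
    /** Wraps already-escaped text in an XML 1.1 document, parses it and returns the element text. */
    public static String parseText(String escapedText) {
        String xml = "<?xml version=\"1.1\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<element>" + escapedText + "</element>";
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Hypothetical error handling for the sketch; real code would be more specific.
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // "&#1;" is rejected by an XML 1.0 parser but is a legal reference in XML 1.1.
        String text = parseText("a&#1;b &lt; c");
        System.out.println(text.equals("a\u0001b < c"));
    }
}
```

    The same input with an XML 1.0 declaration would be rejected, which is why the declaration matters here.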
    
  • 2021-01-19 06:44

    One technique is to encode the whole string as Base64-encoded-UTF8.

    But if the "special" characters are rare, that's a significant sacrifice in readability and file size.
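
    A minimal sketch of the Base64 approach using java.util.Base64 follows. The class name Base64Text and its two helper methods are invented for illustration. The Base64 alphabet contains only letters, digits, '+', '/' and '=', so the encoded string can be placed in a text node as it is:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Text {
    /** Encodes the string's UTF-8 bytes as Base64, which is always valid XML text. */
    public static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Decodes Base64 back into the original string. */
    public static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "text" + '\u0000' + "text<text&text";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

    One caveat: a string containing unpaired surrogates does not survive the UTF-8 step, because getBytes replaces them with '?'. If such strings can occur, the char values would have to be encoded some other way.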

    Another technique is to represent special characters as processing instructions, for example <?U 0000?> for codepoint 0.
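
    The processing-instruction idea could look roughly like the sketch below. The class PiEscaping, its method names, and the choice of a four-digit hexadecimal payload are assumptions made for this example. Valid runs of text become text nodes, and every code point that XML 1.0 cannot hold becomes a <?U ...?> node:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

public class PiEscaping {
    /** True for code points allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Appends text nodes to the element, using a "U" processing instruction for invalid code points. */
    public static void write(Element element, String text) {
        Document document = element.getOwnerDocument();
        StringBuilder run = new StringBuilder();
        for (int i = 0, cp = 0; i < text.length(); i += Character.charCount(cp)) {
            cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                run.appendCodePoint(cp);
                continue;
            }
            if (run.length() > 0) {
                element.appendChild(document.createTextNode(run.toString()));
                run.setLength(0);
            }
            element.appendChild(document.createProcessingInstruction("U", String.format("%04X", cp)));
        }
        if (run.length() > 0) {
            element.appendChild(document.createTextNode(run.toString()));
        }
    }

    /** Rebuilds the original string from the element's text nodes and "U" processing instructions. */
    public static String read(Element element) {
        StringBuilder result = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                    && "U".equals(((ProcessingInstruction) child).getTarget())) {
                String hex = ((ProcessingInstruction) child).getData().trim();
                result.appendCodePoint(Integer.parseInt(hex, 16));
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                result.append(child.getNodeValue());
            }
        }
        return result.toString();
    }

    /** Convenience helper for the sketch: writes the text into a fresh element and reads it back. */
    public static String roundTrip(String text) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element element = document.createElement("element");
            document.appendChild(element);
            write(element, text);
            return read(element);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "text" + '\u0000' + "text<text&text";
        System.out.println(roundTrip(original).equals(original));
    }
}
```

    The cost of this design is that the element no longer has a single text child, so getTextContent() alone does not return the original string. Readers have to walk the child nodes as read() does.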

    Another would be to use backslash escaping, for example \u0000 for codepoint 0, and of course \\ for the backslash itself. This has the advantage that you can probably find existing library routines that do this for you (for example JSON conversion libraries). I can't imagine why your requirements say you can't use such libraries; but if you really can't, then it's not hard to write the code yourself.
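
    If no library may be used, backslash escaping can be hand-written along these lines. This is only a sketch: the class BackslashEscaping and its methods are hypothetical, and it works on UTF-16 units so that each escape has exactly four hex digits. Valid surrogate pairs are passed through and unpaired surrogates are escaped:

```java
public class BackslashEscaping {
    /** True if the UTF-16 unit at index i may appear literally in XML 1.0 text. */
    private static boolean isAllowed(String s, int i) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            return i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1));
        }
        if (Character.isLowSurrogate(c)) {
            return i > 0 && Character.isHighSurrogate(s.charAt(i - 1));
        }
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    /** Doubles every backslash and writes disallowed units as backslash, 'u' and four hex digits. */
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.append("\\\\");
            } else if (isAllowed(s, i)) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Reverses escape(); throws on a backslash that does not start a known escape. */
    public static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (s.startsWith("\\\\", i)) {
                out.append('\\');
                i += 1; // skip the second backslash
            } else if (s.startsWith("\\u", i) && i + 6 <= s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5; // skip 'u' and the four hex digits
            } else {
                throw new IllegalArgumentException("Malformed escape at index " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "a\\b" + '\u0000' + "c";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```

    The escaped string still contains '<' and '&', which is fine when it is stored in a DOM text node, because the serializer escapes those on output.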

  • 2021-01-19 06:54

    As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution for my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used as follows:

        String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element element = document.createElement("element");
        element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
        document.appendChild(element);
        TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
        // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
        document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
        System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
        // prints true
    

    escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

    /**
     * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
     * DOM API already escapes predefined entities, such as {@code "}, {@code &},
     * {@code '}, {@code <} and {@code >} for
     * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
     * code points are ignored by this function. However, there are some other
     * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
     * invalid in their escaped form, such as {@code "&#0;"}.
     * <p>
     * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
     * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
     * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
     * {@code "#c;"}, where <code>c</code> is the Unicode code point.
     * 
     * @param string the <code>{@link String}</code> to be escaped
     * @return the escaped <code>{@link String}</code>
     * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
     */
    public static String escapeInvalidXmlCharacters(String string) {
        StringBuilder stringBuilder = new StringBuilder();
    
        for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
            codePoint = string.codePointAt(i);
    
            if (codePoint == '#') {
                stringBuilder.append("##");
            } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD || codePoint >= 0x20 && codePoint <= 0xD7FF || codePoint >= 0xE000 && codePoint <= 0xFFFD || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
                stringBuilder.appendCodePoint(codePoint);
            } else {
                stringBuilder.append("#" + codePoint + ";");
            }
        }
    
        return stringBuilder.toString();
    }
    
    /**
     * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
     * Reverses the effect of <code>{@link #escapeInvalidXmlCharacters(String)}</code>.
     * 
     * @param string the <code>{@link String}</code> to be unescaped
     * @return the unescaped <code>{@link String}</code>
     * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
     */
    public static String unescapeInvalidXmlCharacters(String string) {
        StringBuilder stringBuilder = new StringBuilder();
        boolean escaped = false;
    
        for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
            codePoint = string.codePointAt(i);
    
            if (escaped) {
                stringBuilder.appendCodePoint(codePoint);
                escaped = false;
            } else if (codePoint == '#') {
                StringBuilder intBuilder = new StringBuilder();
                int j;
    
                for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                    codePoint = string.codePointAt(j);
    
                    if (codePoint == ';') {
                        escaped = true;
                        break;
                    }
    
                    if (codePoint >= '0' && codePoint <= '9') {
                        intBuilder.appendCodePoint(codePoint);
                    } else {
                        break;
                    }
                }
    
                if (escaped) {
                    try {
                        codePoint = Integer.parseInt(intBuilder.toString());
                        stringBuilder.appendCodePoint(codePoint);
                        escaped = false;
                        i = j;
                    } catch (IllegalArgumentException e) {
                        codePoint = '#';
                        escaped = true;
                    }
                } else {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                stringBuilder.appendCodePoint(codePoint);
            }
        }
    
        return stringBuilder.toString();
    }
    

    Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
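
    As one possible simplification, the unescaping step can be sketched with a regular expression. The class name RegexUnescape is made up for this example, and the sketch assumes its input was produced by escapeInvalidXmlCharacters, so every '#' is either doubled or starts a "#number;" sequence:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUnescape {
    // Either a doubled '#', or '#' followed by decimal digits and ';' (group 1 holds the digits).
    private static final Pattern TOKEN = Pattern.compile("##|#(\\d{1,7});");

    public static String unescape(String escaped) {
        Matcher matcher = TOKEN.matcher(escaped);
        StringBuilder result = new StringBuilder(escaped.length());
        int last = 0;
        while (matcher.find()) {
            // Copy the literal text between the previous token and this one.
            result.append(escaped, last, matcher.start());
            if (matcher.group(1) == null) {
                result.append('#'); // "##" stands for a single '#'
            } else {
                result.appendCodePoint(Integer.parseInt(matcher.group(1)));
            }
            last = matcher.end();
        }
        result.append(escaped, last, escaped.length());
        return result.toString();
    }

    public static void main(String[] args) {
        String escaped = "text##text####text##0;text#0;text<text&text##";
        String expected = "text#text##text#0;text" + '\u0000' + "text<text&text#";
        System.out.println(unescape(escaped).equals(expected));
    }
}
```

    Because the matcher scans from left to right, "##0;" is read as an escaped '#' followed by the literal "0;", while "#0;" is read as code point 0. On input that was not produced by the escape function, a lone '#' is simply left in place.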
