Java DOM transforming and parsing arbitrary strings with invalid XML characters?

First of all, I want to mention that this is not a duplicate of How to parse invalid (bad / not well-formed) XML? because I don't have a given invalid (or not well-formed) X

3 answers

生来不讨喜

2021-01-19 06:36

I think the simplest solution is to use XML 1.1 (supported by org.w3c.dom), which you select with this XML declaration:

<?xml version="1.1" encoding="UTF-8" standalone="yes"?>

According to Wikipedia the only invalid characters in XML 1.1 are U+0000, surrogates, U+FFFE and U+FFFF

This code snippet ensures you always get a correct XML 1.1 string by omitting illegal characters (which might not be what you are looking for if you need the exact same string back):

public static String escape(String orig) {
    StringBuilder builder = new StringBuilder();

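    // Iterate by code point so that valid surrogate pairs (supplementary characters) are kept.
    for (int i = 0, c = 0; i < orig.length(); i += Character.charCount(c)) {
        c = orig.codePointAt(i);

        if (c == 0x0 || c == 0xfffe || c == 0xffff || (c >= 0xd800 && c <= 0xdfff)) {
            // Dropped: not allowed in XML 1.1 in any form (the surrogate range only matches unpaired surrogates here).
            continue;
        } else if (c == '\'') {
            builder.append("&apos;");
        } else if (c == '"') {
            builder.append("&quot;");
        } else if (c == '&') {
            builder.append("&amp;");
        } else if (c == '<') {
            builder.append("&lt;");
        } else if (c == '>') {
            builder.append("&gt;");
        } else if (c <= 0x1f || (c >= 0x7f && c <= 0x9f)) {
            // Control characters are written as numeric character references.
            builder.append("&#").append(c).append(';');
        } else {
            builder.appendCodePoint(c);
        }
    }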

    return builder.toString();
}
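
As a quick check of the XML 1.1 claim above, here is a minimal sketch that parses a document declared as version 1.1 and containing the character reference &#1;. The class name Xml11Check and the helper parseText are made up for this illustration, and the sketch assumes the JDK's default JAXP parser, which supports XML 1.1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the answer above.
final class Xml11Check {

    // Parses the given XML text and returns the text content of the root element.
    static String parseText(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Includes the SAXException raised for documents that are not well-formed.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml11 = "<?xml version=\"1.1\"?><element>a&#1;b</element>";
        String text = parseText(xml11);
        // Print the length and the numeric value of the middle character.
        System.out.println(text.length() + " " + (int) text.charAt(1));
    }
}
```

The same content under an XML 1.0 declaration is rejected by the parser, because &#1; is not a legal character reference in XML 1.0.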

遇见更好的自我

2021-01-19 06:44

One technique is to encode the whole string as Base64-encoded-UTF8.

But if the "special" characters are rare, that's a significant sacrifice in readability and file size.
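
A minimal sketch of the Base64 approach, using only java.util.Base64 from the standard library. The class name Base64Text and its two methods are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class illustrating the Base64 technique.
final class Base64Text {

    // Encodes the string as UTF-8 bytes, then as Base64 text that is safe in any XML text node.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode(): Base64 text back to UTF-8 bytes, then to a string.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a\u0000b";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

One caveat: getBytes replaces unpaired surrogates with ?, so strings that contain them do not round-trip this way.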

Another technique is to represent special characters as processing instructions, for example <?U 0000?> for codepoint 0.
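
The processing-instruction idea could look roughly like the sketch below. The class PiEscaping, its method names, the element name and the choice of four-digit uppercase hex are all assumptions made for illustration; only the <?U ...?> form comes from the suggestion above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class illustrating the processing-instruction technique.
final class PiEscaping {

    // True for code points allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Serializes the text as the content of <element>, one <?U hex?> per disallowed code point.
    static String toXml(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element element = doc.createElement("element");
            doc.appendChild(element);
            StringBuilder run = new StringBuilder();
            for (int i = 0, cp = 0; i < text.length(); i += Character.charCount(cp)) {
                cp = text.codePointAt(i);
                if (isXmlChar(cp)) {
                    run.appendCodePoint(cp);
                    continue;
                }
                if (run.length() > 0) {
                    // Flush the legal characters collected so far as one text node.
                    element.appendChild(doc.createTextNode(run.toString()));
                    run.setLength(0);
                }
                element.appendChild(doc.createProcessingInstruction("U", String.format("%04X", cp)));
            }
            if (run.length() > 0) {
                element.appendChild(doc.createTextNode(run.toString()));
            }
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML and rebuilds the original text from text nodes and <?U hex?> instructions.
    static String fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder text = new StringBuilder();
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE && n.getNodeName().equals("U")) {
                    text.appendCodePoint(Integer.parseInt(n.getNodeValue().trim(), 16));
                } else if (n.getNodeType() == Node.TEXT_NODE) {
                    text.append(n.getNodeValue());
                }
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A call such as PiEscaping.fromXml(PiEscaping.toXml(s)) is meant to return s again. Carriage returns are a known gap in this sketch, since an XML parser normalizes line endings in text nodes.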

Another would be to use backslash escaping, for example \u0000 for codepoint 0, and of course \\ for backslash itself. This has the advantage that you can probably find existing library routines that do this for you (for example JSON conversion libraries). I can't imagine why your requirements say you can't use such libraries; but if you really can't, then it's not hard to write the code yourself.
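
If no library is allowed, a hand-written version of the backslash scheme might look like the following sketch. The class BackslashEscaping and its method names are hypothetical, and the sketch assumes that unescape only ever receives strings produced by escape.

```java
// Hypothetical helper class illustrating backslash escaping for characters XML 1.0 cannot hold.
final class BackslashEscaping {

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                // The escape character itself is doubled.
                out.append("\\\\");
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // A valid surrogate pair is a legal supplementary character: copy both halves.
                out.append(c).append(s.charAt(++i));
            } else if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD)) {
                out.append(c);
            } else {
                // Everything else becomes a backslash, the letter u and four hex digits.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
            } else if (s.charAt(i + 1) == 'u') {
                // Decode the four hex digits that follow the letter u.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;
            } else {
                // A doubled backslash stands for one literal backslash.
                out.append(s.charAt(i + 1));
                i += 1;
            }
        }
        return out.toString();
    }
}
```

The escaped string contains only characters that are legal in XML 1.0, so it can be passed to createTextNode as usual and unescaped after parsing.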

抹茶落季

2021-01-19 06:54

As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have another solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.

    String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element element = document.createElement("element");
    element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
    document.appendChild(element);
    TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
    // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
    document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
    System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
    // prints true

escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "&#0;"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 * 
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD || codePoint >= 0x20 && codePoint <= 0xD7FF || codePoint >= 0xE000 && codePoint <= 0xFFFD || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }

    return stringBuilder.toString();
}

/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Reverses <code>{@link #escapeInvalidXmlCharacters(String)}</code>.
 * 
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            StringBuilder intBuilder = new StringBuilder();
            int j;

            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);

                if (codePoint == ';') {
                    escaped = true;
                    break;
                }

                if (codePoint >= '0' && codePoint <= '9') {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }

            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }

    return stringBuilder.toString();
}

Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
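
One possible shorter variant is sketched below. It keeps the same "##" and "#c;" format but uses a regular expression for the unescaping step. The class name CompactEscaping and its method names are made up for this sketch, and it has only been checked against strings produced by its own escape method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical compact variant of the two functions above, using the same escape format.
final class CompactEscaping {

    // Matches either a doubled '#' or '#', decimal digits and ';'.
    private static final Pattern ESCAPE = Pattern.compile("#(?:#|(\\d+);)");

    // True for code points allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String string) {
        StringBuilder out = new StringBuilder(string.length());
        string.codePoints().forEach(cp -> {
            if (cp == '#') {
                out.append("##");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Disallowed code point: '#', its decimal value, then ';'.
                out.append('#').append(cp).append(';');
            }
        });
        return out.toString();
    }

    static String unescape(String string) {
        Matcher matcher = ESCAPE.matcher(string);
        StringBuilder out = new StringBuilder(string.length());
        int last = 0;
        while (matcher.find()) {
            // Copy the unescaped text between the previous match and this one.
            out.append(string, last, matcher.start());
            if (matcher.group(1) == null) {
                out.append('#');
            } else {
                out.appendCodePoint(Integer.parseInt(matcher.group(1)));
            }
            last = matcher.end();
        }
        out.append(string, last, string.length());
        return out.toString();
    }
}
```

Because the matcher scans from left to right, a doubled "##" is always consumed before a following "0;" can be misread as an escape, which is the property the original loop enforces with its escaped flag.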
