Java DOM transforming and parsing arbitrary strings with invalid XML characters?

南笙 2021-01-19 06:02

First of all, I want to mention that this is not a duplicate of "How to parse invalid (bad / not well-formed) XML?" because I don't have a given invalid (or not well-formed) X

3 answers
  •  抹茶落季
    2021-01-19 06:54

    As @VGR and @kjhughes pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.

        String string = "text#text##text#0;text" + '\u0000' + "texttext##text####text##0;text#0;text<text&text##
        document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
        System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
        // prints true
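
    For comparison, the Base64 approach mentioned above can be sketched as follows. This is a minimal illustration, not code from the question: the class name Base64XmlText and the helpers encode, decode and roundTrip are hypothetical, and the document is serialized to an in-memory string instead of a file. The Base64 alphabet contains only characters that are valid in XML text, so the Text node never holds an invalid code point.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class Base64XmlText {

    // Encodes an arbitrary string as Base64 over its UTF-8 bytes.
    static String encode(String raw) {
        return Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode(String).
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    // Stores the string in a DOM Text node, serializes the document,
    // parses it again and returns the decoded text content.
    static String roundTrip(String raw) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.newDocument();
            Element root = document.createElement("root");
            root.setTextContent(encode(raw));
            document.appendChild(root);

            StringWriter xml = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(document), new StreamResult(xml));

            Document parsed = builder.parse(new InputSource(new StringReader(xml.toString())));
            return decode(parsed.getDocumentElement().getTextContent());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "text" + '\u0000' + "<text&#text";
        System.out.println(roundTrip(raw).equals(raw));
    }
}
```

    The trade-off is that the stored text is no longer human-readable and grows by roughly a third. Also note that String.getBytes does not guarantee a lossless round trip for unpaired surrogates, which the escaping approach handles explicitly.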
    

    escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

        /**
         * Escapes invalid XML Unicode code points in a {@link String}. The
         * DOM API already escapes predefined entities, such as {@code "}, {@code &},
         * {@code '}, {@code <} and {@code >} for
         * {@link org.w3c.dom.Text Text} nodes. Therefore, these Unicode
         * code points are ignored by this function. However, there are some other
         * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
         * invalid in their escaped form, such as {@code "&#0;"}.
         * <p>
         * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
         * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
         * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the {@link String}
         * {@code "#c;"}, where c is the decimal value of the Unicode code point.
         *
         * @param string the {@link String} to be escaped
         * @return the escaped {@link String}
         * @see {@link #unescapeInvalidXmlCharacters(String)}
         */
        public static String escapeInvalidXmlCharacters(String string) {
            StringBuilder stringBuilder = new StringBuilder();

            for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(i);

                if (codePoint == '#') {
                    // '#' is the escape character, so it is doubled.
                    stringBuilder.append("##");
                } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                        || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                        || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                        || (codePoint >= 0x10000 && codePoint <= 0x10FFFF)) {
                    // Valid XML character: copy unchanged.
                    stringBuilder.appendCodePoint(codePoint);
                } else {
                    // Invalid XML character: write "#c;" with c in decimal.
                    stringBuilder.append("#" + codePoint + ";");
                }
            }

            return stringBuilder.toString();
        }

        /**
         * Unescapes invalid XML Unicode code points in a {@link String}.
         * Reverses the effect of {@link #escapeInvalidXmlCharacters(String)}.
         *
         * @param string the {@link String} to be unescaped
         * @return the unescaped {@link String}
         * @see {@link #escapeInvalidXmlCharacters(String)}
         */
        public static String unescapeInvalidXmlCharacters(String string) {
            StringBuilder stringBuilder = new StringBuilder();
            boolean escaped = false;

            for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(i);

                if (escaped) {
                    // The previous '#' did not start a "#c;" sequence: emit the code
                    // point that follows it (for "##" this yields a single '#').
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                } else if (codePoint == '#') {
                    // Scan ahead for decimal digits terminated by ';'.
                    StringBuilder intBuilder = new StringBuilder();
                    int j;

                    for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                        codePoint = string.codePointAt(j);

                        if (codePoint == ';') {
                            escaped = true;
                            break;
                        }

                        if (codePoint >= '0' && codePoint <= '9') {
                            intBuilder.appendCodePoint(codePoint);
                        } else {
                            break;
                        }
                    }

                    if (escaped) {
                        try {
                            codePoint = Integer.parseInt(intBuilder.toString());
                            stringBuilder.appendCodePoint(codePoint);
                            escaped = false;
                            // Jump to the terminating ';'.
                            i = j;
                        } catch (IllegalArgumentException e) {
                            // Not a usable number: treat the '#' as an escape prefix.
                            codePoint = '#';
                            escaped = true;
                        }
                    } else {
                        codePoint = '#';
                        escaped = true;
                    }
                } else {
                    stringBuilder.appendCodePoint(codePoint);
                }
            }

            return stringBuilder.toString();
        }

    Note that these functions are probably inefficient and could be written more cleanly. Feel free to post suggestions for improving the code in the comments.
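
    One possible simplification, offered only as a sketch: because every '#' in escaped output is followed either by another '#' or by decimal digits and a ';', the unescaping step can be driven by a single regular expression. The class name XmlCharEscaper and the method names below are hypothetical, and unescape assumes its input was produced by the matching escape method; it does not try to reproduce the original function's behavior on malformed input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical compact variant of the two functions, for illustration only.
final class XmlCharEscaper {

    // Matches "##" (group 1 is null) or "#<decimal digits>;" (group 1 holds the digits).
    private static final Pattern ESCAPE = Pattern.compile("#(?:#|([0-9]+);)");

    // The Char production of XML 1.0, using the same ranges as the original code.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        s.codePoints().forEach(cp -> {
            if (cp == '#') {
                out.append("##");
            } else if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // append(int) writes the decimal value of the code point.
                out.append('#').append(cp).append(';');
            }
        });
        return out.toString();
    }

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous match and this one.
            out.append(s, last, m.start());
            if (m.group(1) == null) {
                out.append('#');
            } else {
                out.appendCodePoint(Integer.parseInt(m.group(1)));
            }
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "text#text##text#0;text" + '\u0000' + "text<text&text";
        System.out.println(unescape(escape(s)).equals(s));
    }
}
```

    Scanning left to right with non-overlapping matches is what keeps "##0;" (an escaped literal "#0;") distinct from "#0;" (an escaped NUL character).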
