First of all I want to mention that this is not a duplicate of How to parse invalid (bad / not well-formed) XML? because I don't have a given invalid (or not well-formed) X
As @VGR and @kjhughes pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have another solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.
String string = "text#text##text#0;text" + '\u0000' + "texttext##text####text##0;text#0;text<text&text##
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
// prints true
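The snippet above reads the escaped text back from test.xml. As a minimal, self-contained sketch of the same round trip done in memory, the following stores a string in a Text node, serializes it with the JDK's default Transformer and parses it again. The class name DomRoundTripSketch and the helper roundTrip are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomRoundTripSketch {
    // Hypothetical helper: puts the text into a root element, serializes the
    // document to a string, parses that string and returns the text content.
    static String roundTrip(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.newDocument();
            Element root = document.createElement("root");
            root.setTextContent(text);
            document.appendChild(root);

            // The serializer takes care of '<' and '&' in text nodes.
            StringWriter xml = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(document), new StreamResult(xml));

            Document parsed = builder.parse(new InputSource(new StringReader(xml.toString())));
            return parsed.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML round trip failed", e);
        }
    }

    public static void main(String[] args) {
        // An already escaped string: it contains only valid XML characters.
        String escaped = "text##text#0;text<text&text";
        System.out.println(roundTrip(escaped).equals(escaped));
    }
}
```

With the two functions from this answer in scope, the full check would be unescapeInvalidXmlCharacters(roundTrip(escapeInvalidXmlCharacters(string))).equals(string).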
The implementations of escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):
/**
 * Escapes invalid XML Unicode code points in a {@link String}. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for {@link org.w3c.dom.Text Text}
 * nodes. Therefore, these Unicode code points are ignored by this function.
 * However, there are some other invalid XML Unicode code points, such as
 * {@code '\u0000'}, which are even invalid in their escaped form, such as
 * {@code "&#0;"}.
 *
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the {@link String}
 * {@code "#c;"}, where c is the decimal value of the Unicode code point.
 *
 * @param string the {@link String} to be escaped
 * @return the escaped {@link String}
 * @see #unescapeInvalidXmlCharacters(String)
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || codePoint >= 0x20 && codePoint <= 0xD7FF
                || codePoint >= 0xE000 && codePoint <= 0xFFFD
                || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }
    return stringBuilder.toString();
}
/**
 * Unescapes invalid XML Unicode code points in a {@link String}.
 * Reverses the effect of {@link #escapeInvalidXmlCharacters(String)}.
 *
 * @param string the {@link String} to be unescaped
 * @return the unescaped {@link String}
 * @see #escapeInvalidXmlCharacters(String)
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    // Set when a '#' did not start a complete "#c;" escape: that '#' is dropped
    // and the next code point is copied literally (e.g. the second '#' of "##").
    boolean escaped = false;
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            // Look ahead for decimal digits terminated by ';'.
            StringBuilder intBuilder = new StringBuilder();
            int j;
            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);
                if (codePoint == ';') {
                    escaped = true;
                    break;
                }
                if (codePoint >= '0' && codePoint <= '9') {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }
            if (escaped) {
                try {
                    // A complete "#c;" escape: emit code point c and move i to the ';'.
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    // No valid number or code point (e.g. "#;"): handle like "##".
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }
    return stringBuilder.toString();
}
Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
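As one possible direction for such an improvement, here is a more compact sketch of the same escaping scheme. It is not a drop-in replacement: the names XmlEscapeSketch, isValidXmlCodePoint, escape and unescape are made up for this example, and unescape assumes its input was produced by escape, so unlike the version above it may throw on malformed input such as an overly long digit sequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEscapeSketch {
    // Matches either a doubled '#' or a decimal escape such as "#0;".
    private static final Pattern ESCAPE = Pattern.compile("##|#(\\d+);");

    // Same ranges as in the Javadoc above.
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp == '#') {
                sb.append("##");
            } else if (isValidXmlCodePoint(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // Invalid code point: write it as "#<decimal>;".
                sb.append('#').append(cp).append(';');
            }
        });
        return sb.toString();
    }

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String digits = m.group(1); // null when the match was "##"
            String replacement = digits == null
                    ? "#"
                    : new String(Character.toChars(Integer.parseInt(digits)));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "text#text##text#0;text" + (char) 0 + "text<text&text";
        System.out.println(unescape(escape(original)).equals(original));
    }
}
```

Scanning left to right is enough here because every '#' in escaped output starts either "##" or "#c;", so the two alternatives of the pattern cannot be confused with each other.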