First of all I want to mention that this is not a duplicate of How to parse invalid (bad / not well-formed) XML? because I don't have a given invalid (or not well-formed) X
I think the simplest solution is to use XML 1.1 (supported by org.w3c.dom) by starting the document with this XML declaration:
<?xml version="1.1" encoding="UTF-8" standalone="yes"?>
According to Wikipedia, the only invalid characters in XML 1.1 are U+0000, surrogates, U+FFFE and U+FFFF.
This code snippet ensures you always get a correct XML 1.1 string by omitting illegal characters (which might not be what you are looking for if you need the exact same string back):
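public static String escape(String orig) {
    StringBuilder builder = new StringBuilder();
    for (char c : orig.toCharArray()) {
        if (c == 0x0 || c == 0xfffe || c == 0xffff || (c >= 0xd800 && c <= 0xdfff)) {
            // Not representable in XML 1.1, so drop it. Note that this also drops
            // both halves of valid surrogate pairs (characters above U+FFFF).
            continue;
        } else if (c == '\'') {
            builder.append("&apos;");
        } else if (c == '"') {
            builder.append("&quot;");
        } else if (c == '&') {
            builder.append("&amp;");
        } else if (c == '<') {
            builder.append("&lt;");
        } else if (c == '>') {
            builder.append("&gt;");
        } else if (c <= 0x1f || (c >= 0x7f && c <= 0x9f)) {
            // C0 and C1 control characters are written as numeric character
            // references (XML 1.1 requires this for most of them).
            builder.append("&#" + ((int) c) + ";");
        } else {
            builder.append(c);
        }
    }
    return builder.toString();
}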
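To see the effect of the 1.1 declaration on the reading side, here is a minimal sketch. It assumes the JAXP parser bundled with the JDK, which as far as I know accepts XML 1.1 documents; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the answer above.
final class Xml11ControlCharDemo {

    /** Parses an XML document held in a string and returns the root element's text. */
    static String rootText(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // U+0001 is written as a numeric character reference, which XML 1.1
        // allows but XML 1.0 does not.
        String xml = "<?xml version=\"1.1\" encoding=\"UTF-8\"?><element>a&#1;b</element>";
        String text = rootText(xml);
        System.out.println("code point at index 1: " + (int) text.charAt(1));
    }
}
```

With version="1.0" in the declaration the same input should be rejected as not well-formed, which is the whole point of switching to 1.1.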
One technique is to encode the whole string as Base64-encoded-UTF8.
But if the "special" characters are rare, that's a significant sacrifice in readability and file size.
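A minimal sketch of the Base64 route, using only java.util.Base64 from the JDK (the class and method names are hypothetical). One caveat I would expect: unpaired surrogates cannot be encoded as UTF-8, so they will not survive this round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class for illustration.
final class Base64TextDemo {

    /** Encodes arbitrary text as Base64 over its UTF-8 bytes; the result is safe XML text. */
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(String). */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "text\u0000text<text&text";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

The encoded form contains only letters, digits, '+', '/' and '=', so it can go into a text node or attribute without any further escaping.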
Another technique is to represent special characters as processing instructions, for example <?U 0000?> for codepoint 0.
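A rough sketch of how the processing-instruction idea could look with the DOM API. The target name U and the four-digit hex format follow the example above; everything else (class name, method names, the choice of which code points to replace) is an assumption on my part.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical demo class for illustration.
final class PiEscapeDemo {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Appends text to parent, replacing each invalid code point by a U processing instruction. */
    static void appendEncoded(Document doc, Element parent, String text) {
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isXml10Char(cp)) {
                run.appendCodePoint(cp);
            } else {
                if (run.length() > 0) {
                    parent.appendChild(doc.createTextNode(run.toString()));
                    run.setLength(0);
                }
                parent.appendChild(doc.createProcessingInstruction("U", String.format("%04X", cp)));
            }
        }
        if (run.length() > 0) {
            parent.appendChild(doc.createTextNode(run.toString()));
        }
    }

    /** Rebuilds the original text from the text nodes and U processing instructions under parent. */
    static String decode(Element parent) {
        StringBuilder out = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof ProcessingInstruction
                    && "U".equals(((ProcessingInstruction) n).getTarget())) {
                String hex = ((ProcessingInstruction) n).getData().trim();
                out.appendCodePoint(Integer.parseInt(hex, 16));
            } else if (n instanceof Text) {
                out.append(n.getNodeValue());
            }
        }
        return out.toString();
    }

    /** Serializes the encoded text to an XML string, parses it again and decodes it. */
    static String roundTrip(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element element = doc.createElement("element");
        doc.appendChild(element);
        appendEncoded(doc, element, text);

        StringWriter xml = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(xml));

        Document parsed = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml.toString())));
        return decode(parsed.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String original = "text\u0000text<text&text";
        System.out.println(roundTrip(original).equals(original));
    }
}
```

The downside is that the value is no longer a single text node, so getTextContent() alone will not give you the string back; readers have to know about the convention.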
Another would be to use backslash escaping, for example \u0000 for codepoint 0, and of course \\ for backslash itself. This has the advantage that you can probably find existing library routines that do this for you (for example JSON conversion libraries). I can't imagine why your requirements say you can't use such libraries; but if you really can't, then it's not hard to write the code yourself.
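If you do write it yourself, it could look roughly like the sketch below. It only escapes the backslash and code points that are not valid XML 1.0 characters; the class and method names are hypothetical, and the error handling is deliberately minimal.

```java
// Hypothetical helper class for illustration.
final class BackslashEscapeDemo {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Doubles every backslash and writes invalid code points as backslash, 'u', four hex digits. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every invalid code point is below U+10000, so four hex digits suffice.
                out.append(String.format("\\u%04X", cp));
            }
        }
        return out.toString();
    }

    /** Reverses escape(String); throws IllegalArgumentException on malformed input. */
    static String unescape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= text.length()) {
                throw new IllegalArgumentException("dangling backslash at index " + i);
            }
            char next = text.charAt(i + 1);
            if (next == '\\') {
                out.append('\\');
                i += 1; // skip the second backslash
            } else if (next == 'u' && i + 6 <= text.length()) {
                out.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 5; // skip 'u' and the four hex digits
            } else {
                throw new IllegalArgumentException("bad escape at index " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "path\\to\u0000file";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```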
As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution for my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.
String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element element = document.createElement("element");
element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
document.appendChild(element);
TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
// creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
// prints true
escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):
/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "&#0;"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 *
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || codePoint >= 0x20 && codePoint <= 0xD7FF
                || codePoint >= 0xE000 && codePoint <= 0xFFFD
                || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }
    return stringBuilder.toString();
}
/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Makes <code>{@link #escapeInvalidXmlCharacters(String)}</code> undone.
 *
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    // true when the previous '#' was not a "#c;" sequence, so the next code point is taken literally
    boolean escaped = false;
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            // Look ahead for decimal digits terminated by ';'
            StringBuilder intBuilder = new StringBuilder();
            int j;
            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);
                if (codePoint == ';') {
                    escaped = true; // here: a complete "#c;" candidate was found
                    break;
                }
                if (codePoint >= '0' && codePoint <= '9') {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }
            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j; // continue after the ';'
                } catch (IllegalArgumentException e) {
                    // No digits or not a valid code point: treat '#' as an escape prefix
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }
    return stringBuilder.toString();
}
Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
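One possible shortening of the unescape side, as a sketch only: a regular expression that matches either "##" or "#c;" and replaces each match in a single pass. The class name is made up, and it assumes the input was produced by escapeInvalidXmlCharacters; malformed input (a lone '#', or a number that is not a valid code point) is handled differently from the hand-written loop above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative to the loop-based unescape function.
final class RegexUnescapeDemo {

    // Matches "##" (group 1 is null) or "#<decimal digits>;" (group 1 holds the digits).
    private static final Pattern ESCAPE = Pattern.compile("#(?:#|(\\d{1,7});)");

    static String unescapeInvalidXmlCharacters(String string) {
        Matcher matcher = ESCAPE.matcher(string);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the unescaped text between the previous match and this one.
            out.append(string, last, matcher.start());
            if (matcher.group(1) == null) {
                out.append('#');
            } else {
                // Throws IllegalArgumentException if the number is not a valid code point.
                out.appendCodePoint(Integer.parseInt(matcher.group(1)));
            }
            last = matcher.end();
        }
        out.append(string, last, string.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "text##text####text##0;text#0;text<text&text##";
        String expected = "text#text##text#0;text" + '\u0000' + "text<text&text#";
        System.out.println(unescapeInvalidXmlCharacters(escaped).equals(expected));
    }
}
```

Because matches are found left to right and never overlap, "###0;" is read as "##" followed by "#0;", which is what the escape function produces for '#' followed by U+0000.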