First of all, I want to mention that this is not a duplicate of How to parse invalid (bad / not well-formed) XML? because I don't have a given invalid (or not well-formed) X
I think the simplest solution is to use XML 1.1 (supported by org.w3c.dom) by starting the document with this XML declaration:
<?xml version="1.1" encoding="UTF-8" standalone="yes"?>
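To illustrate what the XML 1.1 declaration buys you, here is a minimal sketch that parses a document containing the character reference &#1; with the JDK's default DOM parser. The class name Xml11ParseDemo, the helper rootText and the element name value are made up for this example, and it assumes the default parser shipped with the JDK handles XML 1.1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class Xml11ParseDemo {
    // Hypothetical helper: parses the XML string and returns the root element's text content.
    static String rootText(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers do not need checked exceptions.
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // U+0001 is written as a character reference, which XML 1.1 permits.
        String xml = "<?xml version=\"1.1\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                   + "<value>a&#1;b</value>";
        String text = rootText(xml);
        System.out.println("length: " + text.length());
        System.out.println("code of middle char: " + (int) text.charAt(1));
    }
}
```

With version="1.0" in the declaration, the same input should be rejected, because &#1; does not refer to a legal XML 1.0 character.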
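According to Wikipedia, the only invalid characters in XML 1.1 are U+0000, surrogates, U+FFFE and U+FFFF. Most other control characters are allowed, but only when written as numeric character references such as &#1;.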
This code snippet ensures you always get a correct XML 1.1 string by omitting illegal chars (this might not be what you are looking for if you need the exact same string back):
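public static String escape(String orig) {
    StringBuilder builder = new StringBuilder();
    for (char c : orig.toCharArray()) {
        if (c == 0x0 || c == 0xfffe || c == 0xffff || (c >= 0xd800 && c <= 0xdfff)) {
            // Not allowed in XML 1.1 at all: drop it. This works per UTF-16 code unit,
            // so characters encoded as surrogate pairs are dropped as well.
            continue;
        } else if (c == '\'') {
            builder.append("&apos;");
        } else if (c == '"') {
            builder.append("&quot;");
        } else if (c == '&') {
            builder.append("&amp;");
        } else if (c == '<') {
            builder.append("&lt;");
        } else if (c == '>') {
            builder.append("&gt;");
        } else if (c <= 0x1f || (c >= 0x7f && c <= 0x9f)) {
            // Control characters: write them as numeric character references
            // (XML 1.1 requires this for most of them).
            builder.append("&#" + ((int) c) + ";");
        } else {
            builder.append(c);
        }
    }
    return builder.toString();
}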
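One thing to be aware of: because the method above looks at single UTF-16 chars, it also removes characters outside the Basic Multilingual Plane (emoji, for example), since those are stored as surrogate pairs. If that matters to you, a possible variant is to iterate over code points instead. This is only a sketch; the class name Xml11Escaper is made up, and it keeps valid pairs while still dropping unpaired surrogates.

```java
final class Xml11Escaper {
    // Variant of escape() that walks code points, so valid surrogate pairs survive.
    static String escape(String orig) {
        StringBuilder builder = new StringBuilder(orig.length());
        for (int i = 0; i < orig.length(); ) {
            int cp = orig.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0x0 || cp == 0xfffe || cp == 0xffff || (cp >= 0xd800 && cp <= 0xdfff)) {
                // codePointAt only returns a surrogate value when it is unpaired: drop it.
                continue;
            }
            switch (cp) {
                case '\'': builder.append("&apos;"); break;
                case '"':  builder.append("&quot;"); break;
                case '&':  builder.append("&amp;");  break;
                case '<':  builder.append("&lt;");   break;
                case '>':  builder.append("&gt;");   break;
                default:
                    if (cp <= 0x1f || (cp >= 0x7f && cp <= 0x9f)) {
                        // Control characters become numeric character references.
                        builder.append("&#").append(cp).append(';');
                    } else {
                        builder.appendCodePoint(cp);
                    }
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b & \"c\"")); // prints a&lt;b &amp; &quot;c&quot;
    }
}
```

The escaped result is meant to be used as text or attribute content inside a document that declares version="1.1"; under XML 1.0 the references to control characters would still be rejected.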