I am working with some XML that holds strings like:
This is a string
Some of the strings that I am passing to the
Another way to remove incorrect XML chars in C# is using XmlConvert.IsXmlChar (Available since .NET Framework 4.0)
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
or you may check that all characters are XML-valid:
public static bool CheckValidXmlChars(string content)
{
return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
.Net Fiddle
For example, the vertical tab symbol (\v
) is not valid for XML, it is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.
The only illegal characters are &
, <
and >
(as well as "
or '
in attributes).
They're escaped using XML entities, in this case you want &
for &
.
Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it.
In the Woodstox XML processor, invalid characters are classified by this code:
if (c == 0) {
throw new IOException("Invalid null character in text to output");
}
if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
if (mXml11) {
msg += " (can only be output using character entity)";
}
throw new IOException(msg);
}
if (c > 0x10FFFF) {
throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
}
/*
* Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
* Ascii)?
*/
if (c >= SURR1_FIRST && c <= SURR2_LAST) {
throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
}
throw new IOException("Invalid XML character (0x"+Integer.toHexString(c)+") in text to output");
Source from here