I am working with some XML that holds strings like:
This is a string
Some of the strings that I am passing to the
The list of valid characters is in the XML specification:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
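As a rough illustration of that production, here is a minimal C# sketch that checks code points against the Char ranges and drops anything outside them. The names XmlCharFilter, IsXmlChar and StripInvalid are made up for this example and are not part of any library.

```csharp
using System;
using System.Text;

// Hypothetical helper, written only to illustrate the Char production above.
public static class XmlCharFilter
{
    // True if the code point matches the XML 1.0 Char production.
    public static bool IsXmlChar(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);

    // Returns the input with every code point outside the Char production dropped.
    public static string StripInvalid(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A well-formed surrogate pair encodes a code point in [#x10000-#x10FFFF].
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (IsXmlChar(c))
            {
                // Lone surrogates fall in D800-DFFF and are rejected here.
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Dropping characters silently loses data, so whether to strip, replace, or reject the input is a design choice that depends on the application.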
In addition to potame's answer, here is what applies if you want to escape using a CDATA block.
If you put your text in a CDATA block then you don't need to use escaping. In that case you can use all characters allowed by the Char production of the XML specification:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note: On top of that, you're not allowed to use the ]]> character sequence, because it would match the end of the CDATA block.
If there are still invalid characters (e.g. control characters), then it's probably better to use some kind of encoding (e.g. base64).
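The two points above can be sketched in C# as follows. This is only an illustration: CDataHelper and its methods are hypothetical names, and the sketch assumes the text is encoded as UTF-8 before base64 encoding.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CDataHelper
{
    // Wraps text in a CDATA section. Any "]]>" in the text is split across
    // two adjacent sections so that it cannot terminate the block early.
    public static string Wrap(string text) =>
        "<![CDATA[" + text.Replace("]]>", "]]]]><![CDATA[>") + "]]>";

    // Encodes arbitrary text (including control characters) as base64.
    // Assumption: UTF-8 is the agreed byte encoding on both sides.
    public static string ToBase64(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

    // Reverses ToBase64.
    public static string FromBase64(string encoded) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(encoded));
}
```

Note that Wrap does not remove characters outside the Char production, so text containing control characters would still need base64 or some other encoding.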
ampersand (&) is escaped to &amp;
double quotes (") are escaped to &quot;
single quotes (') are escaped to &apos;
less than (<) is escaped to &lt;
greater than (>) is escaped to &gt;
In C#, use System.Security.SecurityElement.Escape or System.Net.WebUtility.HtmlEncode to escape these illegal characters.
string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A 0x09 0x0A <node>";
string encodedXml1 = System.Security.SecurityElement.Escape(xml);
string encodedXml2 = System.Net.WebUtility.HtmlEncode(xml);
encodedXml1
"&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A 0x09 0x0A &lt;node&gt;"
encodedXml2
"&lt;node&gt;it&#39;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A 0x09 0x0A &lt;node&gt;"
Another easy way to escape potentially unwanted XML / XHTML chars in C# is:
WebUtility.HtmlEncode(stringWithStrangeChars)
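The approach from "XmlWriter and lower ASCII characters" worked for me:
// Commas are left out of the character class on purpose: inside [...] a comma is a literal and would be stripped too.
string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");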
For XSL (on really lazy days) I use:
capture="&amp;(?!amp;)" capturereplace="&amp;amp;"
to translate all & signs that aren't followed by amp; into proper ones.
We have cases where the input is in CDATA but the system which uses the XML doesn't take it into account. It's a sloppy fix, beware...
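For comparison, the same negative-lookahead idea could look like this in C#. This is only a sketch; AmpersandFixer and EscapeBareAmpersands are hypothetical names chosen for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper mirroring the lookahead trick described above.
public static class AmpersandFixer
{
    // Matches "&" only when it is not immediately followed by "amp;".
    private static readonly Regex BareAmpersand = new Regex("&(?!amp;)");

    // Replaces every bare "&" with "&amp;" and leaves existing "&amp;" alone.
    public static string EscapeBareAmpersands(string input) =>
        BareAmpersand.Replace(input, "&amp;");
}
```

The sloppiness shows up with other entity references: an input such as &lt; is not followed by amp;, so it becomes &amp;lt; and is effectively double escaped.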