I have an XML document that's being generated from content that people are copy/pasting from all sorts of places (mostly Word documents, though).
It looks like
U+001A is not a valid character in an XML document. The valid range of characters according to the specification is:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
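For illustration, that production boils down to a simple range check; here is a minimal Python sketch (the function name is my own, not from the spec):

def is_xml_char(ch: str) -> bool:
    """Check whether a single character is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)           # tab, LF, CR
        or 0x20 <= cp <= 0xD7FF         # excludes the other C0 controls such as U+001A
        or 0xE000 <= cp <= 0xFFFD       # excludes the surrogate blocks and FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )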
Preprocess the original data yourself, encoding the Unicode characters that XML doesn't allow. For example, use HTML-style numeric character references:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<data><![CDATA[This is &#x1a; a test.]]></data>
</response>
You'll have to post-process the data when reading it back in, converting the character references back to the correct Unicode characters.
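A rough Python sketch of that round trip, assuming the offending characters are escaped as &#x...; references (the regexes and function names here are mine, not from the original answer):

import re

# Characters outside the XML 1.0 Char production quoted above.
_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
_REFERENCE = re.compile(r"&#x([0-9A-Fa-f]+);")

def encode_for_xml(text: str) -> str:
    # Replace each invalid character with an &#x...; style reference.
    return _INVALID.sub(lambda m: "&#x%x;" % ord(m.group(0)), text)

def decode_from_xml(text: str) -> str:
    # Turn the references back into the original characters after reading.
    return _REFERENCE.sub(lambda m: chr(int(m.group(1), 16)), text)

Note that this naive version will also decode any &#x...; sequence that was already present in the source text, so only apply it to data you encoded yourself.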
The character U+001A is in the C0 Controls block, which is mostly (including U+001A) forbidden in XML. It is improbable that anyone entered it on purpose. Rather, it was generated by software, probably when performing character code conversion and detecting malformed data (e.g., a byte that has no defined meaning in the source encoding). The U+001A (SUBSTITUTE) character is meant for exactly such use; see my quick reference to C0 Controls.
If you cannot track down and fix the conversion (or other process) that produced the U+001A, I’d suggest that you replace it with U+FFFD REPLACEMENT CHARACTER. It is, in a sense, the Unicode equivalent of U+001A. (The latter is of course in Unicode too, but disallowed in many contexts.) However, it has a visible glyph, though the glyph exists in only a few fonts; check the fileformat.info entry on U+FFFD for more info.
The point here is that changing U+001A to U+FFFD makes the data acceptable in XML while still retaining the information that a character-level data error occurred.
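In code, that substitution is a one-liner; a minimal Python sketch (the function name is hypothetical):

def substitute_marker(text: str) -> str:
    # Swap U+001A SUBSTITUTE for U+FFFD REPLACEMENT CHARACTER,
    # keeping a visible marker of the original conversion error.
    return text.replace("\u001A", "\uFFFD")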