Encoding for an XML document containing U+001A

醉酒成梦 2021-01-16 03:53

I have an XML document that's being generated from content that people copy and paste from all sorts of places (mostly Word documents, though).

It looks like

3 Answers
  • 2021-01-16 04:27

    U+001A is not a valid character in an XML 1.0 document. The valid range of characters according to the specification is:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
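
As a rough illustration of that production, here is a minimal Python sketch that checks code points against the same ranges. The helper names `is_xml10_char` and `find_invalid_xml_chars` are made up for this example and do not come from any library.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, excluding FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def find_invalid_xml_chars(text: str) -> list:
    """List (index, 'U+XXXX') pairs for every character XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]


print(find_invalid_xml_chars("This is \x1a a test."))
```

Running a check like this over the pasted content before serializing should show where the offending characters come from.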
    
  • 2021-01-16 04:30

    Preprocess the original data and encode the characters that XML does not allow yourself. For example, use an HTML-style numeric character reference inside a CDATA section, where it is kept as literal text rather than parsed as a reference:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
        <data> <![CDATA[This is &#x1a; a test.]]></data>        
    </response>
    

    You'll have to post-process the data when it is read back in, converting each reference back to the Unicode character it stands for.
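
The pre-processing and post-processing steps could look something like the following Python sketch. The function names `escape_invalid` and `unescape_references` are hypothetical, and the sketch assumes the original text never contains a literal `&#x...;` sequence of its own; if it can, the round trip becomes ambiguous and you would need a different marker.

```python
import re

# Matches every code point outside the XML 1.0 Char production.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Matches the hexadecimal references written by escape_invalid().
_HEX_REFERENCE = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def escape_invalid(text: str) -> str:
    """Pre-process: turn each disallowed character into literal '&#xNN;' text."""
    return _INVALID_XML10.sub(lambda m: f"&#x{ord(m.group()):x};", text)


def _decode(match: "re.Match") -> str:
    code_point = int(match.group(1), 16)
    # Leave anything beyond the Unicode range untouched instead of raising.
    return chr(code_point) if code_point <= 0x10FFFF else match.group()


def unescape_references(text: str) -> str:
    """Post-process: turn '&#xNN;' text back into the original character."""
    return _HEX_REFERENCE.sub(_decode, text)


escaped = escape_invalid("This is \x1a a test.")
print(escaped)
print(unescape_references(escaped) == "This is \x1a a test.")
```

The escaped string is what would go inside the CDATA section; the reader of the XML calls the second function after extracting the text.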

  • 2021-01-16 04:34

    The character U+001A is in the C0 Controls area, most of which (including U+001A) is forbidden in XML. It is improbable that anyone entered it on purpose. Rather, it was generated by software, probably when performing character code conversion and detecting malformed data (e.g., a byte that has no defined meaning in the source encoding). The U+001A (SUBSTITUTE) character is meant for such use; see my quick reference to C0 Controls.

    If you cannot track down and fix the conversion (or other process) that produced the U+001A, I’d suggest that you replace it with U+FFFD REPLACEMENT CHARACTER. It’s in a sense the Unicode equivalent of U+001A. (The latter is of course in Unicode too, but disallowed in many contexts.) Unlike U+001A, it has a visible glyph, although only a few fonts provide it; check the fileformat.info entry on U+FFFD for more info.

    The point here is that changing U+001A to U+FFFD makes the data acceptable in XML while still recording that a character-level data error occurred.
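
A minimal Python sketch of that replacement, assuming you want to treat every character outside the XML 1.0 range the same way as U+001A. The helper name `replace_invalid` is invented for this example; the `<data>` element name is borrowed from the other answer.

```python
import re
from xml.etree import ElementTree as ET

# Matches every code point outside the XML 1.0 Char production.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_invalid(text: str, replacement: str = "\uFFFD") -> str:
    """Swap each disallowed character for U+FFFD REPLACEMENT CHARACTER."""
    return _INVALID_XML10.sub(replacement, text)


cleaned = replace_invalid("This is \x1a a test.")

# Serialize the cleaned text and parse it back to confirm it is well-formed.
data = ET.Element("data")
data.text = cleaned
xml_bytes = ET.tostring(data, encoding="utf-8")
print(ET.fromstring(xml_bytes).text == cleaned)
```

Unlike the escape-and-restore approach, this is lossy on purpose: the original code point is gone, but the marker that something was wrong survives.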
