Encoding for an XML document containing U+001A

醉酒成梦 2021-01-16 03:53

I have an XML document that's being generated from content that people copy and paste from all sorts of places (mostly Word documents, though).

It looks like

3 answers
  •  隐瞒了意图╮
    2021-01-16 04:34

    The character U+001A is in the C0 Controls area, most of which (including U+001A) is forbidden in XML. It is improbable that anyone entered it on purpose. Rather, it was generated by software, probably when performing character code conversion and detecting malformed data (e.g., a byte that has no defined meaning in the source encoding). The U+001A (SUBSTITUTE) character is meant for such use; see my quick reference to C0 Controls.
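
    To see which characters in a piece of pasted text fall outside what XML 1.0 permits, something like the following sketch may help. It encodes the `Char` production from the XML 1.0 specification as a negated character class; the names `XML10_FORBIDDEN` and `find_forbidden` and the sample string are made up for illustration.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_FORBIDDEN = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_forbidden(text):
    """Return (offset, 'U+XXXX') pairs for characters XML 1.0 does not allow."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in XML10_FORBIDDEN.finditer(text)
    ]

# Hypothetical pasted text: contains U+001A (forbidden) and a tab (allowed).
sample = "caf\u001a and a tab\t"
print(find_forbidden(sample))
```

    Reporting the offsets rather than silently stripping the characters makes it easier to trace which upstream conversion step produced them.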

    If you cannot track down and fix the conversion (or other process) that produced the U+001A, I’d suggest that you replace it with U+FFFD REPLACEMENT CHARACTER. It is, in a sense, the Unicode equivalent of U+001A. (The latter is of course in Unicode too, but disallowed in many contexts.) Unlike U+001A, however, it has a visible glyph, though only a few fonts include that glyph; check the fileformat.info entry on U+FFFD for more info.

    The point here is that changing U+001A to U+FFFD makes the data acceptable in XML while still retaining the information that a character-level data error occurred.
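
    A minimal sketch of that replacement, assuming the generated XML is available as a Python string before it is written out. The function name `replace_substitute` and the sample document are hypothetical; the parse at the end uses the standard library's `xml.etree.ElementTree` only as a well-formedness check.

```python
import xml.etree.ElementTree as ET

SUBSTITUTE = "\u001a"    # U+001A SUBSTITUTE, not allowed in XML 1.0
REPLACEMENT = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER, allowed in XML

def replace_substitute(text):
    """Swap every U+001A for U+FFFD, leaving all other characters alone."""
    return text.replace(SUBSTITUTE, REPLACEMENT)

# Hypothetical document in which a conversion step left a U+001A behind.
raw = "<note>character lost here: \u001a</note>"
cleaned = replace_substitute(raw)

# The cleaned string should now parse; ascii() avoids console encoding issues.
root = ET.fromstring(cleaned)
print(ascii(root.text))
```

    This only handles U+001A; other forbidden control characters, if present, would need their own handling.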
