Encoding for an XML document containing U+001A


I have an XML document that's being generated from content that people copy and paste from all sorts of places (mostly Word documents, though).

It looks like


3 Answers

Happy的楠姐

2021-01-16 04:27
U+001A is not a valid character in an XML 1.0 document. The valid range of characters according to the XML 1.0 specification is:
```
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
```
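As a rough illustration of that production, the following Python sketch scans a string for characters outside the allowed ranges. The function name `find_invalid_xml_chars` is made up for this example, and the character class is simply the complement of the `Char` rule quoted above.

```python
import re

# Complement of the XML 1.0 Char production: anything NOT in
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, 'U+XXXX') pairs for characters XML 1.0 forbids."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in _INVALID_XML_CHARS.finditer(text)
    ]


if __name__ == "__main__":
    # Hypothetical pasted content containing a SUBSTITUTE character.
    print(find_invalid_xml_chars("This is \x1a a test."))
```

Running a check like this on the pasted content before serializing would show where the offending characters sit.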
清酒与你

2021-01-16 04:30
Preprocess the original data yourself, encoding the Unicode characters that XML documents do not support. For example, write them as HTML-style numeric character references inside a CDATA section:
```
<?xml version="1.0" encoding="UTF-8"?>
<response>
    <data> <![CDATA[This is &#x1a; a test.]]></data>        
</response>
```
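Because the reference sits inside a CDATA section, the XML parser passes it through as literal text rather than expanding it. You'll have to post-process the data when it is read back in to convert the reference back to the correct Unicode character.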
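A minimal Python sketch of that round trip might look like the following. The helper names `escape_forbidden` and `unescape_forbidden` are hypothetical, and the sketch assumes the real data never contains a literal `&#x...;` sequence of its own; if it can, the reverse step would wrongly decode it.

```python
import re

# Characters outside the XML 1.0 Char production.
_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Hexadecimal references of the form &#x1a; left as literal text.
_ESCAPED = re.compile(r"&#x([0-9A-Fa-f]{1,6});")


def escape_forbidden(text):
    """Pre-process: rewrite each forbidden character as literal '&#xHH;' text."""
    return _INVALID.sub(lambda m: f"&#x{ord(m.group()):x};", text)


def unescape_forbidden(text):
    """Post-process: turn '&#xHH;' text back into the character it names."""
    def decode(match):
        code = int(match.group(1), 16)
        # Leave out-of-range values untouched instead of raising.
        return chr(code) if code <= 0x10FFFF else match.group(0)

    return _ESCAPED.sub(decode, text)


if __name__ == "__main__":
    original = "This is \x1a a test."
    stored = escape_forbidden(original)
    print(stored)
    print(unescape_forbidden(stored) == original)
```

The escaped string is what would be placed inside the CDATA section; the reader applies `unescape_forbidden` after parsing.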
隐瞒了意图╮

2021-01-16 04:34

The character U+001A is in the C0 Controls area, most of which (including U+001A) is forbidden in XML. It is improbable that anyone entered it on purpose. Rather, it was generated by software, probably when performing character code conversion and detecting malformed data (e.g., a byte that has no defined meaning in the source encoding). The U+001A (SUBSTITUTE) character is meant for such use; see my quick reference to C0 Controls.

If you cannot track down and fix the conversion (or other process) that produced the U+001A, I’d suggest that you replace it with U+FFFD REPLACEMENT CHARACTER. It’s in a sense the Unicode equivalent of U+001A. (The latter is of course in Unicode too, but disallowed in many contexts.) Unlike U+001A, however, it has a visible glyph, though the glyph exists in only a few fonts; check the fileformat.info entry on U+FFFD for more info.

The point here is that changing U+001A to U+FFFD makes the data acceptable in XML while still retaining the information that a character-level data error occurred.
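In Python, that substitution could be sketched roughly as below. The names `replace_substitute` and `build_response` are invented for illustration, the `<response>`/`<data>` layout is borrowed from the other answer's example, and the standard-library `xml.etree.ElementTree` module is used only to show that the result parses.

```python
import xml.etree.ElementTree as ET


def replace_substitute(text):
    """Swap U+001A SUBSTITUTE for U+FFFD REPLACEMENT CHARACTER."""
    return text.replace("\u001a", "\ufffd")


def build_response(text):
    """Serialize the cleaned text into a small <response><data> document."""
    root = ET.Element("response")
    data = ET.SubElement(root, "data")
    data.text = replace_substitute(text)
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    xml_bytes = build_response("This is \x1a a test.")
    # Parsing succeeds because U+FFFD is inside the allowed Char range.
    parsed = ET.fromstring(xml_bytes)
    print(parsed.find("data").text)
```

This handles only U+001A; other forbidden control characters would need their own policy.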
