XML Spec and UTF-16

Submitted by 雨燕双飞 on 2019-12-08 17:19:00

Question


Section 4.3.3 and Appendix F of the XML 1.0 spec discuss UTF-16, the byte order mark (BOM) in UTF-16 encoded data streams, and the XML encoding declaration. From those sections, it would seem that a BOM is required in UTF-16 documents. Yet the summary chart in Appendix F includes a scenario in which UTF-16 input has no BOM but does have an XML declaration. And according to Section 4.3.3, a UTF-16 encoded document does not require an encoding declaration (so the XML declaration itself is optional in that case).

Given this information, is a UTF-16 XML document that has neither a BOM nor an XML declaration, and for which no encoding information is provided externally, considered well-formed if the rest of the document is?


Answer 1:


From the Unicode 6.2 specification (page 99):

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

So Unicode itself does not require a BOM in a UTF-16 document. But a "higher-level protocol", such as the XML specification, may impose its own rules on UTF-16 XML documents without a BOM.
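As a minimal sketch of the Unicode rule quoted above (not of any XML parser's behavior), the following Python function decodes a byte string as the UTF-16 encoding scheme: it honors a BOM if one is present and otherwise assumes big-endian. The function name `decode_utf16_scheme` is made up for this illustration.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes as the UTF-16 encoding scheme (hypothetical helper).

    A leading BOM selects the byte order; without one, big-endian is
    assumed, as the quoted Unicode text says for the no-protocol case.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return data.decode("utf-16-be")  # no BOM: big-endian by default


sample = "<a/>"
# A BOM makes little-endian data decodable.
print(decode_utf16_scheme(codecs.BOM_UTF16_LE + sample.encode("utf-16-le")))
# Big-endian data needs no BOM under this rule.
print(decode_utf16_scheme(sample.encode("utf-16-be")))
```

Little-endian data without a BOM would be silently misread by this rule, which is one reason a higher-level protocol might want to be stricter.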

Section 4.3.3 in the XML 1.0 specification says:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).

Let's get back to the above later. Appendix F describes approaches for detecting the character encoding when a BOM isn't present. But I don't think that appendix is relevant to your question: you're asking whether a UTF-16 XML document without a BOM and without an XML declaration is "well-formed", and Appendix F is a non-normative part of the specification.

So, going back to the specification, a document is well-formed if "Taken as a whole, it matches the production labeled document." (Section 2.1). Reviewing the document production shows that the XML declaration is optional (this is also mentioned in Section 2.8). So it's possible to have a well-formed document without an XML declaration; this answers half of your question.
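For reference, the kind of detection Appendix F describes can be sketched as a lookup on the first bytes of the entity. This is only an illustration covering a subset of the Appendix F table (the EBCDIC and unusual UCS-4 byte orders are left out), and the function name `sniff_xml_encoding` and its return labels are invented here. Note that the no-BOM rows depend on the document starting with `<?`, that is, on an XML declaration being present.

```python
from typing import List, Tuple

# (leading bytes, label) pairs, a subset of the Appendix F table.
# Order matters: FF FE 00 00 (UCS-4) must be tried before FF FE (UTF-16).
_SIGNATURES: List[Tuple[bytes, str]] = [
    (b"\x00\x00\xfe\xff", "ucs-4-be (BOM)"),
    (b"\xff\xfe\x00\x00", "ucs-4-le (BOM)"),
    (b"\xfe\xff", "utf-16-be (BOM)"),
    (b"\xff\xfe", "utf-16-le (BOM)"),
    (b"\xef\xbb\xbf", "utf-8 (BOM)"),
    # No BOM: these patterns are "<?" or "<?xm" in the respective encodings.
    (b"\x00\x3c\x00\x3f", "utf-16-be family (no BOM)"),
    (b"\x3c\x00\x3f\x00", "utf-16-le family (no BOM)"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible (no BOM)"),
]


def sniff_xml_encoding(head: bytes) -> str:
    """Guess an encoding family from the first bytes (hypothetical helper)."""
    for signature, label in _SIGNATURES:
        if head.startswith(signature):
            return label
    # No row matched: UTF-8 without a declaration, or unrecognizable data.
    return "other"


# UTF-16BE with an XML declaration but no BOM is recognizable...
print(sniff_xml_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-be")))
# ...but without the declaration, nothing in this table matches.
print(sniff_xml_encoding("<a/>".encode("utf-16-be")))
```

The second call is exactly the case the question asks about: with neither a BOM nor a declaration, the table-based approach has nothing to match on.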

The other half is whether a UTF-16 XML document without an XML declaration and also without a BOM can still be well-formed. Section 4.3.3 says (the final clause is the relevant one):

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

Based on this, in the absence of external encoding information, a UTF-16 XML document without a BOM and without an encoding declaration (which is part of the XML declaration) is not a well-formed document, because a fatal error violates well-formedness (see the definition of well-formedness constraint in Section 1.2). This also matches what Section 4.3.3 said earlier about the requirement of a BOM for UTF-16.
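The rule from Section 4.3.3 quoted above can be written out as a small decision function. This is a sketch of the reasoning only: the function name `is_fatal_per_4_3_3` and its parameters are hypothetical, and encoding names are simply compared case-insensitively as plain labels.

```python
from typing import Optional


def is_fatal_per_4_3_3(
    actual_encoding: str,
    declared_encoding: Optional[str],
    has_bom: bool,
    external_encoding: Optional[str],
) -> bool:
    """Return True if the quoted Section 4.3.3 rule makes this a fatal error.

    All parameters are assumptions of this sketch: they describe facts
    about an entity rather than being derived from its bytes.
    """
    if external_encoding is not None:
        # The rule only applies when no transport protocol supplies info.
        return False
    if declared_encoding is not None:
        # An encoding declaration must name the encoding actually used.
        return declared_encoding.lower() != actual_encoding.lower()
    if not has_bom:
        # Neither BOM nor declaration: only UTF-8 is allowed.
        return actual_encoding.lower() != "utf-8"
    return False


# The case from the question: UTF-16, no BOM, no declaration, no external info.
print(is_fatal_per_4_3_3("UTF-16", None, False, None))  # True
# The same document with a BOM is not flagged by this rule.
print(is_fatal_per_4_3_3("UTF-16", None, True, None))   # False
```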



Source: https://stackoverflow.com/questions/20692447/xml-spec-and-utf-16
