Looking at the XML header
Am I right to state that the encoding
XML parsers are required to support UTF-8 and UTF-16; support for any other encoding is optional. The parser first looks for a Byte Order Mark (BOM), which identifies UTF-16, UTF-32, or UTF-8 (where the BOM is only a signature, since UTF-8 has no byte order). If there is no BOM, it checks whether the first bytes look like "<?xml" in UTF-32, UTF-16, or an ASCII-compatible encoding such as UTF-8, ASCII, or a single-byte encoding. Only then can it read the encoding attribute, and it restarts decoding with the declared encoding if necessary.
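As a rough illustration of the BOM step, here is a minimal Python sketch. The function name sniff_bom and the returned codec names are choices made for this example, not something a particular parser exposes.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM, so checking order matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (codec_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would skip bom_length bytes and decode the rest with the returned codec; when the result is (None, 0), it has to fall back to inspecting the "<?xml" bytes themselves.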
I think in principle you might have a point that the encoding statement comes 'late' in the file. However, the whole first line uses only basic characters. AFAIK, those are the same in almost all encodings, so whatever you decode it as, it'll read <?xml ... ?> anyway.
Whatever comes after that, however, could matter. For example, text in a CDATA section could be encoded in a Cyrillic encoding.
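To make the CDATA case concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The sample document, the choice of windows-1251, and the helper name parse_text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The declaration is pure ASCII; only the CDATA content needs the
# Cyrillic code page named in the encoding attribute.
DOC = (
    '<?xml version="1.0" encoding="windows-1251"?>'
    "<note><![CDATA[Привет]]></note>"
).encode("cp1251")


def parse_text(data: bytes) -> str:
    """Parse raw bytes and return the text of the root element."""
    return ET.fromstring(data).text


print(parse_text(DOC))
```

The parser receives bytes, not text, so it has to rely on the declared encoding to turn the CDATA bytes back into Cyrillic characters.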
You're quite right that it looks like an odd design. It only works because the XML declaration uses only ASCII characters, and nearly all encodings are supersets of ASCII. If you're prepared to accept something that isn't, for example EBCDIC, you can check whether the file starts with whatever the EBCDIC representation of "<?xml" is. This means you're relying on the general level of redundancy in the header of the file, rather than purely on the encoding attribute itself. Like many things in XML, it's pragmatic and works, but isn't particularly elegant.
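A hedged sketch of that check in Python, assuming code page 037 as the EBCDIC flavour (there are several, and this one is just the example picked here); the function name declaration_family is likewise made up for illustration.

```python
ASCII_PREFIX = b"<?xml"
# cp037 is one common EBCDIC code page; other flavours may encode
# punctuation differently.
EBCDIC_PREFIX = "<?xml".encode("cp037")


def declaration_family(data: bytes) -> str:
    """Classify a document by the byte pattern of its XML declaration."""
    if data.startswith(ASCII_PREFIX):
        return "ascii-compatible"
    if data.startswith(EBCDIC_PREFIX):
        return "ebcdic"
    return "unknown"
```

Once the family is known, the rest of the declaration can be decoded well enough to read the encoding attribute, which then says which member of the family is actually in use.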
As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.
However, there is a heuristic that can easily get you close enough to the "real" encoding to allow you to read the encoding attribute. This works because the <?xml part by definition can only contain characters in the ASCII range (however they are encoded).
The XML standard even describes the exact process used to find out the encoding.
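The following Python sketch is a simplified take on that process for documents without a BOM, not a full implementation of it. The byte patterns follow the autodetection table in the spec's non-normative appendix; the codec names, the 200-byte window, the regular expression, and the function name declared_encoding are assumptions made for this example.

```python
import re

# How "<?xm" looks at the start of a BOM-less document in each family.
_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),
    (b"\x4c\x6f\xa7\x94", "cp037"),  # one EBCDIC flavour
]

_DECL = re.compile(
    r"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(data: bytes) -> str:
    """Guess the encoding family, then read the encoding attribute."""
    for prefix, family in _PATTERNS:
        if data.startswith(prefix):
            break
    else:
        return "utf-8"  # no XML declaration at all
    # The declaration is ASCII-only, so decoding with the family codec is
    # enough to read it; bytes that do not fit are simply replaced.
    head = data[:200].decode(family, errors="replace")
    match = _DECL.match(head)
    if match:
        return match.group(1)
    return "utf-8" if family == "ascii" else family
```

For example, declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') returns "ISO-8859-1"; a real parser would then restart decoding with that encoding.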
And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec to find out that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it won't make a difference which one it is, but for the rest of the document, it can make a huge difference.
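A quick way to see this is to decode the same bytes with several ASCII-compatible codecs. This Python snippet is only an illustration; the sample text and the three codecs are arbitrary picks.

```python
PROLOG = b'<?xml version="1.0"?>'
# Body bytes written by a producer that used KOI8-R.
BODY = "Привет".encode("koi8_r")


def decode_as(data: bytes, codec: str) -> str:
    """Decode the whole document with one candidate codec."""
    return data.decode(codec)


for codec in ("koi8_r", "cp1251", "latin_1"):
    # The prolog comes out identical each time; the body does not.
    print(codec, decode_as(PROLOG + BODY, codec))
```

All three codecs decode without error and agree on the prolog, so nothing but the encoding label tells the parser which reading of the body is the intended one.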
Regarding mislabeled XML files: yes, it's easy to produce those. However, the XML spec clearly specifies that such files are not well-formed and as such are not correct XML. Incorrect encodings must be reported as an error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
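The "as long as they can be detected" caveat can be shown with two mislabeled documents, again using Python's xml.etree.ElementTree. Both sample documents and the helper name try_parse are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Labeled UTF-8, but 0xE9 followed by "<" is not valid UTF-8.
MISLABELED_DETECTABLE = b'<?xml version="1.0" encoding="UTF-8"?><a>\xe9</a>'

# Labeled ISO-8859-1, but written as windows-1251. Every byte is valid
# ISO-8859-1, so the parser has no way to notice the mistake.
MISLABELED_SILENT = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?><a>'
    + "Привет".encode("cp1251")
    + b"</a>"
)


def try_parse(data: bytes):
    """Return the root element's text, or None if the parser rejects it."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None
```

The first document is rejected with a parse error, while the second parses without complaint and yields the wrong characters, which is why the responsibility ends up with the producer.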