Why does ElementTree reject UTF-16 XML declarations with “encoding incorrect”?

面向向阳花 2021-02-13 17:05

In Python 2.7, when passing a unicode string to ElementTree's fromstring() method that has encoding="UTF-16" in the XML declaration, I'm getting a

1 Answer
  • 2021-02-13 17:43

    I won't try to justify the behavior, only explain why it happens with the code as written.

    In short: expat, the XML parser that Python uses, operates on bytes, not unicode characters. You must call .encode('utf-16-be') or .encode('utf-16-le') on the string before you pass it to ElementTree.fromstring:

    ElementTree.fromstring(data.encode('utf-16-be'))
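
    To make that one-liner concrete, here is a minimal sketch. The sample document and the helper name parse_declared_utf16 are assumptions made for illustration; they do not come from the question:

```python
from xml.etree import ElementTree

# Hypothetical sample: a unicode string whose XML declaration says UTF-16.
data = u'<?xml version="1.0" encoding="UTF-16"?><root>\u2163</root>'


def parse_declared_utf16(text, codec='utf-16-be'):
    """Encode text to UTF-16 bytes so they match the declaration, then parse."""
    return ElementTree.fromstring(text.encode(codec))


root = parse_declared_utf16(data)
print(root.tag)
```

    The point of the helper is only that the parser receives bytes whose actual encoding agrees with what the declaration claims.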
    

    Proof: ElementTree.fromstring eventually calls down into pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:

    static PyObject *
    xmlparse_Parse(xmlparseobject *self, PyObject *args)
    {
        char *s;
        int slen;
        int isFinal = 0;
    
        if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
            return NULL;
    
        return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
    }
    

    So the unicode parameter you passed in gets converted using s#. The docs for PyArg_ParseTuple say:

    s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)] This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
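
    The implicit conversion described in that passage can be imitated by hand. This is a rough sketch, not what pyexpat literally executes: the helper name implicit_bytes is hypothetical, and 'ascii' is hard-coded on the assumption that it is the default encoding, as it normally is on Python 2.7:

```python
# Same sample document as in the check that follows.
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'


def implicit_bytes(text, default_encoding='ascii'):
    """Imitate the implicit step: encode text with the default codec."""
    return text.encode(default_encoding)


try:
    implicit_bytes(data)
    failed_at = None
except UnicodeEncodeError as exc:
    # Index of the first character the ASCII codec cannot represent.
    failed_at = exc.start

print(failed_at)  # prints 44
```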

    Let's check this out:

    from xml.etree import ElementTree
    data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
    print ElementTree.fromstring(data)
    

    gives the error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)
    

    which means that when you specified encoding="utf-8", you were just getting lucky: your input contained no non-ASCII characters when the unicode string was implicitly encoded to ASCII. If you add the following before you parse, UTF-8 works as expected with that example:

    import sys
    reload(sys).setdefaultencoding('utf8')
    

    However, setting the default encoding to 'utf-16-be' or 'utf-16-le' does not work, since the Python parts of ElementTree do direct string comparisons, which start to fail in UTF-16 land.
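
    For the UTF-8 example, the same explicit-encode approach avoids touching the process-wide default encoding at all. A minimal sketch, assuming the caller already knows the declaration says UTF-8:

```python
from xml.etree import ElementTree

data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'

# Encode explicitly so the parser receives UTF-8 bytes that match the
# declaration; the interpreter's default encoding is never consulted.
root = ElementTree.fromstring(data.encode('utf-8'))
print(root.tag)
```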
