Short answer: use page.content, not page.text.
From http://lxml.de/parsing.html#python-unicode-strings :
the parsers in lxml.etree can handle unicode strings straight away ... This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding
From http://docs.python-requests.org/en/latest/user/quickstart/#response-content :
Requests will automatically decode content from the server [as r.text]. ... You can also access the response body as bytes [as r.content].
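So you see, both page.text (from requests) and lxml.etree want to decode the UTF-8 bytes to unicode. But if we let page.text do the decoding, the encoding declaration inside the XML file becomes a lie: it still names a byte encoding, while the string lxml receives has already been decoded.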
So, let's use page.content, which does no decoding. That way lxml receives the raw bytes, which are consistent with the encoding the file declares.
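A minimal sketch of the difference, assuming lxml is installed. To avoid a network call, the variables raw and text are hypothetical stand-ins for page.content and page.text, and the XML document is made up for illustration.

```python
from lxml import etree

# Stand-in for page.content: the raw bytes as the server sent them
# (hypothetical document, not taken from the question).
raw = '<?xml version="1.0" encoding="utf-8"?><root>caf\u00e9</root>'.encode("utf-8")

# Stand-in for page.text: the same body after it has been decoded to unicode.
text = raw.decode("utf-8")

# Bytes input: lxml reads the encoding declaration and does the decoding itself.
root = etree.fromstring(raw)
parsed = root.text

# Unicode input that still carries an encoding declaration is refused by lxml.etree.
try:
    etree.fromstring(text)
    rejected = False
except ValueError:
    rejected = True
```

With a real response object, the equivalent call would be etree.fromstring(page.content).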