Question
I have a strange problem with lxml in the deployed version of my Django application. I use lxml to parse an HTML page that I fetch from my server. This works perfectly well on the development server on my own computer, but for some reason it raises UnicodeDecodeError on the server:
('utf8', "\x85why hello there!", 0, 1, 'unexpected code byte')
I have made sure that Apache (with mod_python) runs with LANG='en_US.UTF-8'.
I've tried googling for this problem and tried different approaches to decoding the string correctly, but I can't figure it out.
In your answer, you may assume that my string is called hello or something.
Answer 1:
"\x85why hello there!" is not a valid UTF-8 encoded string: \x85 is a continuation byte and cannot start a UTF-8 sequence. You should decode the web page yourself before passing it to lxml. Check which encoding it uses by looking at the HTTP headers when you fetch the page; you may find the problem there.
Answer 2:
Doesn't syntax such as u"\x85why hello there!" help?
You may find the following resources from the official Python documentation helpful:
- Python introduction, Unicode Strings
- Sequence Types — str, unicode, list, tuple, buffer, xrange
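To illustrate what the u prefix changes, here is a small sketch (the variable names are made up for illustration). In a unicode literal, \x85 denotes the code point U+0085 rather than the byte 0x85, so the literal is already text and needs no decoding. Whether U+0085 is the character the page author intended depends on the page's real encoding, which this sketch does not determine.

```python
# Byte string: the raw bytes as they might arrive from the server.
raw = b"\x85why hello there!"

# Unicode string: here \x85 is the code point U+0085, not a byte.
text = u"\x85why hello there!"

first_code_point = ord(text[0])     # numeric value of the first character
utf8_bytes = text.encode("utf-8")   # U+0085 becomes two bytes in UTF-8
```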
Answer 3:
Since modifying site.py is not an ideal solution, try this at the start of your program (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Source: https://stackoverflow.com/questions/808275/decoding-problems-in-django-and-lxml