Question
I am completely confused by GAE. I have a script that makes a POST request (using urlfetch from the Google App Engine API); the response is an HTML page encoded in cp1251.
Then I decode it using .decode('cp1251') and parse it with lxml.
My code works totally fine on my local machine:
    import re
    import leaf  # simple wrapper for lxml

    weekdaysD = {u'понедельник': 1, u'вторник': 2, u'среда': 3,
                 u'четверг': 4, u'пятница': 5, u'суббота': 6}
    document = leaf.parse(leaf.strip_symbols(leaf.strip_accents(html_in_cp1251.decode('cp1251'))))
    table = document.get('table')
    trs = table('tr')  # leaf syntax
    for tr in trs:
        tds = tr.xpath('td')
        for td in tds:
            if td.colspan == '3':
                curweek = re.findall(r'\w+(?=\-)', td.text)[0]
                curday = weekdaysD[td.text.split(u',')[0]]
But when I deploy it to GAE, I get:
curday=weekdaysD[td.text.split(u',')[0]]
KeyError: u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'
How did non-Unicode characters get in there at all? And why does everything work locally? I've tried every variation of decoding/encoding placement in my code, and nothing helped. I've been stuck for a few days now.
UPD: Also, if I add this to my script on GAE:
print type(weekdaysD.keys()[0]), type(td.text.split(u',')[0])
it reports both as 'unicode'. So I believe the HTML was decoded correctly. Could it be something with lxml on GAE?
Answer 1:
The string in your error message has the type unicode, but its contents are actually the bytes of the UTF-8 encoding of вторник, each one stored as a separate character. It would help if you showed us the code that makes the urlfetch call, since there is nothing wrong with the code you are showing.
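To illustrate the point, here is a minimal sketch that reproduces the exact key from the KeyError. It assumes the text was encoded as UTF-8 at some step and the resulting bytes were then read back as Latin-1; the thread does not establish which component on GAE does this, so treat the variable names and the two-step reconstruction as illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch: how a correct unicode string can turn into the key seen in the KeyError.
word = u'вторник'

# Step 1 (assumed): something encodes the text as UTF-8 bytes.
utf8_bytes = word.encode('utf-8')

# Step 2 (assumed): those bytes are decoded as Latin-1, one character per byte.
garbled = utf8_bytes.decode('latin-1')

# The result is still a unicode string, so a type() check looks fine,
# but it is twice as long and no longer equals the dictionary key.
print(repr(garbled))
print(len(word), len(garbled))
print(garbled == word)
```

This is also why comparing type() of the two values does not reveal the problem: both are unicode objects, and only repr() or len() shows that the contents differ.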
Answer 2:
Well, the workaround of adding .encode('latin1').decode('utf-8', 'ignore') did the trick. I wish I could explain why it behaves this way.
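A likely reason the workaround helps: Latin-1 maps code points 0 to 255 onto the identical byte values, so encoding the garbled string as Latin-1 gives back the original UTF-8 bytes, which then decode correctly. The sketch below wraps that idea in a helper; repair_mojibake is a hypothetical name, not part of leaf, lxml, or GAE, and it returns the input unchanged when the round trip is not possible instead of using 'ignore', which would silently drop bytes.

```python
# -*- coding: utf-8 -*-
def repair_mojibake(text):
    """Undo 'UTF-8 bytes read as Latin-1' damage; return text unchanged otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        # Latin-1 turns each code point 0..255 back into the same byte value.
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        # Contains characters outside Latin-1 (e.g. real Cyrillic): already fine.
        return text
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Genuine Latin-1 text that is not valid UTF-8: leave it alone.
        return text


weekdays = {u'понедельник': 1, u'вторник': 2, u'среда': 3,
            u'четверг': 4, u'пятница': 5, u'суббота': 6}

# The key exactly as it appeared in the KeyError on GAE.
broken_key = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

fixed_key = repair_mojibake(broken_key)
day_number = weekdays[fixed_key]
print(day_number)
```

Because correctly decoded text passes through untouched, the same call can run both locally and on GAE. It remains a workaround; the root cause of the wrong decoding is not identified in this thread.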
Source: https://stackoverflow.com/questions/9793086/python-unicode-behaviour-in-google-app-engine