Encoding error while parsing RSS with lxml

偶尔善良 submitted on 2019-12-10 02:38:12

Question


I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.

import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

Answer 1:


You should probably set the character encoding yourself only as a last resort, since the XML prolog already declares it (and the HTTP headers may as well). It's unnecessary to pass the encoding to etree.XMLParser unless you want to override the declared encoding, so get rid of the encoding parameter and it should work.

Edit: okay, the problem actually seems to be with lxml. The following works, for whatever reason:

from lxml import etree

parser = etree.XMLParser(ns_clean=True, recover=True)
tree = etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
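If you want to check what the prolog declares before deciding whether to override anything, a small helper along these lines could do it. This is only a sketch in Python 3 syntax using the standard library: `declared_encoding` is a made-up name, the regex is a simplification, and it assumes an ASCII-compatible encoding (it does not handle byte order marks or UTF-16 documents).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration such as
# <?xml version="1.0" encoding="utf-8"?> at the very start of the data.
_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration of `data` (bytes).

    Hypothetical helper: falls back to `default` when no declaration
    with an encoding attribute is found.
    """
    match = _DECL_RE.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii")


sample = b'<?xml version="1.0" encoding="utf-8"?>\n<rss/>'
print(declared_encoding(sample))
```

For the feed in the question this would report utf-8, which is why guessing with chardet should not be needed here.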



Answer 2:


I ran into a similar problem, and it turns out this has NOTHING to do with encodings. lxml is reporting a totally unrelated error: the .parse function expects a filename or URL (or a file-like object), not a string holding the document contents. When lxml tries to format that error message, it chokes on the non-ASCII characters and shows this completely confusing UnicodeDecodeError instead. It is highly unfortunate, and other people have commented on this issue here:

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html

Luckily, yours is a very easy fix. Just replace .parse with .fromstring and you should be totally good to go:

import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)

## lxml Y U NO MAKE SENSE!!!
# fromstring() takes the document contents; parse() wants a filename or URL
tree = etree.fromstring(response, parser)

Just tested this on my machine and it worked fine. Hope it helps!
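To make the parse/fromstring split concrete, here is a rough sketch in Python 3 syntax. Note the assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml installed (lxml.etree offers parse() and fromstring() with the same split), and the tiny feed is made up for illustration rather than taken from the real URL.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the downloaded feed: bytes that carry their own
# encoding declaration and contain a non-ASCII character.
FEED = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<rss><channel><title>Wiadomo\u015bci</title></channel></rss>'
).encode("utf-8")

# Option 1: the document is already in memory, so hand it to fromstring().
root_a = ET.fromstring(FEED)

# Option 2: parse() wants a filename or a file-like object, so wrap the bytes.
root_b = ET.parse(io.BytesIO(FEED)).getroot()

title_a = root_a.findtext("channel/title")
title_b = root_b.findtext("channel/title")
print(title_a == title_b)
```

Either way the bytes go in undecoded and the parser picks up the encoding from the declaration, so no chardet guess is involved.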




Answer 3:


It's often easier to load the string and sort out its encoding yourself, then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.

This particular RSS file begins with an encoding declaration, so everything should just work:

<?xml version="1.0" encoding="utf-8"?>

The following code shows some of the variations you can apply to make etree parse input in different encodings. You can also ask it to write output in a different encoding, which then appears in the XML declaration of the output.

import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
        # ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']

uresponse = response.decode("utf8")
print [uresponse]    
        # [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']

tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
        # ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']

lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
        # ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci..."]


# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])   

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)

Code can be tried here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
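Slicing off exactly 38 characters only works for this one declaration. If you really need to feed an already-decoded string to fromstring, a slightly less brittle sketch is to strip whatever declaration is there with a regex. This is Python 3 syntax, `strip_xml_declaration` is a made-up helper name, and the sample document is invented; passing the original bytes to fromstring (as in the first call above) is usually the simpler route.

```python
import re

# An XML declaration at the start of the text (leading whitespace is
# tolerated here for leniency), plus any whitespace that follows it.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return `text` without a leading <?xml ...?> declaration, if any."""
    return _XML_DECL.sub('', text, count=1)


declaration = '<?xml version="1.0" encoding="utf-8"?>'
document = declaration + '\n<feed><title>Wiadomo\u015bci</title></feed>'

body = strip_xml_declaration(document)
print(len(declaration))
print(body.startswith('<feed>'))
```

The result can then be passed to fromstring without triggering the "Unicode strings with encoding declaration are not supported" ValueError shown above.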



Source: https://stackoverflow.com/questions/5812009/encoding-error-while-parsing-rss-with-lxml
