问题
I am using Python and Beautifulsoup to parse HTML-Data and get p-tags out of RSS-Feeds. However, some urls cause problems because the parsed soup-object does not include all nodes of the document.
For example I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm
But after comparing the parsed object with the pages source code, I noticed that all nodes after ul class="nextgen-left"
are missing.
Here is how I parse the Documents:
from bs4 import BeautifulSoup as bs
url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)
response = opener.open(request)
soup = bs(response,'lxml')
print soup
回答1:
The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib
parser handles this page correctly:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5')
>>> soup.find('div', id='story-body') is not None
True
来源:https://stackoverflow.com/questions/16316793/beautifulsoup-lost-nodes