Beautifulsoup lost nodes

问题

I am using Python and Beautifulsoup to parse HTML-Data and get p-tags out of RSS-Feeds. However, some urls cause problems because the parsed soup-object does not include all nodes of the document.

For example I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm

But after comparing the parsed object with the pages source code, I noticed that all nodes after ul class="nextgen-left" are missing.

Here is how I parse the Documents:

from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)

response = opener.open(request) 

soup = bs(response,'lxml')        
print soup

回答1:

The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5')
>>> soup.find('div', id='story-body') is not None
True

来源：https://stackoverflow.com/questions/16316793/beautifulsoup-lost-nodes

标签

python

beautifulsoup

html5lib

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!