html5lib makes BeautifulSoup miss an element

Posted by 倖福魔咒の on 2019-12-08 08:43:02

Question


Continuing my attempt to pull transcripts from the Presidential debates, I've now started using html5lib as a parser with BeautifulSoup.

But now, when I run (previously working) code to find the element containing the actual transcript, it errors out and claims there is no such span.

Here's the code:

from bs4 import BeautifulSoup
import html5lib
import urllib

file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
soup = BeautifulSoup(file, "html5lib")
transcript = soup.find_all("span", class_="displaytext")[0]

And here's the error:

IndexError                                Traceback (most recent call last)
<ipython-input-5-2c227e8c4a25> in <module>()
  1 file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
  2 soup = BeautifulSoup(file, "html5lib")
----> 3 transcript = soup.find_all("span", class_="displaytext")[0]

IndexError: list index out of range

And here's the relevant part of the page I'm requesting, which shows I'm not crazy: there is a span with class 'displaytext'.

 <span class="displaytext">
           <b>
            PARTICIPANTS:
           </b>
           <br/>
           Former Governor Jeb Bush (FL);

What am I missing? If I run this without passing "html5lib" to the BeautifulSoup call, it works fine (but I get errors later on because of spurious tags that have no corresponding closing tag).
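
One way to narrow this down is to run the same markup through each parser and count how many matching spans each one produces. The following is only a sketch of that check, not a fix: `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page (the real markup may differ), `count_matches` is a helper name made up for illustration, and parsers that are not installed are reported as `None`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<span class="displaytext"><b>PARTICIPANTS:</b><br/>Former Governor Jeb Bush (FL);</span>
</body></html>
"""


def count_matches(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of span.displaytext elements it finds.

    A parser that is not installed maps to None instead of a count.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
            continue
        results[name] = len(soup.find_all("span", class_="displaytext"))
    return results


if __name__ == "__main__":
    for name, count in count_matches(SAMPLE_HTML).items():
        print(name, count)
```

Passing the text of the real page instead of `SAMPLE_HTML` would show whether the span is missing only from the tree that html5lib builds, or from the other parsers' trees as well.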

Source: https://stackoverflow.com/questions/37052097/html5lib-makes-beautifulsoup-miss-an-element
