Question
Continuing my attempt to pull transcripts from the Presidential debates, I've now started using html5lib as a parser with BeautifulSoup.
But now, when I run (previously working) code to find the element containing the actual transcript, it errors out because no such span is found.
Here's the code:
from bs4 import BeautifulSoup
import html5lib
import urllib
file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
soup = BeautifulSoup(file, "html5lib")
transcript = soup.find_all("span", class_="displaytext")[0]
And here's the error:
IndexError
Traceback (most recent call last)
<ipython-input-5-2c227e8c4a25> in <module>()
1 file = urllib.urlopen('http://www.presidency.ucsb.edu/ws/index.php?pid=111395')
2 soup = BeautifulSoup(file, "html5lib")
----> 3 transcript = soup.find_all("span", class_="displaytext")[0]
IndexError: list index out of range
And here's the relevant part of the page I'm requesting, proving I'm not crazy: there is a span with class 'displaytext'.
<span class="displaytext">
<b>
PARTICIPANTS:
</b>
<br/>
Former Governor Jeb Bush (FL);
What am I missing? If I run this without passing "html5lib" to the BeautifulSoup call, it works fine (but I get later errors due to spurious tags with no corresponding closing tag).
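To narrow down whether the parser choice alone changes the result, the same markup can be fed to each parser and the matches counted. This is only a minimal sketch: the helper name count_matches and the sample string are hypothetical (the sample is a stand-in, not the real page markup), and it assumes bs4 is installed; parsers that are not installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_matches(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <span class="displaytext"> elements found}.

    Hypothetical helper for comparing parsers; not part of BeautifulSoup.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library is not installed; record that instead of failing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("span", class_="displaytext"))
    return counts


# Hypothetical stand-in for the downloaded page (assumption: well-formed markup).
sample = (
    '<html><body><span class="displaytext"><b>PARTICIPANTS:</b><br/>'
    "Former Governor Jeb Bush (FL);</span></body></html>"
)
print(count_matches(sample))
```

Running this on the real downloaded HTML instead of the sample string should show which parsers see the span and which do not, which helps separate a markup problem from a parser problem.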
Source: https://stackoverflow.com/questions/37052097/html5lib-makes-beautifulsoup-miss-an-element