Question
I'm pulling text from the Presidential debates. I got to one that has an issue: it errantly turns every mention of the word "debate" into a <debate> tag. Go ahead, search for "Welcome back to the Republican presidential"; notice an obvious word missing?
Cool, so BeautifulSoup does a superb job of cleaning up messy HTML and adding closing tags where they should have been. But in this case, that mucks me up, because <debate> is now a child of a <p> and the closing </debate> is added allllll the way at the end, thus nesting the rest of the debate inside that tag.
How do I tell BeautifulSoup to either ignore or remove <debate>? Or alternatively, how do I add a closing tag immediately after it? I've tried unwrap, but by the time I can call it, BS has already set up the closing tag at the end, and thus made the following paragraphs children rather than siblings.
Here's how I'm set up:
from bs4 import BeautifulSoup
import urllib
bad_debate = 'http://www.presidency.ucsb.edu/ws/index.php?pid=111395'
file = urllib.urlopen(bad_debate)
soup = BeautifulSoup(file)
My hunch is that I need to insert something between the URL call and BeautifulSoup, but for the life of me I can't figure out how to modify the file contents.
Answer 1:
The html5lib parser does a better job (than lxml or html.parser) of handling the debate element in this case:
soup = BeautifulSoup(file, "html5lib")
Here is how it handles the mentioned part of the debate:
<p>
<b>
BARTIROMO:
</b>
Welcome back to the Republican presidential
<debate>
here in North Charleston. Right back to the questions. [
<i>
applause
</i>
]
</debate>
</p>
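If the tag needs to be gone entirely rather than just closed sooner, the hunch in the question (change the markup between the download and the parse) can be sketched with a regular expression. This is only a rough sketch, not something from the answer above: the helper name strip_spurious_tags and the sample string are made up for illustration, and it assumes that every <debate> tag on the page stands in for the literal word "debate" and that no legitimate element has that name.

```python
import re

# Assumption: the spurious tag is always spelled "debate" (any letter case),
# possibly with attributes, and no real element on the page uses that name.
OPENING_TAG = re.compile(r"<debate\b[^>]*>", re.IGNORECASE)
CLOSING_TAG = re.compile(r"</debate\s*>", re.IGNORECASE)


def strip_spurious_tags(html):
    """Put the word "debate" back where an opening tag replaced it,
    and drop any stray closing tags."""
    html = OPENING_TAG.sub("debate", html)
    return CLOSING_TAG.sub("", html)


# Hypothetical sample modeled on the passage quoted in the question.
sample = (
    "<p><b>BARTIROMO:</b> Welcome back to the Republican presidential "
    "<debate> here in North Charleston.</p><p>Next paragraph.</p>"
)
cleaned = strip_spurious_tags(sample)
print(cleaned)
```

In the question's setup the raw markup would come from file.read() (decoded to text first on Python 3), and the cleaned string could then be handed to BeautifulSoup in place of the file object, so the parser never sees the tag and has nothing to close.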
Source: https://stackoverflow.com/questions/37009785/how-do-i-remove-a-spurious-tag-in-beautifulsoup