BeautifulSoup 4: Remove comment tag and its content

Posted by 夙愿已清 on 2020-06-07 21:08:12

Question


The page that I'm scraping contains the HTML below. How do I remove the comment tag <!-- --> along with its content with bs4?

<div class="foo">
cat dog sheep goat
<!-- 
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->

</div>

Answer 1:


You can use extract() (solution is based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

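soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly

div = soup.find('div', class_='foo')
# Calling a tag is shorthand for find_all(); match only Comment nodes.
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())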

As a result you get your div without comments:

<div class="foo">
    cat dog sheep goat
</div>
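The same idea can be wrapped in a small reusable function. This is only a sketch: the name strip_comments is hypothetical (not part of bs4), and it assumes Python 3 with the built-in 'html.parser' backend.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return the markup with every comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so removing nodes inside the loop is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


cleaned = strip_comments('<div class="foo">cat dog<!-- <p>hidden</p> --></div>')
print(cleaned)
```

Unlike the snippet above, this removes comments from the whole document rather than from a single div; call find_all() on a specific tag instead if only part of the tree should be cleaned.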



Answer 2:


Usually modifying the bs4 parse tree is unnecessary. You can just get the div's text, if that's what you wanted:

soup.body.div.text
Out[18]: '\ncat dog sheep goat\n\n'

bs4 keeps the comment as a separate node, so it is not included in the text. However, if you really need to modify the parse tree:

from bs4 import Comment

# Iterate over a copy: extract() changes the div's child list mid-loop.
for child in list(soup.body.div.children):
    if isinstance(child, Comment):
        child.extract()
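Both points can be checked on a small input. The sketch below assumes Python 3 and the 'html.parser' backend; the markup and variable names are made up for illustration. Two comments are placed side by side because that is the case where removing nodes while looping over the live child list can skip one.

```python
from bs4 import BeautifulSoup, Comment

html = '<div class="foo">cat<!-- one --><!-- two -->dog</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="foo")

# get_text() leaves out comment nodes, so reading text needs no tree changes.
text_only = div.get_text()

# Copy the children into a list first, because extract() mutates the
# tag's contents while the loop is running.
for child in list(div.children):
    if isinstance(child, Comment):
        child.extract()

remaining = str(div)
print(text_only)
print(remaining)
```

If only the visible text is needed, the first half is enough; the second half is only for cases where the cleaned markup itself has to be kept.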



Answer 3:


If you are looking for a solution for BeautifulSoup version 3, see the section on Comment in the BS3 docs (adapted from this answer):

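import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 only

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Find the comment by a word it contains, then take its class.
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()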


Source: https://stackoverflow.com/questions/23299557/beautifulsoup-4-remove-comment-tag-and-its-content
