The page that I'm scraping contains this HTML. How do I remove the comment tag, along with its content, with bs4?
You can use extract() (solution is based on this answer):
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
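from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")
div = soup.find('div', class_='foo')

# Calling a tag is shorthand for find_all(); comments are matched by type.
# (On bs4 older than 4.4.0 the argument is named text instead of string.)
for element in div(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())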
As a result, you get your div without comments:
<div class="foo">
cat dog sheep goat
</div>
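The same idea can be wrapped in a small helper that cleans a whole HTML string. This is only a minimal sketch, assuming bs4 is installed and the built-in html.parser is acceptable; the name strip_comments is made up for illustration and is not part of bs4.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return html with every comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting while looping over it is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


html = '<div class="foo">cat dog<!-- <p>test</p> --> sheep goat</div>'
cleaned = strip_comments(html)
print(cleaned)
```

Because the markup inside the comment is part of the comment string and is never parsed as a tag, removing the comment node also removes the <p>test</p> it contains.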
If you are looking for a solution for BeautifulSoup version 3 (adapted from this answer; see the BS3 docs on Comment):
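import re
from BeautifulSoup import BeautifulSoup  # BS3 import path; BS3 runs on Python 2 only

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Find the comment by a word it contains, then take its class.
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()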
Usually, modifying the bs4 parse tree is unnecessary. If all you want is the div's text, you can just read it:
soup.div.text
Out[18]: '\ncat dog sheep goat\n\n'
bs4 keeps the comment as a separate node, so it does not show up in the text. However, if you really need to modify the parse tree:
from bs4 import Comment

# Iterate over a copy: extracting nodes while walking .children can skip siblings.
for child in list(soup.div.children):
    if isinstance(child, Comment):
        child.extract()
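One caveat with this loop: .children only yields a tag's direct children, so a comment nested inside a deeper tag is not visited. The sketch below uses a made-up HTML snippet and html.parser (both are assumptions for illustration) to show the difference, and falls back to find_all(), which searches all descendants.

```python
from bs4 import BeautifulSoup, Comment

html = "<div><!-- top --><p>cat <!-- nested -->dog</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of the div: only the top-level comment is found here.
direct = [c for c in soup.div.children if isinstance(c, Comment)]

# find_all() searches all descendants, so it also reaches the nested comment.
everywhere = soup.div.find_all(string=lambda s: isinstance(s, Comment))

for comment in everywhere:
    comment.extract()

print(len(direct), len(everywhere))
print(soup)
```

If the comments you care about are always direct children of the div, the shorter .children loop is enough; otherwise the find_all() form is the safer choice.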