BeautifulSoup 4: Remove a comment tag and its content

挽巷 2020-12-31 04:35

The page I'm scraping contains this HTML. How do I remove the comment tag along with its content with bs4?

3 Answers
  • 2020-12-31 05:17

    You can use extract() (solution is based on this answer):

    PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

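    from bs4 import BeautifulSoup, Comment
    
    data = """<div class="foo">
    cat dog sheep goat
    <!--
    <p>test</p>
    -->
    </div>"""
    
    # Name the parser explicitly; html.parser ships with Python.
    soup = BeautifulSoup(data, "html.parser")
    
    div = soup.find('div', class_='foo')
    # Calling a tag is shorthand for find_all(); match only Comment nodes.
    for element in div(string=lambda s: isinstance(s, Comment)):
        element.extract()
    
    print(soup.prettify())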
    

    As a result you get your div without comments:

    <div class="foo">
        cat dog sheep goat
    </div>
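
The same idea can be wrapped in a small helper that strips every comment in a document, not just those inside one div. This is only a sketch: `strip_comments` is a hypothetical name, not part of bs4, and it assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return a soup of `html` with every comment node removed.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so extracting inside the loop is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup


html = '<div class="foo">cat dog<!-- <p>test</p> --><span>x<!-- y --></span></div>'
cleaned = strip_comments(html)
print(cleaned)
```

Because `find_all()` searches the whole tree, nested comments such as the one inside the `<span>` are removed as well.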
    
  • 2020-12-31 05:26

    Based on this answer: if you are looking for a solution for BeautifulSoup version 3, see the BS3 Docs - Comment:

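    # BeautifulSoup 3 runs on Python 2 only, hence the print statement.
    import re
    from BeautifulSoup import BeautifulSoup
    
    soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
    # Find the comment by a word it contains, then take its class.
    comment = soup.find(text=re.compile("nice"))
    Comment = comment.__class__
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()
    print soup.prettify()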
    
  • 2020-12-31 05:34

    Usually modifying the bs4 parse tree is unnecessary. You can just get the div's text, if that's what you wanted:

    soup.body.div.text
    Out[18]: '\ncat dog sheep goat\n\n'
    

    bs4 keeps the comment as a separate node, so it is not part of the div's text. However, if you really need to modify the parse tree:

    from bs4 import Comment
    
    # Iterate over a copy: extracting while walking .children can skip nodes.
    for child in list(soup.body.div.children):
        if isinstance(child, Comment):
            child.extract()
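
To see both points in one place, here is a rough sketch that checks that `get_text()` skips the comment while the comment node is still in the tree, and then removes it. It assumes the `html.parser` backend, which does not wrap a fragment in `<html>`/`<body>`, so the div is reached as `soup.div` here and not `soup.body.div`.

```python
from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

soup = BeautifulSoup(data, "html.parser")
div = soup.div

# get_text() returns only regular strings; the comment is left out.
text = div.get_text()
print(repr(text))

# The comment is still a node in the tree until it is extracted.
comments = [c for c in div.children if isinstance(c, Comment)]
print(len(comments))

# Copy the children into a list before extracting, so that removal
# does not disturb the iteration.
for child in list(div.children):
    if isinstance(child, Comment):
        child.extract()

print("<!--" in str(div))
```

Note that `.children` only covers direct children of the div; comments nested deeper would need `find_all()` or `.descendants`.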
    