How can I strip comment tags from HTML using BeautifulSoup?

暖寄归人 2020-11-28 13:41

I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a speci

4 Answers
  • 2020-11-28 14:07

    If you would rather not mutate the tree, you can filter the comments out of the text nodes instead:

    from bs4 import Comment
    [t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
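For readers on BeautifulSoup 4 and Python 3, the non-mutating filter above might look like the following minimal sketch. The sample markup and the variable names are assumptions made up for illustration, and `string=True` is the newer spelling of `text=True`.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, not taken from the original question
html = "<body><p>Hello<!-- hidden note --> world</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Keep every text node that is not a Comment; the tree is not modified
text_nodes = [t for t in soup.find_all(string=True) if not isinstance(t, Comment)]
visible_text = "".join(text_nodes)

# The comment is still present in the parsed tree
still_has_comment = soup.find(string=lambda s: isinstance(s, Comment)) is not None
```

Because nothing is extracted, `soup` can still be serialized with its comments intact afterwards.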
    
  • 2020-11-28 14:10

    I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those slashes cause certain tags to be overlooked.

    This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:

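    import re, copy
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

    # Rewrite a malformed "<!-" comment opener as a proper "<!--"
    myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
    myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(myMassage)

    badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
    BeautifulSoup(badString, markupMassage=myNewMassage)
    # Foo<!--This comment is malformed.-->Bar<br />Baz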
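The `markupMassage` argument belongs to BeautifulSoup 3; BeautifulSoup 4 no longer accepts it and leaves markup repair to the parser. One possible workaround, sketched here with only the standard `re` module, is to apply the same pattern to the raw string before parsing. The helper name `repair_comment_openers` is hypothetical, not part of any library.

```python
import re

# Same pattern as in the answer above: "<!-" followed by a non-dash character
MALFORMED_COMMENT_OPEN = re.compile(r"<!-([^-])")


def repair_comment_openers(markup):
    """Rewrite malformed '<!-' comment openers as '<!--' (hypothetical helper)."""
    return MALFORMED_COMMENT_OPEN.sub(lambda m: "<!--" + m.group(1), markup)


bad_string = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
fixed = repair_comment_openers(bad_string)

# Well-formed comments do not match the pattern and pass through unchanged
untouched = repair_comment_openers("a<!--ok-->b")
```

The repaired string can then be handed to whichever parser is in use, so the comment is recognized as a comment and can be stripped like any other.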
    
  • 2020-11-28 14:11

    If you are looking for a solution in BeautifulSoup version 3, see the BS3 docs on Comment:

    import re
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
    # "nice" occurs in the comment text, so find() returns the Comment node
    comment = soup.find(text=re.compile("nice"))
    Comment = comment.__class__
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()
    print soup.prettify()
    
  • 2020-11-28 14:22

    Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():

    from BeautifulSoup import BeautifulSoup, Comment
    soup = BeautifulSoup("""1<!--The loneliest number-->
                            <a>2<!--Can be as bad as one--><b>3""")
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]
    print soup
    # 1
    # <a>2<b>3</b></a>
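The snippet above targets BeautifulSoup 3 on Python 2. A rough BeautifulSoup 4 equivalent, offered as a sketch that assumes the `bs4` package and the built-in `html.parser`, could look like this. The sample markup is adapted from the answer so that it is well-formed.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup adapted from the answer above, with tags closed explicitly
html = "<p>1<!--The loneliest number--></p><a>2<!--Can be as bad as one--><b>3</b></a>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so extracting nodes while looping over it is safe
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

cleaned = str(soup)     # markup with the comments removed
text = soup.get_text()  # remaining text only
```

After the loop, `get_text()` returns only the visible text, which matches the asker's stated goal of getting just the text from a page.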
    