BeautifulSoup Grab Visible Webpage Text

北恋 2020-11-22 07:35

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

10 Answers
  •  悲&欢浪女
    2020-11-22 07:51

    If you care about performance, here is another, more efficient way:

    import re
    
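    # Tags whose direct text content is not rendered on the page.
    INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
    # Matches runs of three or more whitespace characters.
    RE_SPACES = re.compile(r'\s{3,}')
    
    def visible_texts(soup):
        """Return the visible text of a parsed document as one string."""
        text = ' '.join(
            s for s in soup.strings
            if s.parent.name not in INVISIBLE_ELEMS
        )
        # Collapse each run of three or more whitespace characters to two spaces.
        return RE_SPACES.sub('  ', text)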
    

    soup.strings is a generator that yields NavigableString objects, so you can check each string's parent tag name directly, in a single pass over the document rather than several loops.
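
As a rough illustration of the parent-name check described above, here is a minimal usage sketch. The SAMPLE_HTML string and the choice of the 'html.parser' backend are assumptions made for the demo, not part of the original answer; the function body follows the answer's approach.

```python
import re
from bs4 import BeautifulSoup

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """Join every string whose immediate parent is not an invisible tag."""
    text = ' '.join(
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    )
    # Collapse each run of three or more whitespace characters to two spaces.
    return RE_SPACES.sub('  ', text)

# Hypothetical sample page, written without whitespace between tags
# so that the joined result is easy to read.
SAMPLE_HTML = (
    '<html><head><title>Demo</title>'
    '<style>p { color: red; }</style></head>'
    '<body><h1>Heading</h1><p>First paragraph.</p>'
    '<script>var hidden = 1;</script></body></html>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
result = visible_texts(soup)
print(result)
```

Note that the filter looks only at each string's immediate parent, which is presumably why 'title' is listed alongside 'head': the title text sits inside the title tag, not directly inside head. Text nested more deeply inside a hidden element would need an ancestor check instead.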
