BeautifulSoup Grab Visible Webpage Text

前端未结

关注

 10  639

北恋 2020-11-22 07:35

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

10条回答

悲&欢浪女 (楼主)

2020-11-22 07:51

If you care about performance, here's another more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator, and it returns NavigableString so that you can check the parent's tag name directly, without going through multiple loops.

0 讨论(0)

查看其它10个回答