Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the
If you care about performance, here's another more efficient way:
import re
INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')
def visible_texts(soup):
""" get visible text from a document """
text = ' '.join([
s for s in soup.strings
if s.parent.name not in INVISIBLE_ELEMS
])
# collapse multiple spaces to two spaces.
return RE_SPACES.sub(' ', text)
soup.strings
is an iterator, and it returns NavigableString
so that you can check the parent's tag name directly, without going through multiple loops.