BeautifulSoup Grab Visible Webpage Text

前端未结

关注

 10  886

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

相关标签:

10条回答

悲&欢浪女

2020-11-22 07:51

If you care about performance, here's another more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator, and it returns NavigableString so that you can check the parent's tag name directly, without going through multiple loops.

0 讨论(0)

长情又很酷

2020-11-22 07:52

import urllib
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))

0 讨论(0)

孤独总比滥情好

2020-11-22 07:53
The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id "article".
```
soup.findAll('nyt_headline', limit=1)
```
Should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id "articleBody". Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It's difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.
```
text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
借酒劲吻你

2020-11-22 07:58
Using BeautifulSoup the easiest way with less code to just get the strings, without empty lines and crap.
```
tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print repr(i)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
傲寒

2020-11-22 07:59
I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content on a page.

I had a similar problem to get rendered content, or the visible content in a typical browser. In particular I had many perhaps atypical cases to work with such a simple example below. In this case the non displayable tag is nested in a style tag, and is not visible in many browsers that I have checked. Other variations exist such as defining a class tag setting display to none. Then using this class for the div.
```
<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>
```
One solution posted above is:
```
html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']
```
This solution certainly has applications in many cases and does the job quite well generally but in the html posted above it retains the text that is not rendered. After searching SO a couple solutions came up here BeautifulSoup get_text does not strip all tags and JavaScript and here Rendered HTML to plain text using Python

I tried both these solutions: html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...

One answer here from @Helge was about using nltk of all things.
```
import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop
```
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
```
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
不要未来只要你来

2020-11-22 08:00
The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ascii characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.
```
html = open('21storm.html').read()
soup = BeautifulSoup(html)
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页