BeautifulSoup Grab Visible Webpage Text

前端未结

关注

 10  634

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

相关标签:

10条回答

天涯浪人

2020-11-22 08:00
While, i would completely suggest using beautiful-soup in general, if anyone is looking to display the visible parts of a malformed html (e.g. where you have just a segment or line of a web-page) for whatever-reason, the the following will remove content between < and > tags:
```
import re ## only use with malformed html - this is not efficient
def display_visible_html_using_re(text): 
 return(re.sub("(\<.*?\>)", "",text))
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

故里飘歌

2020-11-22 08:04

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    if re.match(r"[\n]+",str(element)): return False
    return True
def text_from_html(url):
    body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body ,"lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    text = u",".join(t.strip() for t in visible_texts)
    text = text.lstrip().rstrip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.rstrip().lstrip()
            clean_text += sen+','
    return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))

0 讨论(0)

离开以前

2020-11-22 08:14

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

0 讨论(0)

孤独总比滥情好

2020-11-22 08:17
The simplest way to handle this case is by using getattr(). You can adapt this example to your needs:
```
from bs4 import BeautifulSoup

source_html = """

 <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
 3.7
 </a>

"""

soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)
```
This will find the text element,"3.7", within the tag object 3.7 when it exists, however, default to NoneType when it does not.

getattr(object, name[, default])

Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.
0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2