BeautifulSoup Grab Visible Webpage Text

北恋 2020-11-22 07:35

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

10 Answers
  • 2020-11-22 08:00

    While I would generally recommend Beautiful Soup, if you need to display the visible parts of malformed HTML (e.g. when you only have a fragment or a single line of a web page), the following removes everything between < and >:

    import re   # only use with malformed HTML; this is not efficient
    def display_visible_html_using_re(text):
        # Remove every "<...>" tag; text between tags (including script or style bodies) is kept.
        return re.sub(r"<.*?>", "", text)
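As a quick illustration of what the regex approach does and does not remove, here is a minimal sketch. The function name `strip_tags` and the sample fragment are made up for this example, and `re.DOTALL` is added as an assumption so that a tag split across lines is still matched.

```python
import re

def strip_tags(fragment):
    # Non-greedy match of anything between "<" and ">".
    # re.DOTALL (an addition in this sketch) lets a tag span several lines.
    return re.sub(r"<.*?>", "", fragment, flags=re.DOTALL)

# Hypothetical fragment: one paragraph followed by an inline script.
fragment = '<p>Hello <b>world</b></p><script>var x = 1;</script>'
result = strip_tags(fragment)
print(result)
```

Only the tags themselves disappear, so the script body `var x = 1;` still ends up in the result. That is the main reason to prefer a parser-based filter whenever the input is a complete page.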
    
  • 2020-11-22 08:04
    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request
    import re
    import ssl
    
    def tag_visible(element):
        # Text inside these tags is never rendered by a browser.
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # HTML comments are not visible either.
        if isinstance(element, Comment):
            return False
        # Skip nodes that begin with a newline (usually whitespace between tags).
        if re.match(r"[\n]+", str(element)):
            return False
        return True

    def text_from_html(url):
        # Caution: an unverified SSL context skips certificate checks.
        body = urllib.request.urlopen(url, context=ssl._create_unverified_context()).read()
        soup = BeautifulSoup(body, "lxml")
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        # Join the fragments with commas, then drop the empty pieces.
        text = ",".join(t.strip() for t in visible_texts)
        clean_text = ''
        for sen in text.strip().split(','):
            sen = sen.strip()
            if sen:
                clean_text += sen + ','
        return clean_text
    url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
    print(text_from_html(url))
    
  • 2020-11-22 08:14

    Try this:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request
    
    
    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True
    
    
    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        # find_all(string=True) returns every text node, including comments and script bodies.
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        return " ".join(t.strip() for t in visible_texts)
    
    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    print(text_from_html(html))
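To try the same filter without a network request, the idea can be sketched against an inline HTML string. The names `HIDDEN_PARENTS`, `visible_text`, and `sample_html` are assumptions made for this example, and empty fragments are dropped before joining, which the code above does not do.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parents whose text a browser does not render (same list as in the answer).
HIDDEN_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}

def tag_visible(element):
    if element.parent.name in HIDDEN_PARENTS:
        return False
    # Comments are text nodes too, so they are filtered out explicitly.
    return not isinstance(element, Comment)

def visible_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    fragments = (t.strip() for t in soup.find_all(string=True) if tag_visible(t))
    # Drop whitespace-only nodes so the result has no doubled spaces.
    return " ".join(f for f in fragments if f)

# Hypothetical page with a title, a style sheet, a comment, and a script.
sample_html = """<html><head><title>Storm</title><style>p {color: red}</style></head>
<body><!-- hidden note --><p>Snow fell on Sunday.</p><p>Roads closed.</p><script>var x = 1;</script></body></html>"""

print(visible_text(sample_html))
```

The title, style rules, comment, and script body are all skipped, and only the two paragraphs remain. Note that this filter looks at tag names only; it does not evaluate CSS, so text hidden with `display: none` would still be returned.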
    
  • 2020-11-22 08:17

    The simplest way to handle this case is by using getattr(). You can adapt this example to your needs:

    from bs4 import BeautifulSoup
    
    source_html = """
    <span class="ratingsDisplay">
        <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
            <span class="ratingsContent">3.7</span>
        </a>
    </span>
    """
    
    soup = BeautifulSoup(source_html, "lxml")
    my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
    print(my_ratings)
    

    This returns the text, "3.7", of the tag <span class="ratingsContent">3.7</span> when the tag exists, and falls back to None when it does not.

    getattr(object, name[, default])

    Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.
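The default-value behaviour described above can be checked without any parser. In this sketch, `FakeTag` and `text_or_none` are hypothetical names; `FakeTag` merely stands in for a parsed tag, and `None` stands in for what `soup.find()` returns when nothing matches.

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; it only has a .text attribute."""
    def __init__(self, text):
        self.text = text

def text_or_none(node):
    # The third argument is the fallback used when the attribute is missing.
    return getattr(node, "text", None)

found = FakeTag("3.7")
missing = None  # what soup.find() returns when no tag matches

print(text_or_none(found))
print(text_or_none(missing))
```

Without the third argument, `getattr(missing, "text")` raises `AttributeError`, which is the same failure you get from writing `soup.find(...).text` on a page where the tag is absent.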
