Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the
While I would recommend using BeautifulSoup in general, if anyone needs to display the visible parts of malformed HTML (e.g. where you have just a segment or a single line of a web page) for whatever reason, the following will remove everything between < and >, i.e. the tags themselves:
import re  # only use with malformed HTML; this is not efficient

def display_visible_html_using_re(text):
    # Non-greedy match, so each <...> tag is removed separately.
    return re.sub(r"<.*?>", "", text)
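As a rough illustration of what the regex approach does and does not handle, here is a minimal sketch. The sample fragments are made up for this example; the function is the one from the answer above.

```python
import re

def display_visible_html_using_re(text):
    # Strip anything that looks like a tag; the text between tags is kept.
    return re.sub(r"<.*?>", "", text)

# Hypothetical fragment of malformed HTML (unclosed <p>).
fragment = "<p>Snow fell <b>overnight</b>."
print(display_visible_html_using_re(fragment))

# Limitation: only the tags are removed, so script contents survive.
with_script = "<script>var x = 1;</script><p>Hi</p>"
print(display_visible_html_using_re(with_script))
```

The second call shows why this is only a fallback: the regex knows nothing about which elements are rendered, so script and style bodies stay in the output.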
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    # Text inside these tags is never rendered on the page.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    # Skip text nodes that begin with a newline (typically whitespace between tags).
    if re.match(r"[\n]+", str(element)):
        return False
    return True

def text_from_html(url):
    # Note: an unverified SSL context disables certificate checks.
    body = urllib.request.urlopen(url, context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body, "lxml")
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    text = ",".join(t.strip() for t in visible_texts)
    # Split on commas again, trim each piece, and drop the empty ones.
    clean_text = ''
    for sen in text.split(','):
        sen = sen.strip()
        if sen:
            clean_text += sen + ','
    return clean_text

url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    # Text inside these tags is never rendered on the page.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # HTML comments are text nodes too, but they are not visible.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
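To check the filter without a network request, the same idea can be run against an inline string. This is a sketch under assumptions: the sample HTML and the helper name visible_text are made up for illustration, and empty pieces are dropped so that whitespace between tags does not leave extra spaces in the result.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page, used only for this illustration.
SAMPLE_HTML = """<html><head><title>Storm report</title><style>p {color: red;}</style></head>
<body><!-- hidden note --><p>Snow fell overnight.</p><script>var x = 1;</script><p>Roads reopened.</p></body></html>"""

def tag_visible(element):
    # Same rule as in the answer above: skip non-rendered tags and comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def visible_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    pieces = (t.strip() for t in soup.find_all(string=True) if tag_visible(t))
    # Drop pieces that were whitespace only before joining.
    return " ".join(p for p in pieces if p)

print(visible_text(SAMPLE_HTML))
```

The title, the style rule, the script body, and the comment are all filtered out; only the two paragraph texts should remain.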
The simplest way to handle this case is to use getattr(). You can adapt this example to your needs:
from bs4 import BeautifulSoup
source_html = """
<span class="ratingsDisplay">
<a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
<span class="ratingsContent">3.7</span>
</a>
</span>
"""
soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)
This returns the text, "3.7", of the tag <span class="ratingsContent">3.7</span> when it exists, and defaults to None when it does not.
getattr(object, name[, default])
Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.
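The fallback behaviour can be seen without BeautifulSoup at all. In the sketch below, FakeTag and text_or_none are hypothetical names introduced for illustration: FakeTag stands in for the tag object that soup.find() returns on a match, and a plain None stands in for the no-match case.

```python
class FakeTag:
    """Hypothetical stand-in for the tag object returned by soup.find()."""
    def __init__(self, text):
        self.text = text

def text_or_none(node):
    # None has no "text" attribute, so getattr returns the default
    # instead of raising AttributeError.
    return getattr(node, "text", None)

print(text_or_none(FakeTag("3.7")))  # match found: the tag's text
print(text_or_none(None))            # no match: falls back to None

# Without getattr, the no-match case raises.
try:
    None.text
except AttributeError as exc:
    print("AttributeError:", exc)
```

The three-argument form keeps the lookup on one line; an explicit "if tag is not None" check would work just as well where a different default is not needed.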