I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.
The best piece of code I found for extracting text without picking up JavaScript or other unwanted content:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
You just need to install BeautifulSoup first:
pip install beautifulsoup4
Another example, using BeautifulSoup4 on Python 2.7.9+.
Imports:
import urllib2
from bs4 import BeautifulSoup
Code:
def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # remove script and style elements
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # strip each line, split on double spaces, drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))
Explained:
Read the URL's data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break the text into lines and strip leading and trailing whitespace from each, then break multi-headlines into one chunk each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join(...), drop the blank chunks and finally return the result encoded as UTF-8.
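The cleanup steps described above can be tried on a literal string, without fetching anything. This is only a sketch: the function name tidy_text and the sample string are made up for illustration, and it assumes the split is on a double space so that ordinary sentences stay on one line.

```python
def tidy_text(text):
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces, which often separate headline fragments.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and put each remaining chunk on its own line.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Top story  Second story \n\n\n   Body text with single spaces.  \n"
print(tidy_text(raw))
# Top story
# Second story
# Body text with single spaces.
```

Splitting on a single space instead would put every word on its own line, which is probably not what you want.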
Notes:
On some systems this will fail for https:// connections because of SSL certificate verification errors. You can turn off verification to work around that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have issues running this.
text.encode('utf-8') can leave odd-looking encoded output; you may want to just return str(text) instead.
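For the SSL note, one possible workaround in Python 3 is to pass urlopen an SSL context with verification switched off. This is a sketch, not necessarily what the linked post does; the function names make_unverified_context and fetch_without_verification are hypothetical. Skipping verification means the server's identity is not checked, so it is probably best kept to pages you trust or to debugging.

```python
import ssl
from urllib.request import urlopen


def make_unverified_context():
    # Start from the default settings, then switch certificate checks off.
    context = ssl.create_default_context()
    # check_hostname has to be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_without_verification(url):
    # Download the raw bytes of a page without verifying its certificate.
    with urlopen(url, context=make_unverified_context()) as response:
        return response.read()
```

The bytes returned by fetch_without_verification(url) can be passed straight to BeautifulSoup(html, "html.parser").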
You can extract only the text from HTML with BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con, 'html.parser')
texts = soup.get_text()
print(texts)
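get_text() also accepts separator and strip arguments, which can help with run-together words and stray whitespace. A minimal sketch on an inline document, assuming the built-in html.parser backend; SAMPLE_HTML and the function name visible_text are made up for illustration.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <h1>Heading</h1>
  <p>First paragraph.</p>
  <script>var hidden = 1;</script>
  <p>Second paragraph.</p>
</body></html>
"""


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so their content never reaches the output.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Put each text fragment on its own line and strip surrounding whitespace.
    return soup.get_text(separator="\n", strip=True)


print(visible_text(SAMPLE_HTML))
```

One thing to watch: with separator="\n", text inside inline tags such as <b> or <a> also lands on its own line, so a space separator may suit prose better.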
@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test doc embedded.
from bs4 import BeautifulSoup, NavigableString
def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue
            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip()):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc
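If installing BeautifulSoup is not an option, a similar idea (skip script and style, keep paragraph breaks) can be sketched with only the standard library's html.parser. This is a different technique from the function above, not a port of it: it does not replace links with their href, and the names TextExtractor and html_to_plain_text as well as the set of block-level tags are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text, skipping <script>/<style> and breaking lines at block tags."""

    SKIPPED = {"script", "style"}
    # Illustrative, incomplete set of tags that should start a new line.
    BLOCK = {"p", "div", "br", "li", "tr", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse whitespace inside each line and drop the empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


SAMPLE = ('<html><head><title>Demo</title><style>p { color: red; }</style></head>'
          '<body><h1>Heading</h1><p>Hello <a href="https://example.com">link</a> world.</p>'
          '<script>var hidden = 1;</script><p>Bye.</p></body></html>')
print(html_to_plain_text(SAMPLE))
```

Because inline tags such as <a> do not add a newline here, "Hello link world." stays on one line.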
Here's the code I use on a regular basis.
from bs4 import BeautifulSoup
import urllib.request
def processText(webpage):
    # webpage is expected to be a regex match object; .group() returns the matched URL
    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []
    try:
        news_open = urllib.request.urlopen(webpage.group())
        # THE "lxml" PARSER REQUIRES THE lxml PACKAGE
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text=True)
        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())
            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)
    except urllib.error.HTTPError:
        pass
    return proc_text
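The paragraph-by-paragraph idea can also be tried on an inline string, without a network call. A sketch under a few assumptions: the function name paragraphs_from_html and the sample markup are made up, and it uses the built-in html.parser instead of lxml. It also calls find_all("p") without text=True, since as far as I understand that filter only matches tags whose .string is set, so paragraphs containing nested tags such as <b> may be skipped.

```python
from bs4 import BeautifulSoup


def paragraphs_from_html(html):
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        # Split and re-join to collapse runs of whitespace.
        text = " ".join(p.get_text().split())
        # Skip paragraphs that contain no visible text.
        if text:
            paragraphs.append(text)
    return paragraphs


sample = "<div><p>  Plain   paragraph. </p><p>With <b>bold</b> text.</p><p> </p></div>"
print(paragraphs_from_html(sample))
# ['Plain paragraph.', 'With bold text.']
```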
I hope that helps.