Using a regex
Using a regex, you can clean everything inside <>:
import re

def cleanhtml(raw_html):
    # Non-greedy match of everything between '<' and the next '>'
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
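As a quick check of what this does, here is a minimal sketch that applies the same pattern to two sample strings. The name TAG_RE and the sample inputs are made up for illustration; the pattern is compiled once at module level rather than on every call, which is a small stylistic choice and not a change in behavior.

```python
import re

# Same pattern as above, compiled once at import time.
# TAG_RE is a name chosen for this example.
TAG_RE = re.compile('<.*?>')

def cleanhtml(raw_html):
    # Replace every tag match with the empty string.
    return re.sub(TAG_RE, '', raw_html)

print(cleanhtml('<p>Hello, <b>world</b>!</p>'))  # Hello, world!

# Entities are not tags, so they pass through untouched.
print(cleanhtml('<p>a&nbsp;b</p>'))  # a&nbsp;b
```

The second call shows the limitation: entities are left in the output.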
Some HTML text can also contain entities that are not enclosed in brackets, such as '&nbsp;'. If that is the case, then you might want to write the regex as
cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
This link contains more details on this.
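To make the effect of the extended pattern concrete, here is a small sketch that plugs it into the same function. The sample string and the name CLEANR are illustrative only.

```python
import re

# Pattern from the text above: tags, named entities,
# decimal references (&#160;) and hex references (&#x21;).
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    # Matches are deleted, not decoded.
    return re.sub(CLEANR, '', raw_html)

print(cleanhtml('<p>Fish&nbsp;&amp;&#160;chips&#x21;</p>'))  # Fishchips
```

Two things to keep in mind. The entities are removed rather than translated, so words separated only by '&nbsp;' run together. The character classes are also lowercase-only, so an entity such as '&Eacute;' is left in place unless you compile the pattern with re.IGNORECASE. If you would rather decode entities than drop them, html.unescape from the standard library does that.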
Using BeautifulSoup
You could also use the additional package BeautifulSoup to extract all the raw text.
You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers; it is much more robust than the default 'html.parser', which is available without an additional install.
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text
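If you don't have lxml installed, the same call works with the built-in parser. The sketch below assumes only that the bs4 package is available; the function name cleanhtml_bs and its parser argument are made up for this example.

```python
from bs4 import BeautifulSoup

def cleanhtml_bs(raw_html, parser="html.parser"):
    # "html.parser" ships with Python; pass "lxml" if that package is installed.
    return BeautifulSoup(raw_html, parser).get_text()

print(cleanhtml_bs('<p>Fish &amp; chips</p>'))  # Fish & chips
```

Unlike the regex, this decodes entities instead of deleting them.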
But this approach still requires external libraries, so I recommend the first solution.