Here is what I have so far:
from bs4 import BeautifulSoup
def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):
def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on two consecutive spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
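To see what the whitespace cleanup in the function does on its own, here is a minimal sketch that runs only those three steps on a plain string. The helper name `normalize_whitespace` and the sample text are made up for illustration, and it assumes the split delimiter is two consecutive spaces (with a single space, every word would end up on its own line).

```python
def normalize_whitespace(text):
    """Hypothetical helper: only the whitespace steps of cleanMe, applied to plain text."""
    # strip leading and trailing space on each line
    lines = (line.strip() for line in text.splitlines())
    # assumption: two consecutive spaces separate headlines that share a line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank entries and join the rest, one per line
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  First headline  Second headline \n\n   Body text here\n"
result = normalize_whitespace(sample)
print(result)
# First headline
# Second headline
# Body text here
```

Single spaces inside a phrase are left alone, so ordinary sentences stay on one line.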
This version removes the specified tags and the HTML comments in a clean manner. Note that it returns the soup object rather than plain text. Thanks to Kim Hyesung for this code.
from bs4 import BeautifulSoup
from bs4 import Comment
def cleanMe(html):
    soup = BeautifulSoup(html, "html5lib") # requires: pip install html5lib
    # remove the unwanted elements together with their contents
    for tag in soup.find_all(['script', 'style', 'meta', 'noscript']):
        tag.extract()
    # remove HTML comments (string= replaces the older text= argument)
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup
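Since that function hands back a soup object, you still have to choose between the remaining markup and the plain text. Here is a rough sketch of both. It assumes the built-in "html.parser" instead of html5lib (so nothing extra needs installing and no html/body wrapper is added), and the sample markup is made up.

```python
from bs4 import BeautifulSoup, Comment

html = "<div><!-- tracking note --><p>Visible text</p><script>var x = 1;</script></div>"

# assumption: built-in parser, not the html5lib parser used in the answer
soup = BeautifulSoup(html, "html.parser")

# comments are a string subclass, so they can be selected by type
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
for comment in comments:
    comment.extract()
for tag in soup.find_all("script"):
    tag.extract()

remaining_markup = str(soup)              # '<div><p>Visible text</p></div>'
visible_text = soup.get_text(strip=True)  # 'Visible text'
print(remaining_markup)
print(visible_text)
```

Use `str(soup)` if you want cleaned HTML, and `get_text()` if you only want the text.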
If you want a quick and dirty solution, you can use:
re.sub(r'<[^>]*?>', '', value)
This gives you a rough equivalent of strip_tags in PHP. Is that what you want?
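One caveat with the regex: it removes the tags themselves but keeps whatever sits between them, including JavaScript and CSS. The sketch below shows that, plus one possible workaround that drops whole script/style blocks first. The helper names and the second pattern are my own assumptions, not part of the answer above, and regexes will still trip over unusual markup.

```python
import re


def strip_tags(value):
    """Hypothetical wrapper around the one-line regex."""
    return re.sub(r'<[^>]*?>', '', value)


def strip_tags_and_code(value):
    # assumption: remove whole <script>/<style> blocks first, then the remaining tags
    without_blocks = re.sub(r'(?is)<(script|style)\b[^>]*>.*?</\1\s*>', '', value)
    return strip_tags(without_blocks)


sample = "<p>Hello <b>world</b></p><script>var x = 1;</script>"
plain = strip_tags(sample)             # 'Hello worldvar x = 1;'
cleaner = strip_tags_and_code(sample)  # 'Hello world'
print(plain)
print(cleaner)
```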
You can use decompose to completely remove the tags (and their contents) from the document, and the stripped_strings generator to retrieve the remaining text.
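from bs4 import BeautifulSoup
def clean_me(html):
    soup = BeautifulSoup(html, "html.parser") # name the parser to avoid the "no parser specified" warning
    for s in soup(['script', 'style']):
        s.decompose() # destroy the tag and everything inside it
    # stripped_strings yields each text fragment with surrounding whitespace removed
    return ' '.join(soup.stripped_strings)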
>>> clean_me(testhtml)
'THIS IS AN EXAMPLE I need this text captured And this'
Using lxml instead:
# Requirements: pip install lxml
# Note: since lxml 5.2 the cleaner lives in a separate package: pip install lxml_html_clean
import lxml.html.clean
def cleanme(content):
    cleaner = lxml.html.clean.Cleaner(
        allow_tags=[''], # keep no tags at all
        remove_unknown_tags=False, # has to be False when allow_tags is given
        style=True, # also remove <style> tags
    )
    html = lxml.html.document_fromstring(content)
    html_clean = cleaner.clean_html(html)
    return html_clean.text_content().strip()
testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
cleaned = cleanme(testhtml)
print(cleaned)