Here is what I have so far:
from bs4 import BeautifulSoup
def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loa
You can use decompose() to remove the script and style tags, together with their contents, from the document, and then use the stripped_strings generator to retrieve the remaining text with surrounding whitespace stripped.
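def clean_me(html):
    soup = BeautifulSoup(html)
    # calling soup([...]) is shorthand for soup.find_all([...])
    for s in soup(['script', 'style']):
        s.decompose()  # remove the tag and everything inside it
    # stripped_strings yields each remaining text node, whitespace-stripped
    return ' '.join(soup.stripped_strings)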
>>> clean_me(testhtml)
'THIS IS AN EXAMPLE I need this text captured And this'
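If you would rather not depend on bs4, the same idea (skip everything inside script and style, then join the stripped text) can be sketched with the standard library's html.parser. This is a different technique from the BeautifulSoup answer above, not a drop-in equivalent. The names clean_me_stdlib, _TextExtractor and sample_html are made up for illustration, and sample_html is only a guess at the kind of input the question's testhtml contained.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # keep only non-empty text that is outside script/style
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def clean_me_stdlib(html):
    """Hypothetical stdlib-only counterpart to clean_me."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


# Assumed sample input; the original testhtml is not shown in the post.
sample_html = """
<html><head><title>THIS IS AN EXAMPLE</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><p>I need this text captured</p><h1>And this</h1></body></html>
"""

print(clean_me_stdlib(sample_html))
```

On this sample it prints the same string as the session above. html.parser is less forgiving of badly broken markup than BeautifulSoup, so for real-world pages the bs4 version is probably the safer choice.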