I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and riddled with Unicode and entity-encoding issues. This is the code I use (with BeautifulSoup 3) to convert HTML to raw text:
import copy
import re

import BeautifulSoup  # BeautifulSoup 3

def getsoup(data, to_unicode=False):
    # Turn non-breaking-space entities into plain spaces before parsing.
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild. Remove if not applicable.
    # Each entry is a (compiled regex, replacement function) pair.
    massage_bad_comments = [
        (re.compile(''), lambda match: ''),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(massage_bad_comments)
    return BeautifulSoup.BeautifulSoup(
        data,
        markupMassage=myNewMassage,
        convertEntities=(BeautifulSoup.BeautifulSoup.ALL_ENTITIES
                         if to_unicode else None))

def remove_html(c):
    return getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""
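If installing BeautifulSoup is not an option, roughly the same idea can be sketched with only the Python 3 standard library. This is not the code above, just a minimal illustration under some assumptions: the names `TextExtractor` and `html_to_text` are made up for this example, the contents of `<script>` and `<style>` are assumed to be unwanted, and text pieces are joined with a single space (mirroring the `separator=u' '` argument above). It will not cope with badly broken markup as well as Beautiful Soup does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text, skipping <script>/<style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse runs of whitespace
        # (str.split() also treats U+00A0, the decoded &nbsp;, as whitespace).
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return the text of `markup`, or "" for empty or missing input."""
    if not markup:
        return ""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

For example, `html_to_text("<p>Fish &amp; chips</p>")` gives `"Fish & chips"`. Collapsing whitespace is a design choice for this sketch; drop the `split()`/`join()` step in `text()` if you need the original spacing preserved.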