I\'d like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I know there are a lot of answers already, but the most elegent and pythonic solution I have found is described, in part, here.
from bs4 import BeautifulSoup
text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))
Based on Fraser's comment, here is more elegant solution:
from bs4 import BeautifulSoup
clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)