I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
I've had good results with Apache Tika. Its purpose is extracting metadata and text from content, so its underlying parser is tuned for that out of the box.
Tika can be run as a server, is trivial to run or deploy in a Docker container, and from there can be accessed via Python bindings.
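As a rough illustration of the server setup, here is a minimal sketch that skips the bindings and talks to the Tika server's REST endpoint directly with only the standard library. It assumes a server is already listening on localhost port 9998 (the default for the apache/tika Docker image); the helper names `build_tika_request` and `extract_text` are made up for this example and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is reachable here, e.g. started with
#   docker run -p 9998:9998 apache/tika
TIKA_ENDPOINT = "http://localhost:9998"


def build_tika_request(html_bytes, endpoint=TIKA_ENDPOINT):
    """Build a PUT request asking the Tika server for plain text.

    Hypothetical helper: it only prepares the request, so it can be
    inspected without a running server.
    """
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/tika",
        data=html_bytes,
        method="PUT",
        headers={
            "Content-Type": "text/html",  # what we are sending
            "Accept": "text/plain",       # what we want back
        },
    )


def extract_text(path, endpoint=TIKA_ENDPOINT):
    """Send an HTML file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), endpoint)
    with urllib.request.urlopen(request) as response:
        # Assumes the server replies in UTF-8.
        return response.read().decode("utf-8")
```

With the server running, `extract_text("page.html")` (the file name is a placeholder) should return the page's text. The `tika` package on PyPI wraps the same server, so it is probably the more convenient route if an extra dependency is acceptable.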