I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
While a lot of people mentioned using regex to strip HTML tags, that approach has a lot of downsides.
For example:
hello world
I love you
Should be parsed to:
Hello world
I love you
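To make the downside concrete, here is a minimal sketch of what a bare tag-stripping regex does. The `sample` string and the `strip_tags_naive` helper are my own hypothetical illustration (the markup of the example above is not preserved here), so treat it as an assumption rather than the exact input:

```python
import html
import re

def strip_tags_naive(markup: str) -> str:
    """Hypothetical helper: delete anything that looks like a tag, nothing else."""
    return re.sub(r"<[^>]*>", "", markup)

# Assumed input for illustration: a paragraph followed by loose text.
sample = "<p>hello&nbsp;world</p>I love you"

naive = strip_tags_naive(sample)
# The entity is left undecoded and the paragraph break is lost.
print(repr(naive))    # 'hello&nbsp;worldI love you'

decoded = html.unescape(naive)
# Decoding helps, but &nbsp; becomes U+00A0, not an ordinary space.
print(repr(decoded))  # 'hello\xa0worldI love you'
```

So stripping tags alone is not enough: entities need decoding, non-breaking spaces need mapping to plain spaces, and block-level closing tags need to become line breaks.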
Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:
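import re
import html

def html2text(htm):
    # Decode HTML entities (e.g. &nbsp;, &amp;) into their characters
    ret = html.unescape(htm)
    # Map typographic characters to plain ASCII equivalents
    ret = ret.translate({
        8209: ord('-'),  # U+2011 non-breaking hyphen
        8220: ord('"'),  # U+201C left double quotation mark
        8221: ord('"'),  # U+201D right double quotation mark
        160: ord(' '),   # U+00A0 non-breaking space
    })
    # Replace every whitespace character (newlines, tabs) with a plain space
    ret = re.sub(r"\s", " ", ret, flags=re.MULTILINE)
    ret = re.sub("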
|
||