Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers

孤独总比滥情好 (original poster)

2020-11-22 04:32
While a lot of people have mentioned using regex to strip HTML tags, that approach has a lot of downsides.

For example:
```
<p>hello world</p>I love you
```
Should be parsed to:
```
hello world
I love you
```
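To make the downside concrete, here is a minimal sketch of the naive approach, assuming a sample where the first sentence is wrapped in a `<p>` element. The sample string and the name `naive_strip` are illustrative only. Deleting every tag outright loses the paragraph boundary, so the two sentences end up glued together:

```python
import re


def naive_strip(htm):
    # Delete every tag outright; block boundaries such as </p> vanish too.
    return re.sub(r"<.*?>", "", htm, flags=re.DOTALL)


sample = "<p>hello world</p>I love you"  # assumed sample input
print(naive_strip(sample))  # the two sentences run together on one line
```

Any fix has to decide which tags mean "start a new line" before the remaining tags are thrown away.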
Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:
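```
import re
import html


def html2text(htm):
    # Decode entities such as &amp; and &nbsp; into real characters.
    ret = html.unescape(htm)
    # Map a few typographic characters to plain ASCII equivalents.
    ret = ret.translate({
        8209: ord('-'),   # non-breaking hyphen
        8220: ord('"'),   # left double quotation mark
        8221: ord('"'),   # right double quotation mark
        160: ord(' '),    # non-breaking space
    })
    # Collapse every whitespace character (including newlines) to a space.
    ret = re.sub(r"\s", " ", ret, flags=re.MULTILINE)
    # Turn tags that end a line into newlines (this tag list is an
    # assumed set; adjust it to your needs).
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
    # Replace every remaining tag with a space.
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    # Squeeze runs of spaces down to one.
    ret = re.sub(r" +", " ", ret)
    return ret
```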
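As a point of comparison, the same idea (treat certain tags as line breaks, drop the rest) can also be sketched with the standard library's `html.parser.HTMLParser` instead of regular expressions. This is only a rough sketch under stated assumptions: the class name `TextExtractor`, the function `html_to_text`, and the `BLOCK_TAGS` set are illustrative choices, not part of the snippet above.

```python
from html.parser import HTMLParser

# Assumed set of tags whose closing tag should end a line; adjust as needed.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True makes the parser decode entities like &amp;.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br /> both arrive here; each becomes one newline.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # A closing block-level tag ends the current line.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags is kept unchanged.
        self.parts.append(data)


def html_to_text(htm):
    parser = TextExtractor()
    parser.feed(htm)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(html_to_text("<p>hello world</p>I love you"))
```

Unlike the regex version, this sketch does not normalize whitespace, and it keeps the contents of `<script>` and `<style>` elements as text, so it would need further filtering for real pages.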
