Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  孤独总比滥情好
    2020-11-22 04:32

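    While a lot of people have suggested using a regex to strip HTML tags, that approach has several downsides: it ignores character entities and loses the line breaks that block-level tags imply.

    For example: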

    hello world

    I love you

    Should be parsed to:

    Hello world
    I love you
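To make the downside concrete, here is a minimal sketch of naive tag stripping. The `sample` markup is an assumed input chosen for illustration (the example's own markup is not shown above), and `strip_tags_naively` is a hypothetical helper name.

```python
import re

# Assumed sample markup, for illustration only.
sample = "<p>hello&nbsp;world</p>I love you"

def strip_tags_naively(markup):
    """Delete anything that looks like a tag and do nothing else."""
    return re.sub(r"<.*?>", "", markup)

naive = strip_tags_naively(sample)
print(repr(naive))  # 'hello&nbsp;worldI love you'
```

The entity is left undecoded and the paragraph boundary disappears, so the two lines run together.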
    

    Here's a snippet I came up with. You can customize it to your specific needs, and it has worked well for me:

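    import re
    import html

    def html2text(htm):
        # Decode character references such as &amp; and &nbsp;.
        ret = html.unescape(htm)
        # Map typographic characters to plain ASCII equivalents.
        ret = ret.translate({
            8209: ord('-'),   # non-breaking hyphen
            8220: ord('"'),   # left double quotation mark
            8221: ord('"'),   # right double quotation mark
            160: ord(' '),    # non-breaking space
        })
        # Turn every whitespace character, newlines included, into a space.
        ret = re.sub(r"\s", " ", ret)
        # Turn line-breaking tags into newlines. Assumption: the original
        # alternatives did not survive, so these five are a reconstruction.
        ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
        # Replace any remaining tag with a space.
        ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
        # Collapse runs of spaces into one.
        ret = re.sub(r" +", " ", ret)
        return ret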
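For comparison with the regex snippet above, here is a minimal sketch of the same idea built on the standard library's `html.parser.HTMLParser`. This is a swapped-in technique, not the answer's method. `TextExtractor`, `extract_text`, and `LINE_BREAK_TAGS` are hypothetical names, and the set of line-breaking tags is an assumption you would likely adjust.

```python
from html.parser import HTMLParser

# Assumed set of tags that should end a line; adjust to your needs.
LINE_BREAK_TAGS = {"br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}

class TextExtractor(HTMLParser):
    """Collects text nodes and records a newline where a line should end."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no closing tag, so break the line when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Block-level elements end the current line when they close.
        if tag in LINE_BREAK_TAGS and tag != "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.parts).replace("\xa0", " ")
    # Collapse whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(line for line in lines if line)

print(extract_text("<p>hello&nbsp;world</p>I love you"))
```

Note that this sketch keeps the contents of `<script>` and `<style>` elements as text; filtering those out would need extra state in the parser.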
