Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers
  •  北海茫月
    2020-11-22 04:18

    What worked best for me is inscriptis.

    https://github.com/weblyzard/inscriptis

    import urllib.request
    from inscriptis import get_text
    
    url = "http://www.informationscience.ch"
    html = urllib.request.urlopen(url).read().decode('utf-8')
    
    text = get_text(html)
    print(text)
    

    The results are really good.
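
If installing a third-party package is not an option, a rough approximation is possible with the standard library alone. This is only a minimal sketch and not how inscriptis works: the `TextExtractor` class and the `html_to_text` helper are hypothetical names made up for illustration, and the sketch assumes it is enough to drop `<script>` and `<style>` contents and put each text fragment on its own line. It makes no attempt to reproduce the page layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []      # text fragments in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:
                self._chunks.append(stripped)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>World &amp; friends</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

Because every text fragment lands on its own line, inline markup such as `<b>` splits a sentence across lines. A layout-aware library like inscriptis is the better choice when the output should resemble what a browser shows.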
