Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  情话喂你
    2020-11-22 04:28

    Another option is to run the HTML through a text-based web browser and dump the rendered output. For example, using Lynx:

    lynx -dump html_to_convert.html > converted_html.txt
    

    The same can be done from within a Python script:

    import subprocess

    # Redirect Lynx's text dump of the HTML file into converted_html.txt
    with open('converted_html.txt', 'w') as outputFile:
        subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)
    

    The result won't be strictly the bare text of the HTML file (Lynx renders layout and, by default, appends a numbered list of link references), but depending on your use case it may be preferable to the output of html2text.
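
If you want the text back as a Python string rather than in a file, the same idea can be sketched roughly as below. This is only a sketch and assumes Lynx is installed and on your PATH. The function name `html_file_to_text` and its `dump_cmd` parameter are made up for illustration, and the `-nolist` flag (which asks Lynx to leave out the trailing link list) is an optional addition that the commands above do not use.

```python
import subprocess

# Assumed default command. "-nolist" is optional: it suppresses the numbered
# list of links that Lynx otherwise appends to a dump.
LYNX_DUMP = ("lynx", "-dump", "-nolist")


def html_file_to_text(html_path, dump_cmd=LYNX_DUMP):
    """Run a text-mode browser on html_path and return what it prints.

    dump_cmd is a hypothetical parameter so that another dumper command can
    be swapped in; the HTML file path is appended as the last argument.
    """
    result = subprocess.run(
        [*dump_cmd, html_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output using the locale's encoding
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `html_file_to_text('html_to_convert.html')` should then return the dump as a string. If Lynx is not installed, `subprocess.run` raises `FileNotFoundError`, so you may want to catch that in real use.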
