Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  情歌与酒
    2020-11-22 04:29

    This isn't exactly a Python solution, but it also captures text that JavaScript would generate, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine and will convert the source to text with the -dump option.

    So you could do something like:

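    import subprocess
    import tempfile

    # html_source is assumed to be a str holding the page's HTML.
    # Write the HTML to a named temp file that Links can open.
    with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
        f.write(html_source)
    proc = subprocess.run(['links', '-dump', f.name],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL)
    text = proc.stdout.decode()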
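
    The same idea can be wrapped in a reusable function that cleans up the temporary file and fails clearly when the external program is missing. This is only a minimal sketch: the function name html_to_text and its command parameter are hypothetical names introduced for illustration, and it assumes Links is installed separately and available on PATH.

```python
import os
import shutil
import subprocess
import tempfile


def html_to_text(html_source, command=("links", "-dump")):
    """Render HTML to text by handing a temp file to an external program.

    `command` is the program plus its flags; the temp file path is appended.
    The default assumes the Links browser is installed and on PATH.
    """
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]!r} was not found on PATH")

    # delete=False lets the external program open the file by name;
    # the file is removed explicitly in the finally block below.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(html_source)
    try:
        result = subprocess.run(
            [*command, f.name],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    finally:
        os.unlink(f.name)
    # Assumption: the program's output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

    Calling html_to_text(html_source) would then return the dumped text. Passing a different command tuple lets you swap in another text-mode browser, provided it accepts a file path as its last argument.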
    
