Extracting text from HTML file using Python


一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers

情话喂你 (original poster)

2020-11-22 04:28
Another option is to run the HTML through a text-based web browser and dump the rendered output. For example, using Lynx:
```
lynx -dump html_to_convert.html > converted_html.txt
```
The same can be done from within a Python script:
```
import subprocess

# Run Lynx and redirect its text dump into the output file.
with open('converted_html.txt', 'w') as output_file:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=output_file)
```
The result is not purely the text of the HTML file (Lynx also emits numbered link markers and a list of references, for example), but depending on your use case it may be preferable to the output of html2text.
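
If you would rather get the text back as a string than write it to a file, something like the following may work. It is only a sketch: `html_to_text` and its `command` parameter are names made up here for illustration, and it assumes Lynx is installed and on your PATH. The `-nolist` flag asks Lynx to leave out the link list it normally appends to a dump.

```python
import shutil
import subprocess


def html_to_text(html_path, command=("lynx", "-dump", "-nolist")):
    """Return the text dump of html_path as a string.

    `command` (a hypothetical parameter) is the program and its flags;
    the path of the HTML file is appended as the last argument.
    """
    # Fail early with a clear message if the browser is not available.
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")

    result = subprocess.run(
        [*command, html_path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Usage would then be `text = html_to_text('html_to_convert.html')`. Passing the command as a parameter makes it easy to try another text-based browser with a similar dump option.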