Extracting text from HTML file using Python


I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers

一整个雨季

2020-11-22 04:26

The comment suggesting LibreOffice Writer has merit, since the application can run Python macros. That approach seems to offer benefits both for answering this question and for extending LibreOffice's macro base. If this is a one-off task rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

庸人自扰

2020-11-22 04:27

PyParsing does a great job. The PyParsing wiki was taken down, so here is another location with examples of PyParsing in use (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, well-organized, and inexpensive O'Reilly Short Cut manual.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with: you can convert the entities before you run BeautifulSoup.
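
To illustrate what decoding entities during text extraction looks like, here is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup (so it is a stand-in, not the BeautifulSoup API). The names `TextExtractor` and `html_to_text` are made up for this example, and the choice to skip `<script>` and `<style>` contents is an assumption about what "text" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; or &eacute; before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(markup):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Caf&eacute; &amp; bar</p><script>var x=1;</script><p>Done</p>"
    "</body></html>"
)
print(html_to_text(sample))
```

Letting the parser decode entities avoids unescaping the raw markup first, which could turn an escaped `&lt;` into a real tag delimiter.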

Good luck

说谎

2020-11-22 04:28
There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide what tags to keep:
```
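from pattern.web import URL, plaintext

# Download the page, then strip the markup, keeping only the listed tags and attributes.
s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)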
```
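
For readers without Pattern installed, the idea behind `keep` (a mapping from tag name to the attributes allowed on that tag) can be sketched with the standard library alone. This is not Pattern's implementation; `KeepTagsStripper` and `strip_tags` are hypothetical names, and the sketch makes no attempt to reproduce Pattern's whitespace handling.

```python
from html import escape
from html.parser import HTMLParser


class KeepTagsStripper(HTMLParser):
    """Drop every tag except those in `keep`, a dict of tag -> allowed attributes."""

    def __init__(self, keep):
        super().__init__(convert_charrefs=True)
        self.keep = keep
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep:
            return  # tag is dropped; its text content is still emitted
        allowed = self.keep[tag]
        parts = []
        for name, value in attrs:
            if name in allowed:
                quoted = escape(value or "", quote=True)
                parts.append(' {}="{}"'.format(name, quoted))
        self.out.append("<{}{}>".format(tag, "".join(parts)))

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is emitted as decoded by the parser (entities already resolved).
        self.out.append(data)


def strip_tags(markup, keep):
    parser = KeepTagsStripper(keep)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


page = '<div><h1 class="t">Title</h1><p>See <a href="http://x.y" id="l">link</a></p></div>'
print(strip_tags(page, keep={"h1": [], "a": ["href"]}))
```

An empty attribute list, as for `h1` here, keeps the tag but discards all of its attributes.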

情话喂你

2020-11-22 04:28
Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):
```
lynx -dump html_to_convert.html > converted_html.txt
```
This can be done within a Python script as follows:
```
import subprocess

# Run lynx and redirect its text dump into the output file.
with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)
```
It won't give you only the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
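
If the text is needed in memory rather than in a file, the same command can be wrapped in a small helper. This is a sketch under assumptions: `dump_with_browser` is a made-up name, the browser is expected to be on `PATH`, and the only flag used is the `-dump` option shown above.

```python
import shutil
import subprocess


def dump_with_browser(html_path, browser="lynx"):
    """Return the text dump of html_path produced by a text-based browser."""
    exe = shutil.which(browser)
    if exe is None:
        # Fail early with a clear message instead of a cryptic OSError.
        raise FileNotFoundError(browser + " is not installed or not on PATH")
    result = subprocess.run(
        [exe, "-dump", html_path],
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    try:
        print(dump_with_browser("html_to_convert.html"))
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("Could not convert:", exc)
```

Checking for the executable first keeps the failure mode readable on machines where the browser is missing.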

情歌与酒

2020-11-22 04:29
This isn't exactly a Python solution, but it also converts the text that JavaScript would generate, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine and will convert source to text with the -dump option.

So you could do something like:
```
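import os
import subprocess
import tempfile

# Write the HTML (html_source, a str) to a temporary file that Links can read.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
proc = subprocess.run(['links', '-dump', f.name],
                      stdout=subprocess.PIPE,
                      stderr=subprocess.DEVNULL)
text = proc.stdout.decode()
os.remove(f.name)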
```

梦如初夏

2020-11-22 04:29

I recommend a Python package called goose-extractor. Goose will try to extract the following information: the main text of an article, the main image of the article, any YouTube/Vimeo movies embedded in the article, the meta description, and the meta tags.

More: https://pypi.python.org/pypi/goose-extractor/
