Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

30 answers

无人及你

2020-11-22 04:17
I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to manually remove unnecessary whitespace. It seems to work much faster than the BeautifulSoup solution.
```
from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='\n')  # a separator keeps text from adjacent nodes apart
    text = " ".join(text.split())  # collapse runs of whitespace into single spaces
    return text
```

北海茫月

2020-11-22 04:18

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

```
import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)
```

The results are really good.

生来不讨喜

2020-11-22 04:19
Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Note that htmllib exists only in Python 2. Anyway, here's something simple that prints the plain text to the console:
```
# Python 2 only: htmllib was removed in Python 3
from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try:
    p.feed('hello<br>there')
    p.close()  # calling close is not usually needed, but let's play it safe
except HTMLParseError:
    print ':('  # the HTML is badly malformed (or you found a bug)
```
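For Python 3, where htmllib is not available, the derived-class idea can be sketched with the standard library's html.parser instead. This is only a rough illustration, not a robust extractor: the names TextExtractor, extract_text, SKIPPED_TAGS and BLOCK_TAGS are made up for this example, and the list of block-level tags is an assumption you would probably want to extend.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}
    # Assumed set of tags that should separate the text around them.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Collapse runs of whitespace (including newlines) into single spaces.
        return " ".join("".join(self._chunks).split())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(extract_text("<p>hello<br>there</p><script>var x = 1;</script>"))
```

Collapsing all whitespace into single spaces loses paragraph breaks, so this is further from "copy and paste from a browser" than the dedicated libraries in the other answers.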

时光说笑

2020-11-22 04:21
I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.
```
from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").find_all(string=True))
```
Update

Based on Fraser's comment, here is a more elegant solution:
```
from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)
```

耶瑟儿～

2020-11-22 04:22
I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.
```
from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text
```
If you already have the HTML files downloaded you can do something like this:
```
article = Article('')
article.set_html(html)
article.parse()
article.text
```
It even has a few NLP features for summarizing the topics of articles:
```
article.nlp()
article.summary
```

星月不相逢

2020-11-22 04:24

html2text is a Python program that does a pretty good job at this.
