Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers

南笙 (original poster)

2020-11-22 04:37
Another example using BeautifulSoup4 in Python 2.7.9+

Imports:
```
import urllib2
from bs4 import BeautifulSoup
```
Code:
```
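def read_website_to_text(url):
    # Fetch the page and parse it with the built-in HTML parser.
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # Remove script and style elements so their contents are not extracted.
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Strip leading and trailing whitespace from each line.
    lines = (line.strip() for line in text.splitlines())
    # Split phrases separated by double spaces into one chunk each.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and rejoin with newlines.
    text = '\n'.join(chunk for chunk in chunks if chunk)
    # Return a UTF-8 encoded byte string (Python 2 str).
    return str(text.encode('utf-8'))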
```
Explained:

Read the URL's content as HTML with BeautifulSoup, remove all script and style elements, and extract only the text with .get_text(). Split the text into lines and strip leading and trailing whitespace from each, then break lines that hold several phrases separated by double spaces into one chunk per line: chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Finally, text = '\n'.join(...) drops the blank chunks, and the result is returned as a UTF-8 encoded string.
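To see what the whitespace cleanup does on its own, here is a minimal sketch applied to a hard-coded string standing in for the output of .get_text(). The function name clean_text and the sample string are hypothetical, made up for illustration only.

```python
def clean_text(text):
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so run-together phrases each get their own chunk.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks (blank lines) and rejoin with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample: a menu row with double-spaced items, blank lines,
# and an indented sentence.
sample = "  Home  About  Contact \n\n\n   Welcome to the site.  \n"
print(clean_text(sample))
```

The menu items end up on separate lines and the blank lines disappear, while single spaces inside a sentence are left alone.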

Notes:
- On some systems this fails for https:// connections because of SSL certificate errors; turning off certificate verification works around that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
- Python < 2.7.9 may have trouble running this.
- text.encode('utf-8') can produce odd-looking output; you may want to return str(text) instead, although in Python 2 str(text) raises UnicodeEncodeError when the text contains non-ASCII characters.
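As a rough sketch of the SSL workaround mentioned in the notes, the standard ssl module can build a context that skips certificate checks. The helper name make_unverified_context is hypothetical, and this assumes a Python version where urlopen accepts a context argument (2.7.9+ for urllib2, or urllib.request in Python 3). Skipping verification removes protection against man-in-the-middle attacks, so it is only reasonable for testing against hosts you trust.

```python
import ssl


def make_unverified_context():
    # Start from the default settings, then switch off certificate checks.
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = make_unverified_context()
# Assumed usage, not executed here:
#   page = urllib2.urlopen(url, context=ctx)
```

Passing the context per call keeps the change local, rather than disabling verification for the whole process.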