Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  南笙 (original poster)
    2020-11-22 04:37

    Another example, using BeautifulSoup 4 on Python 2.7.9+.

    Imports:

    import urllib2
    from bs4 import BeautifulSoup
    

    Code:

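    def read_website_to_text(url):
        # Fetch the page (urllib2 is Python 2 only) and parse it with the built-in parser.
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        # Remove script and style elements so their contents are not returned as text.
        for script in soup(["script", "style"]):
            script.extract()
        text = soup.get_text()
        # Strip leading and trailing whitespace from each line.
        lines = (line.strip() for line in text.splitlines())
        # Split on double spaces so phrases sharing a line end up on separate lines.
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # Drop blank chunks and join the rest with newlines.
        text = '\n'.join(chunk for chunk in chunks if chunk)
        # Encode to a UTF-8 byte string (str in Python 2).
        return str(text.encode('utf-8'))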
    

    Explained:

    Read the URL's data as HTML with BeautifulSoup, remove all script and style elements, and get just the text with .get_text(). Break the text into lines and strip leading and trailing whitespace from each, then split lines that hold several phrases (separated by double spaces) into one chunk per phrase: chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then text = '\n'.join(...) drops the blank chunks and joins the rest, and the result is returned encoded as UTF-8.
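The whitespace cleanup described above does not depend on BeautifulSoup, so it can be tried on a plain string. The following is a minimal sketch; the helper name collapse_whitespace and the sample string are made up for illustration and are not part of the original answer.

```python
def collapse_whitespace(text):
    """Hypothetical helper: the answer's cleanup steps applied to a plain string."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces, so phrases that shared a line
    # become separate chunks.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the remaining ones with newlines.
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample resembling what get_text() might return for a small page.
sample = "  Home  About  \n\n\n   First paragraph of text.   \n"
print(collapse_whitespace(sample))
```

With this sample, "Home" and "About" end up on separate lines and the blank lines disappear, which is the effect the explanation attributes to the chunks and join steps.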

    Notes:

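    • On some systems this fails for https:// URLs with an SSL certificate verification error. Turning off certificate verification works around it, at the cost of no longer authenticating the server. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/

    • Python versions before 2.7.9 may have problems running this.

    • text.encode('utf-8') can produce odd-looking output; you may want to return str(text) instead. Be aware that in Python 2, str(text) raises UnicodeEncodeError when the text contains non-ASCII characters, so returning text unchanged may be safer.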
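For the SSL note, one common way to turn off verification is to build an SSL context with checks disabled, using the standard ssl module. This is only a sketch of that idea and is not necessarily what the linked post does; the function name make_unverified_context is made up here. Disabling verification removes protection against man-in-the-middle attacks, so it is best kept to testing.

```python
import ssl


def make_unverified_context():
    """Hypothetical helper: an SSL context that skips certificate checks."""
    ctx = ssl.create_default_context()
    # Hostname checking has to be switched off first; otherwise setting
    # verify_mode to CERT_NONE raises ValueError.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = make_unverified_context()
```

On Python 2.7.9+ the context could then be passed as urllib2.urlopen(url, context=ctx); on Python 3 the equivalent call is urllib.request.urlopen(url, context=ctx).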
