Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  • 2020-11-22 04:17

    I had a similar question and initially used one of the BeautifulSoup answers. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to manually remove unnecessary whitespace. Still, it seems to be much faster than the BeautifulSoup solution.

    from selectolax.parser import HTMLParser
    
    def get_text_selectolax(html):
        tree = HTMLParser(html)
    
        if tree.body is None:
            return None
    
        for tag in tree.css('script'):
            tag.decompose()
        for tag in tree.css('style'):
            tag.decompose()
    
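        # A non-empty separator keeps text from adjacent nodes from running together.
        text = tree.body.text(separator='\n')
        # Collapse every run of whitespace (including newlines) into a single space.
        text = " ".join(text.split())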
        return text
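
    Joining on single spaces flattens the whole page onto one line. If line breaks should survive, the cleanup step could be sketched as follows, assuming the text was extracted with a newline separator. The helper name normalize_whitespace is hypothetical and not part of selectolax; it uses only the standard library.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace within each line and drop empty lines."""
    lines = []
    for line in text.splitlines():
        # Replace any run of spaces or tabs inside the line with one space.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)
```

    The result keeps one line per non-empty line of input, which is closer to what pasting from a browser into a text editor produces.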
    
  • 2020-11-22 04:18

    What worked best for me is inscriptis.

    https://github.com/weblyzard/inscriptis

    import urllib.request
    from inscriptis import get_text
    
    url = "http://www.informationscience.ch"
    html = urllib.request.urlopen(url).read().decode('utf-8')
    
    text = get_text(html)
    print(text)
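
    The snippet above assumes the page is encoded as UTF-8. For pages that declare another encoding, the decoding step could be sketched like this. The function name decode_html is hypothetical; with urllib, the declared charset can be read from the response via response.headers.get_content_charset().

```python
def decode_html(raw, declared_charset=None):
    """Decode raw response bytes, preferring the charset the server declared."""
    for encoding in (declared_charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown encoding name, or bytes that do not match it: try the next one.
            continue
    # Last resort: substitute replacement characters instead of raising.
    return raw.decode("utf-8", errors="replace")
```

    The decoded string can then be passed to get_text as before.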
    

    The results are really good.

  • 2020-11-22 04:19

    Instead of the HTMLParser module, check out htmllib. It has a similar interface but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Note that htmllib exists only in Python 2; it was removed in Python 3. Anyway, here's something simple that prints the plain text to the console:

    # Python 2 only: htmllib was removed in Python 3.
    from htmllib import HTMLParser, HTMLParseError
    from formatter import AbstractFormatter, DumbWriter
    p = HTMLParser(AbstractFormatter(DumbWriter()))
    try:
        p.feed('hello<br>there')
        p.close()  # calling close is not usually needed, but let's play it safe
    except HTMLParseError:
        print ':('  # the HTML is badly malformed (or you found a bug)
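
    For Python 3, where htmllib is no longer available, roughly the same result can be sketched with the standard library's html.parser module. This is a minimal sketch, not a robust converter: the class name TextExtractor and the function html_to_text are hypothetical names chosen for illustration, and the only special handling is skipping script and style contents and turning <br> into a newline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return parser.get_text()
```

    Calling html_to_text('hello<br>there') returns the two words on separate lines. Block-level tags such as <p> are not given line breaks here, so real pages will usually need more handling than this.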
    
  • 2020-11-22 04:21

    I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

    from bs4 import BeautifulSoup
    
    text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))
    

    Update

    Based on Fraser's comment, here is a more elegant solution:

    from bs4 import BeautifulSoup
    
    # Join with a space so that adjacent stripped strings do not run together.
    clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)
    
  • 2020-11-22 04:22

    I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to extract the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

    from newspaper import Article
    
    article = Article(url)  # url is assumed to be a string holding the article's address
    article.download()
    article.parse()
    print(article.text)
    

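    If you already have the HTML files downloaded, you can do something like this:

    article = Article('')  # empty URL, because the HTML is supplied directly
    article.set_html(html)  # html is assumed to be a string holding the page source
    article.parse()
    print(article.text)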
    

    It even has a few NLP features for summarizing the topics of articles:

    article.nlp()
    print(article.summary)
    
  • 2020-11-22 04:24

    html2text is a Python program that does a pretty good job at this.
