Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers
  •  无人及你
    2020-11-22 04:17

    I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to manually remove unnecessary whitespace. Still, it seems to run much faster than the BeautifulSoup solution.

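    from selectolax.parser import HTMLParser

    def get_text_selectolax(html):
        """Return the visible text of an HTML document, or None if it has no <body>."""
        tree = HTMLParser(html)

        if tree.body is None:
            return None

        # Drop elements whose contents are code or styling, not readable text.
        for tag in tree.css('script'):
            tag.decompose()
        for tag in tree.css('style'):
            tag.decompose()

        # Concatenate all text nodes under <body>.
        text = tree.body.text(separator='')
        # Collapse every run of whitespace (spaces, tabs, newlines) into a single space.
        text = " ".join(text.split())
        return text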
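
For comparison, here is a minimal sketch of the same three steps (skip script and style, collect the text, collapse whitespace) using only the standard library's html.parser, for cases where installing selectolax is not an option. This is a swapped-in technique, not the approach from the answer above, and the names `_TextCollector` and `get_text_stdlib` are hypothetical. Unlike the selectolax version, it collects text from the whole document (including `<title>`) rather than only `<body>`, and I have not benchmarked it.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Join the text nodes, then collapse each whitespace run to one space.
        return " ".join("".join(self._chunks).split())


def get_text_stdlib(html):
    """Hypothetical helper: return whitespace-normalized text of an HTML string."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()


if __name__ == "__main__":
    sample = (
        "<body><p>Hello,\n   world!</p><script>var x = 1;</script> "
        "<p>Bye</p><style>p {color: red}</style></body>"
    )
    print(get_text_stdlib(sample))  # prints: Hello, world! Bye
```

Like the selectolax version with `separator=''`, this joins adjacent text nodes without inserting anything between them, so words only stay separated where the HTML source itself has whitespace between elements.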
    
