Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

30 Answers
  •  死守一世寂寞
    2020-11-22 04:13

    Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and full of Unicode and entity-encoding issues. This is the code I use to convert HTML to raw text:

    import copy
    import re

    import BeautifulSoup  # the Beautiful Soup 3 module, not bs4

    def getsoup(data, to_unicode=False):
        data = data.replace("&nbsp;", " ")
        # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
        masssage_bad_comments = [
            (re.compile(''), lambda match: ''),
        ]
        myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
        myNewMassage.extend(masssage_bad_comments)
        return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
            convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                        if to_unicode else None)
    
    remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""
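
The code above targets the Beautiful Soup 3 API, which only runs on Python 2. For readers on Python 3 without third-party packages, here is a rough sketch of the same idea using the standard library's `html.parser` instead of Beautiful Soup. This is a different technique, not the answer's method, and the names `TextExtractor` and `html_to_text` are made up for illustration. It assumes you want entities decoded, `<script>` and `<style>` contents dropped, and text nodes joined with a single space (similar in spirit to `getText(separator=u' ')`).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the text of `markup` with runs of whitespace collapsed."""
    if not markup:
        return ""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # str.split() with no argument also splits on U+00A0,
    # so a decoded &nbsp; ends up as an ordinary space.
    return " ".join(" ".join(parser.chunks).split())


print(html_to_text("<p>Fish &amp; chips</p><script>var x = 1;</script>"))
```

Note that this flattens everything onto one line, so it will not reproduce the line breaks you would get when copying from a browser. Badly broken markup may also be handled less gracefully than with Beautiful Soup.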
    
