Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  时光说笑
    2020-11-22 04:21

    I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

    from bs4 import BeautifulSoup
    
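    # find_all(string=True) returns every text node, including the contents of <script> and <style> tags.
    # (findAll and text= are the older, deprecated spellings of find_all and string=.)
    text = ''.join(BeautifulSoup(some_html_string, "html.parser").find_all(string=True))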
    

    Update

    Based on Fraser's comment, here is a more elegant solution:

    from bs4 import BeautifulSoup
    
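    # stripped_strings yields each text node with surrounding whitespace removed and skips
    # whitespace-only nodes, so join with ' ' (or '\n'); joining with '' would glue words together.
    clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)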
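
    If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a different technique, not what the answer above uses: the names TextExtractor and html_to_text and the sample markup are made up for illustration, and skipping <script> and <style> is an assumption about what "text copied from a browser" should exclude.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect stripped text nodes, ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: never visible in a browser

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        stripped = data.strip()
        if stripped and self._skip_depth == 0:
            self.chunks.append(stripped)


def html_to_text(html, separator=" "):
    """Return the text of `html`, with text nodes joined by `separator`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><p>Second paragraph</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # prints: Hello world Second paragraph
```

    The separator matters here just as it does when joining stripped_strings: with an empty separator, neighbouring text nodes run together ("HelloworldSecond paragraph").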
    
