Remove all style, scripts, and html tags from an html page

面向向阳花 2020-12-31 07:13

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html)  # create a new bs4 object from the html data loa

5 answers
  •  礼貌的吻别
    2020-12-31 07:43

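    You can use decompose() to remove the script and style tags, together with their contents, from the document entirely, and then use the stripped_strings generator to collect the text that remains, with leading and trailing whitespace trimmed from each piece.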

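    from bs4 import BeautifulSoup

    def clean_me(html):
        # Name the parser explicitly; bs4 warns when it has to guess one
        soup = BeautifulSoup(html, 'html.parser')
        # Calling soup(...) is shorthand for soup.find_all(...)
        for s in soup(['script', 'style']):
            s.decompose()  # drop the tag and everything inside it
        # Join the remaining text fragments, each stripped of whitespace
        return ' '.join(soup.stripped_strings)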
    

    >>> clean_me(testhtml) 
    'THIS IS AN EXAMPLE I need this text captured And this'
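
The testhtml value used above is not shown in the thread, so here is a minimal self-contained sketch of the same approach run against a made-up page. SAMPLE_HTML and its contents are assumptions for illustration only, and the built-in 'html.parser' is chosen so that nothing beyond bs4 itself is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the original testhtml is not shown in the thread.
SAMPLE_HTML = """
<html>
  <head>
    <title>Demo</title>
    <style>p { color: red; }</style>
    <script>var hidden = "not wanted";</script>
  </head>
  <body>
    <h1>Heading</h1>
    <p>First <b>bold</b> paragraph.</p>
  </body>
</html>
"""


def clean_me(html):
    """Return the visible text of an HTML string as one space-joined line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove <script> and <style> elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # stripped_strings yields each remaining text node with surrounding
    # whitespace removed and skips nodes that are whitespace only.
    return " ".join(soup.stripped_strings)


cleaned = clean_me(SAMPLE_HTML)
print(cleaned)
```

Note that the page title is text too, so it appears in the result. If it is unwanted, 'title' can be added to the list of tag names that get decomposed.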
    
