Remove all style, scripts, and html tags from an html page

面向向阳花 2020-12-31 07:13

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loa         


        
5 answers
  • 2020-12-31 07:25

    It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

    def cleanMe(html):
        soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
        for script in soup(["script", "style"]): # remove all javascript and stylesheet code
            script.extract()
        # get text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text
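
    To see what the whitespace clean-up in the second half of the function does on its own, here is a minimal sketch using only built-in string methods. The name normalize_text and the sample string are made up for illustration; they are not part of the answer above.

```python
def normalize_text(text):
    """Hypothetical helper: the same three clean-up steps, without bs4."""
    # strip leading and trailing space on each line
    lines = (line.strip() for line in text.splitlines())
    # split runs separated by two spaces into separate chunks
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks and put each remaining chunk on its own line
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Title  \n\n  first  item    second item \n"
print(normalize_text(sample))
```

    Note that single spaces inside a phrase survive; only runs of two or more spaces cause a split.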
    
  • 2020-12-31 07:33

    This removes the specified tags and all HTML comments cleanly. Thanks to Kim Hyesung for this code.

    from bs4 import BeautifulSoup
    from bs4 import Comment
    
    def cleanMe(html):
        soup = BeautifulSoup(html, "html5lib")  # requires: pip install html5lib
        # drop script, style, meta, and noscript elements
        for tag in soup.find_all(['script', 'style', 'meta', 'noscript']):
            tag.extract()
        # drop HTML comments (string= replaces the older text= argument, bs4 4.4.0+)
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        return soup  # a BeautifulSoup object; call soup.get_text() for plain text
    
  • 2020-12-31 07:36

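    If you want a quick and dirty solution you can use:

    import re
    re.sub(r'<[^>]*?>', '', value)
    

    This is a rough equivalent of strip_tags in PHP: it removes the tags themselves but leaves the contents of script and style elements in place. Is that what you want?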
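
    A small sketch of what the regex does and does not remove. The helper names and the second pattern (BLOCK_RE) are assumptions added for illustration, not part of the answer, and regexes remain fragile on real-world HTML.

```python
import re

# The pattern from the answer: matches any single tag.
TAG_RE = re.compile(r"<[^>]*?>")
# Hypothetical extra pattern: matches a whole <script>...</script> or
# <style>...</style> block, including its contents.
BLOCK_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                      re.IGNORECASE | re.DOTALL)


def strip_tags_naive(value):
    """Remove tags only; script and style contents stay in the output."""
    return TAG_RE.sub("", value)


def strip_tags_with_blocks(value):
    """Remove script/style blocks first, then the remaining tags."""
    return TAG_RE.sub("", BLOCK_RE.sub("", value))


sample = "<p>Hello</p><script>var x = 1;</script><style>p {color: red;}</style>"
print(strip_tags_naive(sample))
print(strip_tags_with_blocks(sample))
```

    The first call still prints the JavaScript and CSS source as text, which is why the other answers drop script and style elements before extracting text.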

  • 2020-12-31 07:43

    You can use decompose to completely remove the tags from the document, and the stripped_strings generator to retrieve the remaining text content.

    def clean_me(html):
        soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's parser warning
        for s in soup(['script', 'style']):
            s.decompose()
        return ' '.join(soup.stripped_strings)
    

    >>> clean_me(testhtml) 
    'THIS IS AN EXAMPLE I need this text captured And this'
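
    If installing BeautifulSoup is not an option, roughly the same idea (skip script and style, then join the stripped strings) can be sketched with the standard library's html.parser. The names TextOnlyParser and clean_me_stdlib are hypothetical, and since this does not build a document tree it only covers this simple case; treat it as a sketch rather than a drop-in replacement.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects text, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def clean_me_stdlib(html):
    """Hypothetical stdlib counterpart of clean_me above."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    # strip each text run and drop the empty ones, like stripped_strings
    stripped = (chunk.strip() for chunk in parser.chunks)
    return " ".join(chunk for chunk in stripped if chunk)
```

    Calling clean_me_stdlib(html) on a page string returns the visible text as a single space-separated line.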
    
  • 2020-12-31 07:46

    Using lxml instead:

    # Requirements: pip install lxml
    # (lxml 5.2+ ships the cleaner separately: pip install lxml_html_clean)
    
    import lxml.html.clean
    
    
    def cleanme(content):
        cleaner = lxml.html.clean.Cleaner(
            allow_tags=[''],
            remove_unknown_tags=False,
            style=True,
        )
        html = lxml.html.document_fromstring(content)
        html_clean = cleaner.clean_html(html)
        return html_clean.text_content().strip()
    
    testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
    cleaned = cleanme(testhtml)
    print(cleaned)
    