Remove all JavaScript tags and style tags from HTML with Python and the lxml module

南笙 2020-12-23 12:11

I am parsing an HTML document using the http://lxml.de/ library. So far I have figured out how to strip tags from an HTML document: In lxml, how do I remove a tag but retain

4 answers
  • 2020-12-23 12:28

    Here are some examples of how to remove different types of HTML elements from an XML/HTML document.

    KEY SUGGESTION: It's helpful to NOT depend on external libraries and to do everything within "native Python 2/3 code".

    Here are some examples of how to do this with "native" python...

    import re

    # `text` is assumed to already hold the HTML source as a string.
    flags = re.IGNORECASE | re.MULTILINE | re.DOTALL

    # (REMOVE <SCRIPT> to </script> and variations)
    pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # .*? lazily matches any characters
    text = re.sub(pattern, '', text, flags=flags)

    # (REMOVE HTML <STYLE> to </style> and variations)
    pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'
    text = re.sub(pattern, '', text, flags=flags)

    # (REMOVE HTML <META ...> tags and variations; <meta> has no closing tag)
    pattern = r'<[ ]*meta.*?>'
    text = re.sub(pattern, '', text, flags=flags)

    # (REMOVE HTML COMMENTS <!-- to --> and variations)
    pattern = r'<[ ]*!--.*?--[ ]*>'
    text = re.sub(pattern, '', text, flags=flags)

    # (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
    pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'
    text = re.sub(pattern, '', text, flags=flags)
    

    NOTE:

    re.IGNORECASE # matches tag names case-insensitively: <script>, <SCRIPT> or <Script>
    re.MULTILINE # only changes how ^ and $ behave; these patterns use neither, so it is harmless but not required
    re.DOTALL # makes "." match newline characters too, so a match can span several lines
    

    I've tested this on several different HTML files; it runs fast and works across newlines.

    NOTE: It also does NOT depend on beautifulsoup or any other external downloaded library!
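
A minimal sketch of how the five substitutions above could be wrapped into one reusable function. The names `strip_with_regex`, `PATTERNS`, `FLAGS` and the `sample` document are made up for illustration and are not part of the original answer. Keep in mind that regular expressions will not cope with every piece of malformed markup, so treat this as a quick filter rather than a full HTML parser.

```python
import re

# The five patterns from the answer above, applied in order.
PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',  # <script> ... </script>
    r'<[ ]*style.*?\/[ ]*style[ ]*>',    # <style> ... </style>
    r'<[ ]*meta.*?>',                    # <meta ...>
    r'<[ ]*!--.*?--[ ]*>',               # <!-- ... -->
    r'<[ ]*\![ ]*DOCTYPE.*?>',           # <!DOCTYPE ...>
]
FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL


def strip_with_regex(text):
    """Hypothetical helper: apply every pattern in turn and return the result."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text, flags=FLAGS)
    return text


# Made-up sample document, used only for illustration.
sample = """<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<STYLE>
body { color: red; }
</STYLE>
<script type="text/javascript">
alert("hi");
</script>
</head><body><!-- a comment --><p>Hello</p></body></html>"""

cleaned = strip_with_regex(sample)
print(cleaned)
```

The paragraph `<p>Hello</p>` survives, while the doctype, meta tag, style block, script block and comment are removed.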

    Hope this helps!

    :)

  • 2020-12-23 12:31

    Below is an example to do what you want. For an HTML document, Cleaner is a better general solution to the problem than using strip_elements, because in cases like this you want to strip out more than just the <script> tag; you also want to get rid of things like onclick=function() attributes on other tags.

    #!/usr/bin/env python

    import lxml.html
    # In lxml 5.2 and later this module is provided by the separate
    # lxml_html_clean package, which must be installed for the import to work.
    from lxml.html.clean import Cleaner

    cleaner = Cleaner()
    cleaner.javascript = True # This is True because we want to activate the javascript filter
    cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

    # Parse the page once; clean_html() works on a copy and leaves `tree` unchanged.
    tree = lxml.html.parse('http://www.google.com')

    print("WITH JAVASCRIPT & STYLES")
    print(lxml.html.tostring(tree))
    print("WITHOUT JAVASCRIPT & STYLES")
    print(lxml.html.tostring(cleaner.clean_html(tree)))
    

    You can get a list of the options you can set in the lxml.html.clean.Cleaner documentation; some options are simple True/False switches (the default differs per option, e.g. javascript defaults to True and style to False) and others take a list like:

    cleaner.kill_tags = ['a', 'h1']
    cleaner.remove_tags = ['p']
    

    Note the difference between kill_tags and remove_tags:

    remove_tags:
      A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
    kill_tags:
      A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
    allow_tags:
      A list of tags to include (default include all).
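
To make the kill/remove distinction concrete, here is a small sketch that runs a Cleaner over an inline string instead of a downloaded page. The sample markup and the variable names are made up for illustration, and the fallback import is an assumption about which package is installed (lxml_html_clean for lxml 5.2 and later, lxml.html.clean for older versions).

```python
try:
    # lxml 5.2 and later: the cleaner lives in its own package.
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml versions ship it as lxml.html.clean.
    from lxml.html.clean import Cleaner

# Made-up sample markup, used only for illustration.
html = (
    '<div><h1>Title</h1><p>Hello <b>world</b></p>'
    '<a href="#" onclick="evil()">link</a>'
    '<script>alert(1)</script></div>'
)

cleaner = Cleaner(
    javascript=True,     # drop scripts and on* event attributes
    style=True,          # drop <style> elements and style attributes
    kill_tags=['h1'],    # remove <h1> together with its content
    remove_tags=['p'],   # remove only the <p> tag, keep what is inside it
)

# Passing a string to clean_html() returns a cleaned string.
cleaned = cleaner.clean_html(html)
print(cleaned)
```

The heading text disappears with its tag (kill), the paragraph text stays but loses its `<p>` wrapper (remove), and both the script and the `onclick` attribute are gone.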
    
  • 2020-12-23 12:39

    You can use the strip_elements method to remove scripts, then use strip_tags method to remove other tags:

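    from lxml import etree
    # fragment: an already parsed lxml element or tree
    etree.strip_elements(fragment, 'script')  # drops <script> elements together with their content
    etree.strip_tags(fragment, 'a', 'p') # and other tags that you want to remove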
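
A self-contained sketch of the two calls on a made-up fragment (the markup and variable names are illustrative only). One detail worth knowing: strip_elements also removes the text that follows the element (its tail) unless with_tail=False is passed, so the sketch sets that flag to keep the trailing text.

```python
import lxml.html
from lxml import etree

# Made-up fragment: a script followed by tail text we want to keep.
fragment = lxml.html.fromstring(
    '<div><p>Keep <a href="#">this text</a>.</p>'
    '<script>alert(1)</script> tail text</div>'
)

# Remove <script> elements with their content, but keep the text after them.
etree.strip_elements(fragment, 'script', with_tail=False)

# Remove the <a> and <p> tags only; their text is merged into the parent.
etree.strip_tags(fragment, 'a', 'p')

result = lxml.html.tostring(fragment, encoding='unicode')
print(result)
```

Both functions modify the tree in place and return nothing, which is why the fragment is serialized afterwards to inspect the result.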
    
  • 2020-12-23 12:40

    You can also use the bs4 library for this purpose.

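    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_src, "lxml")  # html_src holds the HTML source
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()  # drop the tag together with its contents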
    