Remove all JavaScript tags and style tags from HTML with Python and the lxml module

南笙 2020-12-23 12:11

I am parsing an HTML document using the lxml library (http://lxml.de/). So far I have figured out how to strip tags from an HTML document: In lxml, how do I remove a tag but retain

4 answers
  •  时光说笑
    2020-12-23 12:28

    Here are some examples of how to find and remove different types of HTML elements from an XML/HTML document.

    KEY SUGGESTION: It's helpful NOT to depend on external libraries and to do everything in "native" Python 2/3 code.

    Here are some examples of how to do this with "native" Python, using only the re module:

    import re

    # text is assumed to hold the HTML source as a string

    # (REMOVE <script> to </script> and variations)
    pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match any char zero or more times
    text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    # (REMOVE HTML <style> to </style> and variations)
    pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match any char zero or more times
    text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    # (REMOVE HTML <meta> tags and variations)
    pattern = r'<[ ]*meta.*?>'  # match any char zero or more times
    text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    # (REMOVE HTML COMMENTS <!-- to --> and variations)
    pattern = r'<[ ]*!--.*?--[ ]*>'  # match any char zero or more times
    text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    # (REMOVE HTML DOCTYPE <!DOCTYPE> and variations)
    pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # match any char zero or more times
    text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
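
The individual substitutions above can be combined into one helper and tried on a small document. This is only a minimal sketch: the function name `strip_non_content`, the `PATTERNS` list, and the sample HTML are made up for illustration, and regex-based stripping like this is best-effort rather than a full HTML parser.

```python
import re

# Same flags as in the answer: ignore case, and let "." also match newlines.
# re.MULTILINE only changes how ^ and $ behave, so it has no effect here.
FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

# The patterns from the answer, applied in the same order.
PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',  # <script> ... </script>
    r'<[ ]*style.*?\/[ ]*style[ ]*>',    # <style> ... </style>
    r'<[ ]*meta.*?>',                    # <meta ...>
    r'<[ ]*!--.*?--[ ]*>',               # <!-- ... -->
    r'<[ ]*\![ ]*DOCTYPE.*?>',           # <!DOCTYPE ...>
]


def strip_non_content(text):
    """Hypothetical helper: apply every pattern in turn and return the result."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text, flags=FLAGS)
    return text


# Illustrative input only.
sample = """<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<STYLE>body { color: red; }</STYLE>
<script type="text/javascript">
  alert("hi");
</script>
</head><body><!-- note --><p>Hello</p></body></html>"""

cleaned = strip_non_content(sample)
print(cleaned)
```

The surrounding markup such as `<p>Hello</p>` is left in place; only the script, style, meta, comment, and doctype spans are dropped.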
    

    NOTE:

    re.IGNORECASE # is needed so that tag names match regardless of case (case-insensitive matching)
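
To see what the flags change in practice, the short check below runs the script pattern from the answer with and without them. The sample string and variable names are made up for illustration; it is a sketch of the flag behavior, not part of the original answer.

```python
import re

# Illustrative input: an upper-case tag whose body spans several lines.
html = '<SCRIPT>\nvar x = 1;\n</SCRIPT><p>kept</p>'
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'

# Without flags, "script" does not match "SCRIPT", so nothing is removed.
no_flags = re.sub(pattern, '', html)

# With IGNORECASE alone, "." still stops at newlines, so the multi-line
# block is not matched either.
only_ignorecase = re.sub(pattern, '', html, flags=re.IGNORECASE)

# With IGNORECASE and DOTALL together, the whole block is removed.
both = re.sub(pattern, '', html, flags=re.IGNORECASE | re.DOTALL)

print(repr(no_flags))
print(repr(only_ignorecase))
print(repr(both))
```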
    
                                     
                  