I am parsing an HTML document using the http://lxml.de/ library. So far I have figured out how to strip tags from an HTML document. In lxml, how do I remove a tag but retain
Here are some examples of how to remove different types of HTML elements from an XML/HTML document.
KEY SUGGESTION: It's helpful NOT to depend on external libraries and to do everything in "native" Python 2/3 code.
Here are some examples of how to do this with "native" Python, using only the re module from the standard library...
import re

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>' # .*? lazily matches any characters, zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>' # .*? lazily matches any characters, zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML <META ...> tags and variations)
pattern = r'<[ ]*meta.*?>' # .*? lazily matches any characters, zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>' # .*? lazily matches any characters, zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
# (REMOVE HTML DOCTYPE <!DOCTYPE ... > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>' # .*? lazily matches any characters, zero or more times
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
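For convenience, the substitutions above can be wrapped in one function. This is only a minimal sketch: the names `strip_html_noise`, `PATTERNS`, `FLAGS` and the sample document are hypothetical and used purely for illustration; the patterns and flags themselves are the ones from the snippets above.

```python
import re

# Same flags as in the snippets above.
FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

# Patterns copied from the snippets above, applied in this order.
PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',  # <script> ... </script>
    r'<[ ]*style.*?\/[ ]*style[ ]*>',    # <style> ... </style>
    r'<[ ]*meta.*?>',                    # <meta ...>
    r'<[ ]*!--.*?--[ ]*>',               # <!-- ... -->
    r'<[ ]*\![ ]*DOCTYPE.*?>',           # <!DOCTYPE ...>
]


def strip_html_noise(text):
    """Hypothetical helper: apply each pattern in turn, return the cleaned text."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text, flags=FLAGS)
    return text


# Made-up sample document, for illustration only.
sample = (
    '<!DOCTYPE html>\n'
    '<html><head><META charset="utf-8">\n'
    '<style>\nbody { color: red; }\n</style>\n'
    '<SCRIPT type="text/javascript">\nalert("hi");\n</SCRIPT></head>\n'
    '<body><!-- a comment --><p>Hello</p></body></html>'
)

cleaned = strip_html_noise(sample)
print(cleaned)
```

Keep in mind that regular expressions do not really parse HTML, so this kind of cleanup is best-effort. For example, a script whose source text itself contains "/script>" would be cut off early.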
NOTE:
re.IGNORECASE # is needed to match case-insensitively (e.g. <SCRIPT> as well as <script>)
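To see what the flags actually change, here is a small sketch that runs the script pattern against a made-up input with different flag combinations. The variable names and the sample string are just assumptions for the demo. As far as I can tell, re.MULTILINE only changes how ^ and $ behave, so it should make no difference for these patterns; re.DOTALL is what lets "." run across line breaks.

```python
import re

pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'

# Made-up input: uppercase tag name, body spread over several lines.
html = '<p>before</p><SCRIPT>\nvar x = 1;\n</SCRIPT><p>after</p>'

# No flags: the lowercase pattern does not match the uppercase tag,
# so the text comes back unchanged.
no_flags = re.sub(pattern, '', html)

# IGNORECASE only: the tag name now matches, but '.' cannot cross the
# newlines inside the block, so the text is still unchanged.
case_only = re.sub(pattern, '', html, flags=re.IGNORECASE)

# IGNORECASE and DOTALL together: the whole multi-line block is removed.
both = re.sub(pattern, '', html, flags=re.IGNORECASE | re.DOTALL)

print(no_flags == html)
print(case_only == html)
print(both)
```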