I want to remove HTML comments from a piece of HTML text:
heading
some text <!-- con --> more text
re.sub("(?s)<!--.+?-->", "", s)
or
re.sub("<!--.+?-->", "", s, flags=re.DOTALL)
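To see what the DOTALL flag changes, here is a minimal sketch. The sample string s is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample: one single-line comment and one multi-line comment.
s = "heading\nsome text <!-- con --> more text\n<!-- spans\ntwo lines -->\nend"

# Without DOTALL, "." stops at a newline, so the multi-line comment survives.
without_flag = re.sub(r"<!--.+?-->", "", s)

# With DOTALL, "." also matches newlines and both comments are removed.
with_flag = re.sub(r"<!--.+?-->", "", s, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```

Note that .+? requires at least one character, so an empty comment such as <!----> would not be matched; .*? covers that case.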
You could try this regex <![^<]*>
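A quick sketch of how this pattern behaves, using invented sample strings. It handles a simple comment, but it also matches anything else that starts with <! (such as a doctype), and it misses comments that contain a < character.

```python
import re

pattern = r"<![^<]*>"

# A simple comment is removed as expected.
simple = re.sub(pattern, "", "some text <!-- con --> more text")

# A doctype declaration also starts with "<!", so it is removed too.
doctype = re.sub(pattern, "", "<!DOCTYPE html><p>hi</p>")

# A "<" inside the comment stops [^<]* early, so nothing matches.
nested = re.sub(pattern, "", "x <!-- a < b --> y")

print(repr(simple))
print(repr(doctype))
print(repr(nested))
```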
Don't use regex. Use an XML parser instead; the one in the standard library is more than sufficient.
from xml.etree import ElementTree as ET
html = ET.parse("comments.html") # The default parser discards comments
ET.dump(html) # Dumps to stdout
html.write("no-comments.html", method="html") # Write to a file
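As a rough sketch of the same idea on an in-memory string: the standard-library parser ignores comments by default, so parsing and re-serializing is enough. This assumes the input is well-formed XML (for example XHTML); ordinary HTML with unclosed tags will raise a ParseError. The sample markup is invented.

```python
from xml.etree import ElementTree as ET

# Hypothetical, well-formed sample; real-world HTML is often not valid XML.
source = (
    "<html><body><h1>heading</h1>"
    "<p>some text <!-- con --> more text</p></body></html>"
)

# The default parser ignores comments while building the tree.
root = ET.fromstring(source)

# Serialize back to a string using the HTML output method.
cleaned = ET.tostring(root, encoding="unicode", method="html")
print(cleaned)
```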
Finally came up with this option:
re.sub("(<!--.*?-->)", "", t)
Adding the ? makes the search non-greedy, so it does not combine multiple comment tags into one match.
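The difference is easiest to see with two comments on one line. A small sketch, using a made-up string t:

```python
import re

# Hypothetical sample with two separate comments on one line.
t = "a <!-- first --> b <!-- second --> c"

# Greedy: ".*" runs to the last "-->", swallowing the text between the comments.
greedy = re.sub(r"<!--.*-->", "", t)

# Non-greedy: ".*?" stops at the first "-->", so each comment is removed on its own.
lazy = re.sub(r"<!--.*?-->", "", t)

print(repr(greedy))  # 'a  c'
print(repr(lazy))    # 'a  b  c'
```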
html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)
re.sub finds every match of the pattern and replaces it with the second argument. In this case, <!--(.|\s|\n)*?--> matches anything that starts with <!-- and ends with -->. The dot matches any character except a newline, *? repeats the group as few times as possible, and \s and \n cover multi-line comments.
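You shouldn't ignore carriage returns and newlines. Without re.DOTALL the dot does not match a newline, so multi-line comments are left in place.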
re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)