How to remove HTML comments using Regex in Python

眼角桃花 2020-12-19 11:23

I want to remove HTML comments from a piece of HTML text:

heading

some text <-- con --> more text
6 answers
  • 2020-12-19 11:53
    re.sub("(?s)<!--.+?-->", "", s)
    

    or

    re.sub("<!--.+?-->", "", s, flags=re.DOTALL)
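
As a quick check of the two spellings above, the sketch below runs both on a made-up sample string (the variable names are illustrative only). It also shows one detail worth knowing: `.+?` needs at least one character, so an empty comment `<!---->` is left in place, whereas `.*?` would remove it. Whether empty comments should be removed is an assumption about the intended behaviour.

```python
import re

# Made-up sample with a comment that spans two lines.
sample = "<p>keep</p><!-- first\nsecond line --><p>also keep</p>"

# Inline flag (?s) at the start of the pattern ...
inline_flag = re.sub(r"(?s)<!--.+?-->", "", sample)
# ... behaves the same as passing flags=re.DOTALL.
flag_argument = re.sub(r"<!--.+?-->", "", sample, flags=re.DOTALL)

# .+? requires at least one character, so an empty comment survives;
# .*? (assumed here to be the desired behaviour) removes it as well.
empty_kept = re.sub(r"(?s)<!--.+?-->", "", "a<!---->b")
empty_removed = re.sub(r"(?s)<!--.*?-->", "", "a<!---->b")

print(inline_flag)
print(empty_kept, empty_removed)
```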
    
  • 2020-12-19 11:56

    You could try this regex: <![^<]*>
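
To see what this pattern actually matches, here is a small sketch on made-up inputs. It suggests the pattern is broader than "comments only": anything starting with `<!` qualifies, including a doctype, and because `[^<]*` is greedy the match can run past the end of the comment up to the last `>` before the next `<`.

```python
import re

pattern = re.compile(r"<![^<]*>")

# A simple comment is removed as intended.
simple = pattern.sub("", "<p>a <!-- c --> b</p>")

# A doctype also starts with "<!" and is removed too.
doctype = pattern.sub("", "<!DOCTYPE html><p>text</p>")

# [^<]* is greedy, so the match runs to the last ">" before the next "<".
overmatch = pattern.sub("", "x <!-- c --> 1 > 0 <b>y</b>")

# A comment that contains "<" is not matched at all.
unmatched = pattern.sub("", "<!-- a < b -->")

for result in (simple, doctype, overmatch, unmatched):
    print(repr(result))
```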

  • 2020-12-19 11:59

    Don't use regex. Use an XML parser instead; the one in the standard library is more than sufficient, provided the document is well-formed XML (XHTML).

    from xml.etree import ElementTree as ET
    html = ET.parse("comments.html")  # the default parser discards comments
    ET.dump(html)  # Dumps to stdout
    html.write("no-comments.html", method="html")  # Write to a file
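
The same idea can be tried without any files. The sketch below parses a made-up, well-formed snippet from a string and serializes it again; the sample markup is an assumption, and ordinary HTML that is not valid XML would raise a parse error here.

```python
from xml.etree import ElementTree as ET

# Hypothetical input; it must be well-formed XML for ElementTree to parse it.
source = "<html><body><!-- remove me --><p>text</p></body></html>"

# The default parser does not keep comments in the tree.
root = ET.fromstring(source)

# Serialize back to a string using the HTML output method.
cleaned = ET.tostring(root, encoding="unicode", method="html")
print(cleaned)
```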
    
  • 2020-12-19 12:03

    Finally came up with this option:

    re.sub("(<!--.*?-->)", "", t)

    Adding the ? makes the match non-greedy, so separate comments are not merged into a single match.
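
The difference between the greedy and the non-greedy form can be seen on a made-up string with two comments (the sample text and variable names are illustrative):

```python
import re

t = "a <!-- one --> b <!-- two --> c"

# Greedy: .* runs to the last "-->", swallowing the text between the comments.
greedy = re.sub("<!--.*-->", "", t)

# Non-greedy: .*? stops at the first "-->", so each comment is removed separately.
lazy = re.sub("<!--.*?-->", "", t)

print(repr(greedy))
print(repr(lazy))
```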

  • 2020-12-19 12:10
    html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)
    

    re.sub finds every match of the pattern and replaces it with the second argument. Here, <!--(.|\s|\n)*?--> matches anything that starts with <!-- and ends with -->. The dot matches any character except a newline, *? repeats it as few times as possible, and the \s and \n alternatives cover multi-line comments.

  • 2020-12-19 12:12

    Don't forget line breaks: without re.DOTALL, the dot does not match a newline, so comments that span several lines are left in place.

    re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
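
A short sketch of why the flag matters, using a made-up comment that contains a Windows-style line break (`\r\n`); the names are illustrative only:

```python
import re

s = "before<!-- first line\r\nsecond line -->after"

# Without DOTALL the dot cannot cross "\n", so nothing matches.
without_flag = re.sub("(<!--.*?-->)", "", s)

# With DOTALL the dot matches every character, including "\n".
with_flag = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```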
    