C#: How to delete XML/HTML comments with a regular expression

说谎 2020-12-09 04:17

The fragment below doesn't work for me.

fragment = Regex.Replace(fragment, "<!--.*?-->", String.Empty, RegexOptions.Multiline);
4 Answers
  • 2020-12-09 05:10

    This one works for me:

    <!--(\n|.)*?-->
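
    Patterns of this shape are sensitive to greediness: with a plain * the match runs from the first <!-- to the last --> in the input, so any markup sitting between two comments is deleted as well. The sketch below compares the two quantifiers. The class and method names are hypothetical and only serve the illustration.

    ```csharp
    using System.Text.RegularExpressions;

    public static class GreedyVersusLazy
    {
        // Greedy: (\n|.)* grabs as much as possible, then backtracks to the LAST "-->".
        public static string StripGreedy(string input)
        {
            return Regex.Replace(input, @"<!--(\n|.)*-->", string.Empty);
        }

        // Lazy: (\n|.)*? stops at the FIRST "-->", so each comment is matched on its own.
        public static string StripLazy(string input)
        {
            return Regex.Replace(input, @"<!--(\n|.)*?-->", string.Empty);
        }
    }
    ```

    For the input "<a/><!-- 1 --><b/><!-- 2\n3 --><c/>", StripGreedy returns "<a/><c/>" (the <b/> element is lost), while StripLazy returns "<a/><b/><c/>".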
    

    That said, for XML you could load the text into a regular XML document, and for HTML you could use HtmlAgilityPack. I strongly recommend against parsing markup with regular expressions.
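
    For the XML case, a minimal sketch using LINQ to XML (System.Xml.Linq) might look like the following. It assumes the input is well-formed XML, since XDocument.Parse throws on anything else. The XmlCommentStripper class and its StripComments method are hypothetical names introduced for illustration.

    ```csharp
    using System.Linq;
    using System.Xml.Linq;

    public static class XmlCommentStripper
    {
        // Parses the XML, removes every comment node, and returns the serialized result.
        public static string StripComments(string xml)
        {
            XDocument doc = XDocument.Parse(xml);

            // DescendantNodes() walks the whole tree, including comments that sit
            // outside the root element; Remove() detaches each one from its parent.
            doc.DescendantNodes().OfType<XComment>().Remove();

            // DisableFormatting keeps the serializer from adding indentation.
            return doc.ToString(SaveOptions.DisableFormatting);
        }
    }
    ```

    Because the parser already knows where a comment starts and ends, multi-line comments need no special handling here: StripComments("<root><!-- a\nb --><item>1</item></root>") returns "<root><item>1</item></root>".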

  • 2020-12-09 05:13

    Please don't use regular expressions to work with markup languages - you need to use a better tool that is built for that kind of job.

    Use the Html Agility Pack instead. I even found this article in which a reader (named Simon Mourier) comments with a function that uses the Html Agility Pack to remove comments from a document:

    Simon Mourier said:

    This is a sample code to remove comments:

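    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load("filewithcomments.htm");
            doc.Save(Console.Out); // show before
            RemoveComments(doc.DocumentNode);
            doc.Save(Console.Out); // show after
        }

        // Recursively removes every comment node below the given node.
        static void RemoveComments(HtmlNode node)
        {
            if (!node.HasChildNodes)
            {
                return;
            }

            // Remove comment children in place; step the index back after each
            // removal so the node that shifted into slot i is not skipped.
            for (int i = 0; i < node.ChildNodes.Count; i++)
            {
                if (node.ChildNodes[i].NodeType == HtmlNodeType.Comment)
                {
                    node.ChildNodes.RemoveAt(i);
                    --i;
                }
            }

            // Recurse into the remaining (non-comment) children.
            foreach (HtmlNode subNode in node.ChildNodes)
            {
                RemoveComments(subNode);
            }
        }
    }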
    
  • 2020-12-09 05:19

    Change it to RegexOptions.Singleline instead and it'll work just fine. When not in Singleline mode, the dot matches any character except newline, so the pattern can never match a comment that spans more than one line.

    Note that Singleline and Multiline are not mutually exclusive. They do two separate things. To quote MSDN:

    Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.

    Single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).

    Other people have already suggested the HTML Agility Pack. I just felt you should have an explanation of why your regex wouldn't work :)
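
    The difference between the two options can be checked with a small sketch. The pattern <!--.*?--> is assumed here as a typical comment-matching expression, and RegexCommentStripper and Strip are hypothetical names used only for this illustration.

    ```csharp
    using System.Text.RegularExpressions;

    public static class RegexCommentStripper
    {
        // Assumed pattern: a lazy match between the comment delimiters.
        private const string CommentPattern = "<!--.*?-->";

        // Removes comments using whichever RegexOptions the caller passes in.
        public static string Strip(string input, RegexOptions options)
        {
            return Regex.Replace(input, CommentPattern, string.Empty, options);
        }
    }
    ```

    Given the input "<p>a</p><!-- one\ntwo --><p>b</p>", calling Strip with RegexOptions.Multiline returns the input unchanged, because the dot cannot cross the \n inside the comment. With RegexOptions.Singleline the result is "<p>a</p><p>b</p>". Passing RegexOptions.Singleline | RegexOptions.Multiline gives the same result, since the two flags affect different parts of the pattern syntax.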

  • 2020-12-09 05:21

    This is the top Google result for stripping comments via C#, and here's my HtmlAgilityPack code for doing this.

            HtmlDocument doc = new HtmlDocument
                               {
                                   OptionFixNestedTags = true,
                                   OptionOutputAsXml = true
                               };
            doc.LoadHtml(str);
    
            // Strip comments from the document.
            if (doc.DocumentNode != null)
            {
                HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//comment()");
                if (nodes != null)
                {
                    foreach (HtmlNode node in from cmt in nodes
                                              where (cmt != null
                                                     && cmt.InnerText != null
                                                     && !cmt.InnerText.ToUpper().StartsWith("DOCTYPE"))
                                                     && cmt.ParentNode != null
                                              select cmt)
                    {
                        node.ParentNode.RemoveChild(node);
                    }
                }
            }
    

    This strips comments correctly and leaves the doctype alone, which HtmlAgilityPack treats as a comment.

    Regex does work in controlled conditions, but if you're processing HTML from the wild web then I'd recommend using HtmlAgilityPack. The HTML that is out there is very unpredictable, and regex will break.
