HTML Agility Pack - removing unwanted tags without removing content?

忘了有多久 2020-11-29 03:00

I've seen a few related questions here, but they don't address exactly the problem I am facing.

I want to use the HTML Agility Pack to remove unwa

5 Answers
  • 2020-11-29 03:10

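    Before removing a node, capture its parent and the node's own InnerText, then replace the node with a text node holding that text. Note that this keeps only the text: any markup nested inside the removed node is lost.

    var parent = node.ParentNode;
    // Text of the node being removed (not of the parent, which would duplicate sibling text).
    var innerText = node.InnerText;
    // Swap the node for a plain-text node at the same position.
    parent.ReplaceChild(doc.CreateTextNode(innerText), node);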
    
  • 2020-11-29 03:23

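    If you would rather not use the HTML Agility Pack and simply want to strip every HTML tag from a string, a regular expression will do. Note that this removes all tags, so it cannot keep a selected few.

    // Requires: using System.Text.RegularExpressions; using System.Web;
    public static string RemoveHtmlTags(string strHtml)
    {
        // Drop everything between '<' and the next '>' (the tags themselves).
        string strText = Regex.Replace(strHtml, "<[^>]*>", String.Empty);
        // Turn entities such as &amp; back into plain characters.
        strText = HttpUtility.HtmlDecode(strText);
        // Collapse runs of whitespace into a single space.
        strText = Regex.Replace(strText, @"\s+", " ");
        return strText;
    }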
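
    To try the regex approach above without a System.Web reference, here is a minimal, self-contained sketch. The class and method names (TagStripper, StripAllTags) are hypothetical, and WebUtility.HtmlDecode from System.Net stands in for HttpUtility.HtmlDecode; that swap, the empty-input guard, and the final Trim are assumptions of this sketch, not part of the original answer.

    ```csharp
    using System;
    using System.Net;
    using System.Text.RegularExpressions;

    // Hypothetical helper illustrating the regex-only approach.
    public static class TagStripper
    {
        public static string StripAllTags(string html)
        {
            // Treat null or empty input as "nothing to strip".
            if (string.IsNullOrEmpty(html)) return string.Empty;

            // Remove every tag: '<', any characters except '>', then '>'.
            string text = Regex.Replace(html, "<[^>]*>", string.Empty);

            // Decode entities (assumption: WebUtility instead of HttpUtility).
            text = WebUtility.HtmlDecode(text);

            // Collapse whitespace runs and trim the ends.
            return Regex.Replace(text, @"\s+", " ").Trim();
        }
    }
    ```

    Calling TagStripper.StripAllTags("<p>my <b>bold</b> &amp; <i>italic</i> text</p>") returns "my bold & italic text". The b and i tags are removed along with the p, which is why this approach does not fit when some tags must be kept.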
    
  • 2020-11-29 03:27

    I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.

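    It removes all tags except strong, em, and u, and keeps raw text nodes. Note that it does not descend into the tags it keeps, so an unwanted tag nested inside strong, em, or u is left in place.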

    internal static string RemoveUnwantedTags(string data)
    {
        if(string.IsNullOrEmpty(data)) return string.Empty;
    
        var document = new HtmlDocument();
        document.LoadHtml(data);
    
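        // Tags that are kept as-is; every other element is unwrapped.
        // (acceptableTags.Contains below needs: using System.Linq;)
        var acceptableTags = new String[] { "strong", "em", "u"};
    
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
        if (topLevelNodes == null) return data;
    
        var nodes = new Queue<HtmlNode>(topLevelNodes);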
        while(nodes.Count > 0)
        {
            var node = nodes.Dequeue();
            var parentNode = node.ParentNode;
    
            if(!acceptableTags.Contains(node.Name) && node.Name != "#text")
            {
                var childNodes = node.SelectNodes("./*|./text()");
    
                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        nodes.Enqueue(child);
                        parentNode.InsertBefore(child, node);
                    }
                }
    
                parentNode.RemoveChild(node);
    
            }
        }
    
        return document.DocumentNode.InnerHtml;
    }
    
  • 2020-11-29 03:27

    How to recursively remove a given list of unwanted html tags from an html string

    I took @mathias's answer and improved the extension method so that you can supply the tags to remove as a List<string> (e.g. {"a","p","hr"}). I also fixed the logic so that it recurses properly:

    public static string RemoveUnwantedHtmlTags(this string html, List<string> unwantedTags)
        {
            if (String.IsNullOrEmpty(html))
            {
                return html;
            }
    
            var document = new HtmlDocument();
            document.LoadHtml(html);
    
            HtmlNodeCollection tryGetNodes = document.DocumentNode.SelectNodes("./*|./text()");
    
            if (tryGetNodes == null || !tryGetNodes.Any())
            {
                return html;
            }
    
            var nodes = new Queue<HtmlNode>(tryGetNodes);
    
            while (nodes.Count > 0)
            {
                var node = nodes.Dequeue();
                var parentNode = node.ParentNode;
    
                var childNodes = node.SelectNodes("./*|./text()");
    
                if (childNodes != null)
                {
                    foreach (var child in childNodes)
                    {
                        nodes.Enqueue(child);                       
                    }
                }
    
                // HtmlNode.Name is lower-case, so pass the unwanted tags in lower case too.
                if (unwantedTags.Contains(node.Name))
                {               
                    if (childNodes != null)
                    {
                        foreach (var child in childNodes)
                        {
                            parentNode.InsertBefore(child, node);
                        }
                    }
    
                    parentNode.RemoveChild(node);
    
                }
            }
    
            return document.DocumentNode.InnerHtml;
        }
    
  • 2020-11-29 03:37

    Try the following; you may find it a bit neater than the other proposed solutions:

    public static int RemoveNodesButKeepChildren(this HtmlNode rootNode, string xPath)
    {
        HtmlNodeCollection nodes = rootNode.SelectNodes(xPath);
        if (nodes == null)
            return 0;
        foreach (HtmlNode node in nodes)
            node.RemoveButKeepChildren();
        return nodes.Count;
    }
    
    public static void RemoveButKeepChildren(this HtmlNode node)
    {
        foreach (HtmlNode child in node.ChildNodes)
            node.ParentNode.InsertBefore(child, node);
        node.Remove();
    }
    
    public static bool TestYourSpecificExample()
    {
        string html = "<p>my paragraph <div>and my <b>div</b></div> are <i>italic</i> and <b>bold</b></p>";
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        document.DocumentNode.RemoveNodesButKeepChildren("//div");
        document.DocumentNode.RemoveNodesButKeepChildren("//p");
        return document.DocumentNode.InnerHtml == "my paragraph and my <b>div</b> are <i>italic</i> and <b>bold</b>";
    }
    