HtmlAgilityPack - remove script and style?

走了就别回头了 2020-11-29 04:32

I'm using the following method to extract text from HTML:

    public string getAllText(string _html)
    {
        string _allText = "";
        try
                


        
3 Answers
  • 2020-11-29 04:44

    Some excellent answers, System.Linq is handy!

    For a non-LINQ approach:

    // Requires: using HtmlAgilityPack;
    private HtmlAgilityPack.HtmlDocument RemoveScripts(HtmlAgilityPack.HtmlDocument webDocument)
    {
        // Get all <script> nodes.
        HtmlAgilityPack.HtmlNodeCollection nodes = webDocument.DocumentNode.SelectNodes("//script");

        // SelectNodes returns null when nothing matches.
        if (nodes == null)
            return webDocument;

        // Remove every matched node from the document.
        foreach (HtmlNode node in nodes)
            node.Remove();

        return webDocument;
    }
    
  • 2020-11-29 04:52

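    You can do so using the HtmlDocument class:

    // Requires: using System.Linq; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    
    doc.LoadHtml(input);
    
    // SelectNodes returns null when nothing matches, so guard with ?. before iterating.
    doc.DocumentNode.SelectNodes("//style|//script")?.ToList().ForEach(n => n.Remove());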
    
  • 2020-11-29 04:56
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    
    doc.DocumentNode.Descendants()
                    .Where(n => n.Name == "script" || n.Name == "style")
                    .ToList()
                    .ForEach(n => n.Remove());
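
    Tying this back to the question's goal of extracting text, the removal step can be followed by reading InnerText from what is left. This is only a minimal sketch: the class name HtmlTextExtractor and the method name GetVisibleText are hypothetical, and it is not the asker's original getAllText method, which is cut off above.

    ```csharp
    using System.Linq;
    using HtmlAgilityPack;

    public static class HtmlTextExtractor
    {
        // Hypothetical helper: strips <script> and <style> elements,
        // then returns the text of the remaining nodes.
        public static string GetVisibleText(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Materialize the matches first so that removing nodes
            // does not disturb the lazy Descendants() enumeration.
            var unwanted = doc.DocumentNode
                .Descendants()
                .Where(n => n.Name == "script" || n.Name == "style")
                .ToList();

            foreach (var node in unwanted)
            {
                node.Remove();
            }

            // InnerText concatenates the text of whatever nodes are left.
            return doc.DocumentNode.InnerText;
        }
    }
    ```

    Calling HtmlTextExtractor.GetVisibleText(html) on a page should then return its text without the contents of script or style blocks. Whitespace is not normalized here, so trimming may still be needed.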
    