htmlagilitypack - remove script and style?

删除回忆录丶 提交于 2019-12-17 06:05:31

问题


Im using the following method to extract text form html:

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);


            var root = document.DocumentNode;
            var sb = new StringBuilder();
            foreach (var node in root.DescendantNodesAndSelf())
            {
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }

            _allText = sb.ToString();

        }
        catch (Exception)
        {
        }

        _allText = System.Web.HttpUtility.HtmlDecode(_allText);

        return _allText;
    }

Problem is that i also get script and style tags.

How could i exclude them?


回答1:


HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

doc.DocumentNode.Descendants()
                .Where(n => n.Name == "script" || n.Name == "style")
                .ToList()
                .ForEach(n => n.Remove());



回答2:


You can do so using HtmlDocument class:

HtmlDocument doc = new HtmlDocument();

doc.LoadHtml(input);

doc.DocumentNode.SelectNodes("//style|//script").ToList().ForEach(n => n.Remove());



回答3:


Some excellent answers, System.Linq is handy!

For a non Linq based approach:

private HtmlAgilityPack.HtmlDocument RemoveScripts(HtmlAgilityPack.HtmlDocument webDocument)
{

// Get all Nodes: script
HtmlAgilityPack.HtmlNodeCollection Nodes = webDocument.DocumentNode.SelectNodes("//script");

// Make sure not Null:
if (Nodes == null)
    return webDocument;

// Remove all Nodes:
foreach (HtmlNode node in Nodes)
    node.Remove();

return webDocument;

}


来源:https://stackoverflow.com/questions/13441470/htmlagilitypack-remove-script-and-style

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!