How can I retrieve all the text nodes of an HTMLDocument in the fastest way in C#?

Happy的楠姐 2021-01-23 11:03

I need to perform some logic on all the text nodes of an HTMLDocument. This is how I currently do it:

HTMLDocument pageContent = (HTMLDocument)_webBrowser2.Docu

2 Answers
  • 2021-01-23 11:19

    You could access all the text nodes in one shot using XPath in HTML Agility Pack.

    I think this would work as shown, but I have not tried it.

    using HtmlAgilityPack;
    HtmlDocument htmlDoc = new HtmlDocument();
    
    // filePath is a path to a file containing the html
    htmlDoc.Load(filePath);
    HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");
    
    // SelectNodes returns null (not an empty collection) when nothing matches
    if (coll != null)
    {
      foreach (HtmlNode node in coll)
      {
        // do the work for a text node here
      }
    }
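
    If the markup is already in memory rather than in a file (for example, taken from the browser control), the same XPath query can be run on a string. The following is only a sketch under assumptions: it requires the HtmlAgilityPack NuGet package, and `TextNodeDemo`, `GetTextNodes`, and the sample markup are hypothetical names chosen for illustration. It also skips whitespace-only text nodes, which may or may not be what you want.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    public static class TextNodeDemo
    {
        // Hypothetical helper: returns the non-blank text nodes of an HTML string.
        public static List<HtmlNode> GetTextNodes(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html); // parse from a string instead of a file

            var result = new List<HtmlNode>();

            // SelectNodes returns null when the query matches nothing.
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
            if (nodes == null)
            {
                return result;
            }

            foreach (HtmlNode node in nodes)
            {
                // Skip text nodes that contain only whitespace (indentation, line breaks).
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    result.Add(node);
                }
            }
            return result;
        }

        public static void Main()
        {
            // Sample markup, made up for this sketch.
            string html = "<html><body><p>Hello <b>world</b></p></body></html>";

            foreach (HtmlNode node in GetTextNodes(html))
            {
                Console.WriteLine(node.InnerText.Trim());
            }
        }
    }
    ```

    Collecting the nodes into a list first keeps the per-node work separate from the query, so the two can be timed independently if speed is the concern.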
    
  • 2021-01-23 11:34

    It might be best to iterate over the childNodes (direct children) within a recursive function, starting at the top level, something like:

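    // pageContent is the mshtml HTMLDocument from the question
    IHTMLElementCollection collection = pageContent.getElementsByTagName("HTML");
    IHTMLDOMNode htmlNode = (IHTMLDOMNode)collection.item(0, 0);
    ProcessChildNodes(htmlNode);
    
    private void ProcessChildNodes(IHTMLDOMNode node)
    {
        // childNodes is exposed as object by the mshtml interop, so cast it before enumerating
        foreach (IHTMLDOMNode childNode in (IHTMLDOMChildrenCollection)node.childNodes)
        {
            // nodeType 3 is a text node
            if (childNode.nodeType == 3)
            {
                // ...
            }
            ProcessChildNodes(childNode);
        }
    }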
    