How can I retrieve all the text nodes of an HTMLDocument in the fastest way in C#?


I need to perform some logic on all the text nodes of an HTMLDocument. This is how I currently do it:

HTMLDocument pageContent = (HTMLDocument)_webBrowser2.Docu


                      


      
      
        
2 Answers

        
                         				            
            
           
            
                              
                
              
              
                
北荒
2021-01-23 11:19
              
            
            
                                                                       
You could access all the text nodes in one shot using XPath in HTML Agility Pack.

I think this would work as shown, but I have not tried it.

using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();

// filePath is a path to a file containing the HTML
htmlDoc.Load(filePath);

// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");

if (coll != null)
{
    foreach (HtmlNode node in coll)
    {
        // do the work for a text node here
    }
}
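
As a variation on the XPath approach above, here is a minimal sketch that parses markup from a string and keeps only the text nodes that are likely to matter. It assumes HTML Agility Pack is referenced; the class name `TextNodeExample`, the helper `GetVisibleTextNodes`, and the choice to skip `<script>`/`<style>` contents and whitespace-only nodes are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TextNodeExample
{
    // Hypothetical helper: returns the non-blank text nodes of an HTML string,
    // skipping text that sits directly inside <script> or <style> elements.
    public static List<HtmlNode> GetVisibleTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        return doc.DocumentNode
            .Descendants()                                  // every node below the root
            .Where(n => n.NodeType == HtmlNodeType.Text)    // keep text nodes only
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            .Where(n => !string.IsNullOrWhiteSpace(n.InnerText))
            .ToList();
    }

    public static void Main()
    {
        string html = "<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>";
        foreach (HtmlNode node in GetVisibleTextNodes(html))
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
```

Filtering with LINQ instead of XPath avoids the null return of `SelectNodes` when nothing matches; whether it is faster for a given page would need to be measured.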

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
攒了一身酷
2021-01-23 11:34
              
            
            
                                                                       
It might be best to iterate over the childNodes (direct descendants) within a recursive function, starting at the top level, something like:

// Start at the root <html> element; with mshtml, documentElement is exposed through IHTMLDocument3
IHTMLDOMNode htmlNode = (IHTMLDOMNode)((IHTMLDocument3)pageContent).documentElement;
ProcessChildNodes(htmlNode);

private void ProcessChildNodes(IHTMLDOMNode node)
{
    foreach (IHTMLDOMNode childNode in node.childNodes)
    {
        if (childNode.nodeType == 3) // 3 = text node
        {
            // ...
        }
        else
        {
            // text nodes have no children, so only recurse into other node types
            ProcessChildNodes(childNode);
        }
    }
}
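
The same traversal can also be written without recursion. The following is a rough sketch, assuming a reference to the mshtml interop assembly; the class name `TextNodeCollector` and the method `CollectTextNodes` are hypothetical. It walks siblings through `firstChild` and `nextSibling`, which are typed as `IHTMLDOMNode`, and uses an explicit stack so very deep documents do not grow the call stack.

```csharp
using System.Collections.Generic;
using mshtml;

public static class TextNodeCollector
{
    private const int TextNodeType = 3; // DOM nodeType value for text nodes

    // Hypothetical helper: gathers every text node below 'root'.
    // Nodes are NOT guaranteed to come back in document order,
    // because pending subtrees are taken from a stack.
    public static List<IHTMLDOMNode> CollectTextNodes(IHTMLDOMNode root)
    {
        var result = new List<IHTMLDOMNode>();
        var pending = new Stack<IHTMLDOMNode>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            IHTMLDOMNode node = pending.Pop();

            // Walk the direct children via the typed sibling links
            for (IHTMLDOMNode child = node.firstChild; child != null; child = child.nextSibling)
            {
                if (child.nodeType == TextNodeType)
                {
                    result.Add(child);
                }
                else
                {
                    pending.Push(child); // visit this subtree later
                }
            }
        }

        return result;
    }
}
```

It could be called as `TextNodeCollector.CollectTextNodes(htmlNode)` with the root node obtained as in the answer above. Each property access is still a COM call, so it is worth timing this against the recursive version before assuming it is faster.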

                                                                        
                                                        
            
            
              
                