Parsing html with the HTML Agility Pack and Linq

前端未结

关注

 5  1575

I have the following HTML

(..)

 
   Test1 
   Data


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  后悔当初        
                
              
                            
                2021-02-02 18:45
              
            
            
                                                                       
As for your attempt, you have two issues with your code:


ChildNodes is weird - it also returns whitespace text nodes, which don't have a class attributes (can't have attributes, of course).
As James Walford commented, the spaces around the text are significant, you probably want to trim them.


With these two corrections, the following works:

var data =
      from tr in doc.DocumentNode.Descendants("tr")
      from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
     where td.InnerText.Trim() == "Test1"
    select tr;

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  抹茶落季        
                
              
                            
                2021-02-02 18:49
              
            
            
                                                                       
Here's one approach  - first parse all data into a data structure, and then read it. This is a little messy and certainly needs more validation, but here goes:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://jsbin.com/ezuge4");
HtmlNodeCollection nodes = doc.DocumentNode
                              .SelectNodes("//table[@id='MyTable']//tr");
var data = nodes.Select(
    node => node.Descendants("td")
        .ToDictionary(descendant => descendant.Attributes["class"].Value,
                      descendant => descendant.InnerText.Trim())
        ).ToDictionary(dict => dict["name"]);
string test1Data = data["Test1"]["data"];


Here I turn every <tr> to a dictionary, where the class of the <td> is a key and the text is a value. Next, I turn the list of dictionaries into a dictionary of dictionaries (tip - abstract that away), where the name of every <tr> is the key.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  傲寒        
                
              
                            
                2021-02-02 18:55
              
            
            
                                                                       
instead of 

td.InnerText == "Test1"


try

td.InnerText == " Test1 "


or

d.InnerText.Trim() == "Test1"

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  有刺的猬        
                
              
                            
                2021-02-02 18:55
              
            
            
                                                                       
I can recommend one of two ways:

http://htmlagilitypack.codeplex.com/, which converts the html to valid xml which can then be queried against with OOTB Linq.

Or,

Linq to HTML (http://www.superstarcoders.com/linq-to-html.aspx), which while not maintained on CodePlex ( that was a hint, Keith ), gives a reasonable working set of features to springboard from.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  耶瑟儿～        
                
              
                            
                2021-02-02 19:00
              
            
            
                                                                       
Here is the XPATH way - hmmm... everyone seems to have forgotten about the power XPATH and concentrate exclusively on C# XLinq, these days :-)

This function gets all data values associated with a name:

public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
    return from HtmlNode node in
        document.DocumentNode.SelectNodes("//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td")
        select node.InnerText.Trim();
}


For example, this code will dump all 'Test2' data:

    HtmlDocument doc = new HtmlDocument();
    doc.Load(yourHtml);

    foreach (string data in GetData(doc, "Test2"))
    {
        Console.WriteLine(data);
    }

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复