How can I select from only one table with Web::Scraper?


I want to extract only the text under the heading Node Object Methods from a webpage. The relevant HTML is as follows:

Node Object Properties


                      
3 answers

无人共我, 2021-01-25 06:34

Web::Scraper can use the CSS :nth-of-type pseudo-class to choose the right table. There are two tables with the same class, so you can say table.reference:nth-of-type(2):

use v5.22;

use feature qw(postderef);
no warnings qw(experimental::postderef);


use Web::Scraper;

my $html = do { local $/; <DATA> };

my $methods = scraper {
    process "table.reference:nth-of-type(2) > tr > td > a", 'renners[]' => 'TEXT';
    };
my $res = $methods->scrape( $html );

say join "\n", $res->{renners}->@*;


And here's a Mojo::DOM version:

use v5.10;  # enables say

use Mojo::DOM;

my $html = do { local $/; <DATA> };

my $dom = Mojo::DOM->new( $html );

say $dom
    ->find( 'table.reference:nth-of-type(2) > tr > td > a' )
    ->map( 'text' )
    ->join( "\n" );
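
One caveat about the selector used in both snippets: :nth-of-type counts sibling elements by tag name, not by class. The sketch below illustrates this with Mojo::DOM; the markup and the ids t1 to t3 are made up for the example and are not from the page in the question.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: three sibling tables, only the last two have
# the class "reference"
my $html = <<'HTML';
<div>
  <table class="layout" id="t1"><tr><td>nav</td></tr></table>
  <table class="reference" id="t2"><tr><td>properties</td></tr></table>
  <table class="reference" id="t3"><tr><td>methods</td></tr></table>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# :nth-of-type(2) selects the second <table> among its siblings and the
# class must also match, so this finds t2, not the second
# table.reference (t3)
my @ids = $dom->find('table.reference:nth-of-type(2)')
              ->map( attr => 'id' )
              ->each;

say "@ids";
```

So the positional selector works for the question's page only as long as the wanted table really is the second table element among its siblings.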


I tried looking for a selector solution that could recognize the text in the h2, but my kung fu is weak here.
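
One possible way around that is to match the heading text in Perl instead of in the selector, then walk to the next table with Mojo::DOM's following method. This is a minimal sketch, not a tested answer for the real page: the HTML below is an invented stand-in for the markup in the question.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Made-up stand-in for the page in the question
my $html = <<'HTML';
<div>
<h2>Node Object Properties</h2>
<table class="reference">
<tr><th>Property</th></tr>
<tr><td><a href="#">attributes</a></td></tr>
</table>
<h2>Node Object Methods</h2>
<table class="reference">
<tr><th>Method</th></tr>
<tr><td><a href="#">appendChild()</a></td></tr>
<tr><td><a href="#">cloneNode()</a></td></tr>
</table>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Pick the h2 by its exact text in Perl rather than in the selector
my $heading = $dom->find('h2')
                  ->first( sub { $_->text eq 'Node Object Methods' } )
    or die "heading not found\n";

# following() returns the later siblings that match the selector,
# so the first one is the table right after the heading
my $table = $heading->following('table')->first
    or die "no table after the heading\n";

my @methods = $table->find('td > a')->map('text')->each;

say for @methods;
```

This keys the lookup on the heading rather than on the table's position, so it should keep working if another table is added earlier on the page.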
                                                                        
                                                        
            
            
              
                
广开言路, 2021-01-25 06:46

Web::Query offers a solution almost identical to the Mojo::DOM one proposed by brian d foy:

use v5.10;  # enables say

use Web::Query;

my $html = do { local $/; <DATA> };

wq($html)
    ->find('table.reference:nth-of-type(2) > tr > td > a')
    ->each(sub {
        my ($i, $e) = @_;
        say $e->text();
    });


However, Mojo::DOM looks like the more robust library. For the Web::Query selector to match, I had to edit the input from the question and wrap all of the content in a root node:

__DATA__
<html>
...
</html>

                                                                        
                                                        
            
            
              
                
[愿得一人], 2021-01-25 06:53

You can use XPath to extract data from the first table that follows the heading Node Object Methods, like so:

use v5.10;  # enables say

use Web::Scraper;

my $html = do { local $/; <DATA> };

my $methods = scraper {
    process '//h2[.="Node Object Methods"]/following-sibling::table[1]//tr/td[1]', 
        'renners[]' => 'TEXT';
};  
my $res = $methods->scrape( $html );

say join "\n", @{ $res->{renners} };


The output will be:

appendChild()
cloneNode()
compareDocumentPosition()
getFeature(feature,version)
getUserData(key)
hasAttributes()
hasChildNodes()
insertBefore()
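
If the same lookup is needed for more than one heading, the XPath can be wrapped in a small function. The following is only a sketch: first_column_under is a hypothetical helper name, the HTML is an invented stand-in for the real page, and the heading text is assumed to contain no double quotes because it is interpolated directly into the expression.

```perl
use v5.10;
use strict;
use warnings;
use Web::Scraper;

# Hypothetical helper: text of the first-column cells in the first table
# after the h2 with the given text (assumes no double quotes in $heading)
sub first_column_under {
    my ( $html, $heading ) = @_;

    my $xpath = qq{//h2[.="$heading"]/following-sibling::table[1]//tr/td[1]};
    my $cells = scraper {
        process $xpath, 'cells[]' => 'TEXT';
    };

    # An empty list is returned when nothing matches
    return @{ $cells->scrape($html)->{cells} || [] };
}

# Made-up stand-in for the page in the question
my $html = <<'HTML';
<html><body>
<h2>Node Object Properties</h2>
<table class="reference">
<tr><th>Property</th><th>Description</th></tr>
<tr><td>attributes</td><td>A map of attributes</td></tr>
</table>
<h2>Node Object Methods</h2>
<table class="reference">
<tr><th>Method</th><th>Description</th></tr>
<tr><td>appendChild()</td><td>Adds a child node</td></tr>
<tr><td>cloneNode()</td><td>Clones a node</td></tr>
</table>
</body></html>
HTML

my @methods = first_column_under( $html, 'Node Object Methods' );
say for @methods;
```

Header rows use th rather than td, so td[1] skips them and only the data cells of the first column are collected.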

                                                                        
                                                        
            
            
              
                