How can I select from only one table with Web::Scraper?

梦毁少年i 2021-01-25 06:19

I want to extract only the text under the heading "Node Object Methods" from a webpage. The relevant HTML is as follows:

Node Object Properties

3 Answers
  • 2021-01-25 06:34

    Web::Scraper can use the CSS :nth-of-type pseudo-class to choose the right table. There are two tables with the same class, so you can pick the second one with table.reference:nth-of-type(2):

    use v5.22;
    
    use feature qw(postderef);
    no warnings qw(experimental::postderef);
    
    
    use Web::Scraper;
    
    my $html = do { local $/; <DATA> };
    
    my $methods = scraper {
        process "table.reference:nth-of-type(2) > tr > td > a", 'renners[]' => 'TEXT';
    };
    my $res = $methods->scrape( $html );
    
    say join "\n", $res->{renners}->@*;
    

    And here's a Mojo::DOM version:

    use v5.10;  # enables say
    use Mojo::DOM;
    
    my $html = do { local $/; <DATA> };
    
    my $dom = Mojo::DOM->new( $html );
    
    say $dom
        ->find( 'table.reference:nth-of-type(2) > tr > td > a' )
        ->map( 'text' )
        ->join( "\n" );
    

    I tried looking for a selector solution that could recognize the text in the h2, but my kung fu is weak here.
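One possible way to key off the h2 text with Mojo::DOM, offered as a sketch rather than a definitive answer: find the h2 elements, keep the first whose text equals the heading, then take the nearest following sibling table. The HTML below is a made-up stand-in for the page in the question (the property name and the href values are assumptions), so the selectors may need adjusting for the real markup.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical stand-in for the page in the question: two tables with the
# same class, each preceded by an h2 heading.
my $html = <<'HTML';
<html><body>
<h2>Node Object Properties</h2>
<table class="reference">
<tr><th>Property</th><th>Description</th></tr>
<tr><td><a href="#">nodeName</a></td><td>...</td></tr>
</table>
<h2>Node Object Methods</h2>
<table class="reference">
<tr><th>Method</th><th>Description</th></tr>
<tr><td><a href="#">appendChild()</a></td><td>...</td></tr>
<tr><td><a href="#">cloneNode()</a></td><td>...</td></tr>
</table>
</body></html>
HTML

my $dom = Mojo::DOM->new($html);

# Pick the first h2 whose own text is exactly the heading we want.
my $heading = $dom->find('h2')->first(sub { $_->text eq 'Node Object Methods' });

# following('table') lists the sibling tables after the heading, nearest first.
my $table = $heading ? $heading->following('table')->first : undef;

# Collect the link text from the cells of that table only.
my @methods = $table ? $table->find('td > a')->map('text')->each : ();

say for @methods;
```

Unlike the nth-of-type(2) selector, this does not depend on the position of the table, only on the heading text, so it should keep working if another table is added earlier in the page.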

  • 2021-01-25 06:46

    Web::Query provides an almost identical solution to the Mojo::DOM solution proposed by brian d foy.

    use v5.10;  # enables say
    use Web::Query;
    
    my $html = do { local $/; <DATA> };
    
    wq($html)
        ->find('table.reference:nth-of-type(2) > tr > td > a')
        ->each(sub {
            my ($i, $e) = @_;
            say $e->text();
        });
    

    However, it looks like Mojo::DOM is the more robust library. For Web::Query to match the selector correctly, I had to edit the input provided in the question and add a root node surrounding all the other content.

    __DATA__
    <html>
    ...
    </html>
    
  • 2021-01-25 06:53

    You can use XPath to extract data from the first table that follows the heading "Node Object Methods", like so:

    use v5.10;  # enables say
    use Web::Scraper;
    
    my $html = do { local $/; <DATA> };
    
    my $methods = scraper {
        process '//h2[.="Node Object Methods"]/following-sibling::table[1]//tr/td[1]',
            'renners[]' => 'TEXT';
    };
    my $res = $methods->scrape( $html );
    
    say join "\n", @{ $res->{renners} };
    

    The output will be:

    appendChild()
    cloneNode()
    compareDocumentPosition()
    getFeature(feature,version)
    getUserData(key)
    hasAttributes()
    hasChildNodes()
    insertBefore()
    