How do I extract links from HTML with a Perl regex?

南旧 2021-01-17 07:02

I have a huge HTML file that contains many things I don't need, but it also contains URLs in the following format:



        
1 Answer
  • 2021-01-17 07:35

    Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don't need a regex at all.

    Here's a short example. You don't have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:

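    #!perl
    use strict;
    use warnings;
    use HTML::LinkExtor;
    
    # Also collect the class attribute of <a> tags, not just href
    $HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];
    
    my $p = HTML::LinkExtor->new;
    $p->parse( do { local $/; <> } );  # slurp the HTML from files or STDIN
    $p->eof;
    
    # Each link is [ $tag, attr => value, ... ]; keep those with class="foo"
    my @links = grep { 
        my( $tag, %hash ) = @$_;
        no warnings 'uninitialized';
        $hash{class} eq 'foo';
        } $p->links;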
    

    If you need to collect URLs for any other tags, you make similar adjustments.
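
    As a rough sketch of such an adjustment, suppose you wanted the src of every img tag whose alt text is "logo". The choice of tag, the attributes, and the sample HTML below are assumptions for illustration and do not come from the question:

    ```perl
    use strict;
    use warnings;
    use HTML::LinkExtor;

    # Assumption: for <img> tags, collect src plus alt so we can filter on alt
    $HTML::Tagset::linkElements{'img'} = [ qw( src alt ) ];

    # Hypothetical sample input
    my $html = '<img src="/a.png" alt="logo">'
             . '<img src="/b.png" alt="photo">'
             . '<a href="/x">x</a>';

    my $p = HTML::LinkExtor->new;
    $p->parse( $html );
    $p->eof;

    # Each link is [ $tag, attr => value, ... ]
    my @srcs = map  { my( $tag, %attr ) = @$_; $attr{src} }
               grep {
                   my( $tag, %attr ) = @$_;
                   $tag eq 'img' and ( $attr{alt} // '' ) eq 'logo';
                   } $p->links;

    print "$_\n" for @srcs;
    ```

    The grep checks the tag name as well, because $p->links returns the links for every tag listed in %HTML::Tagset::linkElements, not only the one you adjusted.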

    If you'd rather have a callback routine, that's not so hard either. You can watch the links as the parser runs into them:

    use HTML::LinkExtor;
    
    $HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];
    
    my @links;
    my $callback = sub {
        my( $tag, %hash ) = @_;
        no warnings 'uninitialized';
        push @links, $hash{href} if $hash{class} eq 'foo';
        };
    
    my $p = HTML::LinkExtor->new( $callback );
    $p->parse( do { local $/; <DATA> } );
    