How can I get the ultimate URL without fetching the pages using Perl and LWP?

Submitted by 蓝咒 on 2019-12-22 05:34:19

Question


I'm doing some web scraping using Perl's LWP. I need to process a set of URLs, some of which may redirect (1 or more times).

How can I get the ultimate URL, with all redirects resolved, using the HEAD method?


Answer 1:


If you use the full-featured LWP::UserAgent, the response it returns is an instance of HTTP::Response, which in turn holds an HTTP::Request as an attribute. Note that this is NOT necessarily the same HTTP::Request that you created from the original URL in your set. The HTTP::Response documentation says so in its description of the method that retrieves the request from the response:

$r->request( $request )

This is used to get/set the request attribute. The request attribute is a reference to the request that caused this response. It does not have to be the same request passed to the $ua->request() method, because there might have been redirects and authorization retries in between.

Once you have the request object, call its uri method to get the URI. If redirects were followed, that URI is the last one in the redirect chain.

Here's a Perl script, tested and verified, that gives you the skeleton of what you need:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;   # loaded by LWP::UserAgent anyway; listed because it is used directly

my $ua;  # Instance of LWP::UserAgent
my $req; # Instance of (original) request
my $res; # Instance of HTTP::Response returned via request method

$ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);

$req = HTTP::Request->new(HEAD => 'http://www.ecu.edu/wllc');
$req->header('Accept' => 'text/html');

$res = $ua->request($req);

if ($res->is_success) {
    # $res->request is the request that produced this final response;
    # after redirects it is not the same object as $req.
    # You may want to check that it is defined before chaining calls.
    # The line below is the inline version of:
    #   my $finalrequest = $res->request;
    #   print "Final URI = " . $finalrequest->uri . "\n";
    print "Final URI = " . $res->request->uri . "\n";
} else {
    print "Error: " . $res->status_line . "\n";
}
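
Since the question is about a set of URLs, the same idea can be wrapped in a small helper. This is only a sketch and not part of the original answer: resolve_final_url is a hypothetical name, the URLs are assumed to come from the command line, and the 10-second timeout is an arbitrary choice.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper (not part of LWP): send a HEAD request and return
# the URI of the last request in the redirect chain, or undef on failure.
sub resolve_final_url {
    my ($ua, $url) = @_;

    my $res = $ua->head($url);
    return unless $res->is_success;

    # The request attached to the final response may differ from the
    # one originally sent, because LWP follows redirects in between.
    my $final = $res->request;
    return defined $final ? $final->uri->as_string : undef;
}

# max_redirect => 7 is the documented default; timeout is an assumption.
my $ua = LWP::UserAgent->new(max_redirect => 7, timeout => 10);

# Assumed usage: perl resolve.pl URL [URL ...]
for my $url (@ARGV) {
    my $final = resolve_final_url($ua, $url);
    printf "%s => %s\n", $url, defined $final ? $final : '(failed)';
}
```

Some servers answer HEAD differently from GET, or reject it, so a '(failed)' result for a URL does not necessarily mean the page is unreachable.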



Answer 2:


As stated in perldoc LWP::UserAgent, the default is to follow redirects for GET and HEAD requests:

$ua = LWP::UserAgent->new( %options )

...
       KEY                     DEFAULT
       -----------             --------------------
       max_redirect            7
       ...
       requests_redirectable   ['GET', 'HEAD']

Here is an example:

#!/usr/bin/perl

use strict; use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->show_progress(1);

my $response = $ua->head('http://unur.com/');

if ( $response->is_success ) {
    print $response->request->uri->as_string, "\n";
}

Output:

** HEAD http://unur.com/ ==> 301 Moved Permanently (1s)
** HEAD http://www.unur.com/ ==> 200 OK
http://www.unur.com/
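
If the intermediate hops matter as well as the final URL, HTTP::Response also provides a redirects method that returns the earlier responses in the chain, oldest first. The sketch below assumes that is what you want. redirect_chain is a hypothetical helper name, and the URLs are again taken from the command line.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper: return every URI requested on the way to $res,
# oldest first, ending with the final one.
sub redirect_chain {
    my ($res) = @_;

    # redirects() lists the responses that led up to $res; each one
    # carries the request that produced it.
    return map { $_->request->uri->as_string } ($res->redirects, $res);
}

my $ua = LWP::UserAgent->new;

# Assumed usage: perl chain.pl URL [URL ...]
for my $url (@ARGV) {
    my $res = $ua->head($url);
    print join(' -> ', redirect_chain($res)), "\n";
}
```

The last element of the returned list is the same URI that $response->request->uri gives in the example above.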


Source: https://stackoverflow.com/questions/2470053/how-can-i-get-the-ultimate-url-without-fetching-the-pages-using-perl-and-lwp
