How to detect a changed webpage?

Submitted by 烂漫一生 on 2019-12-10 19:20:01

Question


In my application, I fetch webpages periodically using LWP. Is there any way to check whether the webpage has changed between two consecutive fetches, other than explicitly comparing the content? Is there any signature (say, a CRC) generated at the lower protocol layers that can be extracted and compared against older signatures to detect possible changes?


Answer 1:


There are two possible approaches. One is to use a digest of the page, e.g.

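use strict;
use warnings;

use Digest::MD5 'md5_hex';
use LWP::UserAgent;

# fetch the page, etc. ($response is the HTTP::Response from that fetch)

# md5_hex expects bytes and dies on wide characters, so undo any
# Content-Encoding but skip charset decoding.
my $digest = md5_hex( $response->decoded_content( charset => 'none' ) );

# $saved_digest is the digest stored after the previous fetch.
if ( $digest ne $saved_digest ) {
    # the page has changed.
}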

Another option is to use an HTTP ETag, if the server provides one for the requested resource. Store it, then include it in an If-None-Match header on subsequent requests. If the server's ETag is still the same, you'll get a 304 Not Modified status and an empty response body. Otherwise you'll get the new page (and a new ETag). See Entity Tags in RFC 2616 (since superseded by RFC 9110).

Of course, the server could be lying, and sending the same ETag even though the content has changed. There's no way to know unless you look.
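
The ETag exchange described above can be sketched as two small helpers. This is a minimal sketch rather than a definitive implementation: conditional_request and check_response are hypothetical names, and how the saved ETag is stored between runs is left to the caller.

```perl
use strict;
use warnings;

use HTTP::Request;
use HTTP::Response;

# Build a GET request that is conditional on a previously saved ETag.
# $saved_etag is the ETag header value from the previous response
# (undef on the very first fetch, in which case no condition is sent).
sub conditional_request {
    my ( $url, $saved_etag ) = @_;
    my $request = HTTP::Request->new( GET => $url );
    $request->header( 'If-None-Match' => $saved_etag ) if defined $saved_etag;
    return $request;
}

# Interpret the response. Returns ( $changed, $etag_to_store ).
sub check_response {
    my ( $response, $saved_etag ) = @_;

    # 304 Not Modified: the server says our copy is still current.
    return ( 0, $saved_etag ) if $response->code == 304;

    die 'Fetch failed: ' . $response->status_line
        unless $response->is_success;

    # A new page arrived; remember its ETag (undef if none was sent).
    return ( 1, scalar $response->header('ETag') );
}
```

With a real user agent, the request would be sent as LWP::UserAgent->new->request( conditional_request( $url, $saved_etag ) ) and the result passed to check_response.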




Answer 2:


You should use the If-Modified-Since request header, noting the gotchas in the RFC. You send this header, carrying the time of your last fetch, with the request. If the server supports it and considers the content newer than that time, it sends you the content. If it thinks you already have the most recent version, it returns a 304 Not Modified with no message body.

However, as other answers have noted, the server doesn't have to tell you the truth, so you're sometimes stuck downloading the content and checking yourself. Many dynamic things will always claim to have new content because many developers have never thought about supporting basic HTTP things in their web apps.
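
One way to combine the two ideas, offered as a sketch under assumptions: trust a 304 when the server sends one, and otherwise fall back to comparing a digest of the body. content_changed is a hypothetical helper name, and the choice of MD5 simply follows the first answer.

```perl
use strict;
use warnings;

use Digest::MD5 'md5_hex';
use HTTP::Response;

# Decide whether a page changed, given the response to a conditional GET
# and the digest saved after the previous fetch (undef on the first run).
# Returns ( $changed, $digest_to_store ).
sub content_changed {
    my ( $response, $saved_digest ) = @_;

    # The server says nothing changed, so there is no body to hash.
    return ( 0, $saved_digest ) if $response->code == 304;

    die 'Fetch failed: ' . $response->status_line
        unless $response->is_success;

    # charset => 'none' undoes any Content-Encoding but leaves raw bytes,
    # which is what md5_hex expects.
    my $bytes = $response->decoded_content( charset => 'none' );
    die 'Could not decode response body' unless defined $bytes;

    my $digest  = md5_hex($bytes);
    my $changed = ( !defined $saved_digest || $digest ne $saved_digest ) ? 1 : 0;
    return ( $changed, $digest );
}
```

A server that always claims to have new content then costs a download, but is not reported as changed unless the bytes really differ.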

For the LWP bits, you can create a single request with an extra header:

use HTTP::Request;
use LWP::UserAgent;

my $ua      = LWP::UserAgent->new;
my $request = HTTP::Request->new( GET => $url );

# $time must be an HTTP date string, e.g. from HTTP::Date::time2str
$request->header( 'If-Modified-Since' => $time );

my $response = $ua->request( $request );

For all requests, you can set a request handler:

$ua->add_handler(
    request_send => sub {
        my ( $request, $ua, $h ) = @_;
        # ... look up time from local store
        $request->header( 'If-Modified-Since' => $time );
        return;    # a false return value lets LWP go on to send the request
    }
);

However, LWP can do most of this for you with mirror if you want to save the files:

$ua->mirror( $url, $filename );
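
As a rough sketch of how the result of mirror might be interpreted: mirror returns an HTTP::Response and, when the local file already exists, sends If-Modified-Since based on that file's modification time, so a 304 means the copy on disk is still current. The helper names mirror_status and mirror_and_report are hypothetical.

```perl
use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Response;

# Turn the HTTP::Response returned by mirror() into a simple label.
sub mirror_status {
    my ($response) = @_;
    return 'unchanged' if $response->code == 304;    # local file is current
    return 'updated'   if $response->is_success;     # file was (re)written
    return 'error: ' . $response->status_line;
}

# Mirror $url into $filename and report what happened.
sub mirror_and_report {
    my ( $ua, $url, $filename ) = @_;
    return mirror_status( $ua->mirror( $url, $filename ) );
}
```

Calling mirror_and_report( LWP::UserAgent->new, $url, $filename ) on each polling cycle would then yield 'updated' only when the server delivered a newer copy.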


Source: https://stackoverflow.com/questions/10201009/how-to-detect-a-changed-webpage
