Question
How can I extract information from a website (http://tv.yahoo.com/listings) and then create an XML file out of it? I want to save it so that I can parse it later and display the information using JavaScript.
I am quite new to Perl and I have no idea how to do it.
Answer 1:
Of course. The easiest way would be the Web::Scraper module. It lets you define scraper objects that consist of
- hash key names,
- XPath expressions that locate elements of interest,
- and code to extract bits of data from them.
Scraper objects take a URL and return a hash of the extracted data. The extractor code for each key can itself be another scraper object, if necessary, so that you can define how to scrape repeated compound page elements: provide the XPath to find the compound element in an outer scraper, then provide a bunch more XPaths to pull out its individual bits in an inner scraper. The result is then automatically a nested data structure.
In short, you can very elegantly suck data from all over a page into a Perl data structure. In doing so, the full power of XPath + Perl is available for use against any page. Since the page is parsed with HTML::TreeBuilder, it does not matter how nasty a tag soup it is. The resulting scraper scripts are much easier to maintain and far more tolerant of minor markup variations than regex-based scrapers.
Bad news: as yet, its documentation is almost non-existent, so you have to get by with googling for something like [miyagawa web::scraper] to find example scripts posted by the module’s author.
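To give you an idea, here is a minimal sketch of what such a script might look like. Note the '.show' and '.showTitle' selectors are borrowed from the pQuery answer below, so treat them as assumptions about the page's markup rather than something verified:

use URI;
use Web::Scraper;

# Outer scraper: find each compound '.show' element.
# Inner scraper: pull the individual bits out of each one.
my $listings = scraper {
    process '.show', 'shows[]' => scraper {
        process '.showTitle', name => 'TEXT';
        process 'strong',     time => 'TEXT';
    };
};

# scrape() fetches the page and returns a hash of the extracted
# data -- here, a nested structure under the 'shows' key.
my $res = $listings->scrape( URI->new('http://tv.yahoo.com/listings') );

for my $show ( @{ $res->{shows} || [] } ) {
    print $show->{name}, ' @ ', $show->{time}, "\n";
}

Once you have that data structure, writing it out as XML (e.g. with XML::Simple) or as JSON is straightforward.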
Answer 2:
While in general LWP::Simple or WWW::Mechanize and HTML::Tree are good ways to extract data from web pages, in this particular case (TV listings) there's a much easier way:
Use XMLTV with data from Schedules Direct. There is a small fee (US$20/year), but there are advantages:
- The parsing code is already written for you (just use XMLTV; — see the sketch after this list).
- You won't be violating Yahoo's terms of service.
- You won't have to deal with Yahoo actively trying to break your script. (They don't like automated scripts pulling down TV listings; see the previous point.)
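As a minimal sketch of reading such a listings file with the XMLTV module (the filename is a placeholder, and the exact shape of the programme records is documented in XMLTV.pm):

use XMLTV;

# parsefile() returns a reference to
# [ encoding, credits, channels, programmes ].
my $data = XMLTV::parsefile('listings.xml');
my ( $encoding, $credits, $channels, $programmes ) = @$data;

for my $prog (@$programmes) {
    # Each title is a [ text, language ] pair; take the text of the first.
    my $title = $prog->{title}[0][0];

    # 'stop' may be missing for some programmes.
    print "$title: $prog->{start} - $prog->{stop}\n";
}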
Answer 3:
If you want to pass the information to JavaScript, use JavaScript Object Notation (JSON) instead of XML. There are plenty of Perl libraries, such as JSON::Any, that can handle that for you.
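A minimal sketch, using made-up listings data to stand in for whatever you scraped:

use JSON::Any;

# Hypothetical scraped data.
my @programmes = (
    { name => 'Local Programming', time => '4:00pm - 6:30pm' },
);

my $json = JSON::Any->new->encode( \@programmes );

# Save it for the JavaScript front end to load later.
open my $fh, '>', 'listings.json' or die "Can't write listings.json: $!";
print {$fh} $json;
close $fh;

On the JavaScript side, the file can then be consumed directly with JSON.parse (or any JSON library), with no XML parsing needed.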
Answer 4:
tv.yahoo.com is not very semantic and not very easy to scrape! Maybe there are better alternatives or feeds?
Using pQuery I can quickly get times & shows....
use 5.010;    # for 'say'
use pQuery;

pQuery( 'http://tv.yahoo.com/listings' )
    ->find( '.show' )->each(
        sub {
            my $n  = shift;         # index of the matched element (unused here)
            my $pQ = pQuery( $_ );  # $_ holds the current DOM element
            say $pQ->text;
        }
    );
# => 4:00pm - 6:30pm Local Programming
To scrape a bit more detail, you can try this....
use 5.010;
use pQuery;

my @tv_progs;

pQuery( 'http://tv.yahoo.com/listings' )
    ->find( 'li div strong' )->each(
        sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            $tv_progs[$n]{time} = $pQ->text;    # Nth time slot
        }
    )
    ->end
    ->find( '.showTitle' )->each(
        sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            $tv_progs[$n]{name} = $pQ->text;    # Nth show title
        }
    );

for my $prog (@tv_progs) {
    say $prog->{name} . " @ " . $prog->{time};
}
# => Local Programming @ 4:00pm - 6:30pm
And to get the channel....
use 5.010;
use pQuery;

pQuery( 'http://tv.yahoo.com/listings' )
    ->find( '.chhdr a' )->each(
        sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            say $pQ->text;
        }
    );
# => ABC
However, matching the channel back to the programme info will require a bit of work ;-)
Source: https://stackoverflow.com/questions/221091/how-can-i-extract-xml-of-a-website-and-save-in-a-file-using-perls-lwp