WWW::Mechanize Extraction Help - PERL

Submitted by 青春壹個敷衍的年華 on 2019-12-11 13:59:14

Question


I'm trying to automate the extraction of a transcript from a website. The entire transcript sits between dl tags, since the site formats the interview as a description list. The script below lets me search the page and extract the text as plain text, but what I actually want is everything between the dl tags, including the dd and dt elements. That would let us write our own CSS for the interview.

One thing to note about the page: break (br) tags are inserted at various points in the interview. Some tools we've tried that extract content between paired tags have had trouble with this, because they only grab the content up to the first break. Just something to keep in mind if you point me in a different direction. Here's what I have so far.

#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");

# find all <dl> tags
my @list = $mech->find('dl');

foreach my $dl (@list) {
    print $dl->as_text();
}

If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!


Answer 1:


Your code is fine. Just change the as_text() call to as_HTML() and the output will include the HTML tags.
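
To try the change without hitting the network, here is a minimal sketch that parses a small inline snippet with HTML::TreeBuilder directly. The sample markup and the dl_fragments helper are made up for illustration; they are not the real page or part of any module. One thing worth knowing: as far as I know, as_HTML() leaves out optional end tags such as </dt> and </dd> by default, so the sketch passes an empty hash as the third argument to have every closing tag written out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample markup standing in for the real page (an assumption).
my $html = <<'HTML';
<html><body>
<dl>
  <dt>Interviewer</dt>
  <dd>First question.<br>Continued after a line break.</dd>
  <dt>Albright</dt>
  <dd>First answer.</dd>
</dl>
</body></html>
HTML

# Hypothetical helper: return each <dl> element serialized as HTML.
sub dl_fragments {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # as_HTML arguments: encode only <, > and &; indent with two spaces;
    # {} means no end tag is treated as optional, so </dt> and </dd> are kept.
    my @fragments = map { $_->as_HTML('<>&', '  ', {}) } $tree->find('dl');

    $tree->delete;    # release the parse tree
    return @fragments;
}

print "$_\n" for dl_fragments($html);
```

In the original script the same call would be $dl->as_HTML('<>&', '  ', {}) inside the loop. The br inside the dd is just another child element of the tree, so the text after it is not cut off.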



Source: https://stackoverflow.com/questions/32337580/wwwmechanize-extraction-help-perl
