I'm having difficulty scraping dates from a specific web page because each date is apparently an argument passed to a JavaScript function. I have written a few simple scrapers in the past without any major issues, so I didn't expect problems, but I am struggling with this one. The page has 5-6 dates in regular yyyy/mm/dd format, like this: dateFormat('2012/02/07')
Ideally I would like to remove everything except the half-dozen dates, which I want to save in an array. At this point, I can't even successfully get one date, let alone all of them. It is probably just a malformed regex that I have been looking at for so long that I can no longer spot the error.
Q1. Why am I not getting a match with the regex below?
Q2. Following on from the above question, how can I scrape all the dates into an array? I was thinking of assuming x dates on the page, looping x times, and assigning the captured group to an array on each iteration, but that seems rather clunky.
Problem code follows.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;
my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
#dateFormat('2012/02/07');
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/; # get any date without regard to greediness etc
Why do you have two whitespace characters in your pattern?
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/;
                                                 ^^^^^
They are not in your format example, dateFormat('2012/02/07').
I would say this is the reason why your pattern does not match.
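To check this without fetching the page, here is a minimal sketch that runs both versions of the pattern against a sample string. The string in $sample is a hypothetical stand-in for the real page content, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the downloaded page content.
my $sample = "document.write(dateFormat('2012/02/07'));";

# Original pattern: \s{2} demands two whitespace characters
# between the date and the closing quote, so this fails.
my $with_ws = $sample =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\)/ ? 1 : 0;

# Same pattern with \s{2} removed; in list context the match
# returns the captured group.
my ($date) = $sample =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2})'\)/;

print "with \\s{2}: ", ($with_ws ? "match" : "no match"), "\n";
print "without:    ", (defined $date ? $date : "no match"), "\n";
```

Note also that without the /s modifier, . does not match newlines, so the .* on either side of the original substitution would only strip text on the same line as the date, not the rest of the page.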
Capture all dates
You can simply get all matches into an array like this
( my @Result ) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
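(?<=dateFormat\(') is a positive lookbehind assertion. It ensures that the literal text dateFormat(' comes immediately before the date pattern, without including it in the match.
(?='\)) is a positive lookahead assertion. It ensures that ') comes immediately after the date pattern, again without including it in the match.
The g modifier makes the pattern search for all matches in the string. In list context, the match operator returns all of them, which is why the result can be assigned directly to an array.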
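As an alternative sketch, the same result can be obtained with a plain capture group instead of lookarounds: in list context, a /g match with one capture group returns every captured value. The multi-line string in $page is a hypothetical sample, not the real page, and the array names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample with several dates on separate lines.
my $page = join "\n",
    "dateFormat('2012/02/07');",
    "dateFormat('2012/02/06');",
    "dateFormat('2012/02/03');";

# Lookaround version: only the date itself is part of each match.
my @lookaround = $page =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;

# Capture-group version: m{} delimiters avoid escaping the slashes,
# and list context with /g returns every $1.
my @captured = $page =~ m{dateFormat\('(\d{4}/\d{2}/\d{2})'\)}g;

print "$_\n" for @captured;
```

Both arrays end up with the same three dates, so no assumption about the number of dates on the page and no explicit loop is needed.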
Source: https://stackoverflow.com/questions/9190107/how-to-scrape-using-lwp-and-a-regex-the-date-argument-to-a-javascript-function