Here's the deal. Is there a way to tokenize strings in a line based on multiple regexes?
One example:
I have to get all href tags and their corresponding text.
From perlop:
A useful idiom for lex-like scanners is /\G.../gc. You can combine several regexps like this to process a string part-by-part, doing different actions depending on which regexp matched. Each regexp tries to match where the previous one leaves off.
LOOP: {
    print(" digits"),       redo LOOP if /\G\d+\b[,.;]?\s*/gc;
    print(" lowercase"),    redo LOOP if /\G[a-z]+\b[,.;]?\s*/gc;
    print(" UPPERCASE"),    redo LOOP if /\G[A-Z]+\b[,.;]?\s*/gc;
    print(" Capitalized"),  redo LOOP if /\G[A-Z][a-z]+\b[,.;]?\s*/gc;
    print(" MiXeD"),        redo LOOP if /\G[A-Za-z]+\b[,.;]?\s*/gc;
    print(" alphanumeric"), redo LOOP if /\G[A-Za-z0-9]+\b[,.;]?\s*/gc;
    print(" line-noise"),   redo LOOP if /\G[^A-Za-z0-9]+/gc;
    print ". That's all!\n";
}
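As a concrete, runnable variant of that idiom (the input string and token classes below are illustrative, not from perlop):
use strict;
use warnings;

my $line = 'foo = 42, bar = "baz";';

# Each branch matches where the previous one left off (\G); /c keeps
# pos() in place on failure so the next alternative can try from there.
TOKEN: {
    print("ident($1) "),  redo TOKEN if $line =~ /\G([A-Za-z_]\w*)\s*/gc;
    print("number($1) "), redo TOKEN if $line =~ /\G(\d+)\s*/gc;
    print("string($1) "), redo TOKEN if $line =~ /\G"([^"]*)"\s*/gc;
    print("punct($1) "),  redo TOKEN if $line =~ /\G([=,;])\s*/gc;
    print "done\n";
}
# => ident(foo) punct(=) number(42) punct(,) ident(bar) punct(=) string(baz) punct(;) done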
Sounds like you really just want to parse HTML; I recommend looking at one of the wonderful packages for doing so, such as BeautifulSoup. Or you can use a general-purpose parser library instead.
This example is from the BeautifulSoup Documentation:
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

# 'doc' here is an HTML string defined earlier in the documentation
links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]
# [<a href="http://www.bob.com/">success</a>,
# <a href="http://www.bob.com/plasma">experiments</a>,
# <a href="http://www.boogabooga.net/">BoogaBooga</a>]
linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))
[tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
# [<a href="http://www.bob.com/">success</a>,
# <a href="http://www.bob.com/plasma">experiments</a>]
Look at the documentation for the following modules on CPAN:
HTML::TreeBuilder
HTML::TableExtract
and
Parse::RecDescent
I've used these modules to process quite large and complex web pages; see the sketch below for a taste of the HTML::TreeBuilder approach.
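As a rough idea of that approach, here is a minimal HTML::TreeBuilder sketch for pulling out each href and its text; the sample $html is made up for illustration:
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input; in practice this would be the fetched page source
my $html = '<p><a href="http://example.com/">Example</a> and '
         . '<a href="http://example.org/">Another</a></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);
for my $anchor ($tree->look_down(_tag => 'a')) {
    print $anchor->as_text, ' -> ', ($anchor->attr('href') || ''), "\n";
}
$tree->delete;    # free the parse tree when done
# => Example -> http://example.com/
# => Another -> http://example.org/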
If your problem has anything at all to do with web scraping, I recommend looking at Web::Scraper, which provides easy element selection via XPath or CSS selectors. I have a (German) talk on Web::Scraper, but if you run it through Babelfish or just look at the code samples, those can help you get a quick overview of the syntax.
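As a taste of that syntax, a minimal link-extraction sketch might look like this (the URL and the 'links' field name are illustrative):
use strict;
use warnings;
use URI;
use Web::Scraper;

# Grab the text and href attribute of every anchor on the page
my $link_scraper = scraper {
    process 'a', 'links[]' => { text => 'TEXT', href => '@href' };
};

my $res = $link_scraper->scrape( URI->new('http://www.perl.com/') );
print "$_->{text} -> $_->{href}\n" for @{ $res->{links} || [] };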
Hand-parsing HTML is onerous and won't give you much over using one of the premade HTML parsers. If your HTML is of very limited variation, you can get by with clever regular expressions, but if you're already breaking out hard-core parser tools, it sounds as if your HTML is far less regular than what is sane to parse with regular expressions.
Also check out pQuery; it's a really nice Perlish way of doing this kind of stuff....
use 5.010;    # for say
use pQuery;

pQuery( 'http://www.perl.com' )->find( 'a' )->each(
    sub {
        my $pQ = pQuery( $_ );
        say $pQ->text, ' -> ', $pQ->toHtml;
    }
);

# prints all HTML anchors on www.perl.com
# => link text -> anchor HTML
However, if your requirement goes beyond HTML/Web, then here is the earlier "Hello World!" example in Parse::RecDescent...
use strict;
use warnings;
use Parse::RecDescent;
my $grammar = q{
    alpha : /\w+/
    sep   : /,|\s/
    end   : '!'
    greet : alpha sep alpha end { shift @item; return \@item }
};
my $parse = Parse::RecDescent->new( $grammar );
my $hello = "Hello, World!";
print "$hello -> @{ $parse->greet( $hello ) }";
# => Hello, World! -> Hello , World !
Probably too large a hammer to crack this nut ;-)
If you're specifically after parsing links out of web-pages, then Perl's WWW::Mechanize module will figure things out for you in a very elegant fashion. Here's a sample program that grabs the first page of Stack Overflow and parses out all the links, printing their text and corresponding URLs:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get("http://stackoverflow.com/");
$mech->success or die "Oh no! Couldn't fetch stackoverflow.com";
foreach my $link ($mech->links) {
    print "* [", $link->text, "] points to ", $link->url, "\n";
}
In the main loop, each $link is a WWW::Mechanize::Link object, so you're not constrained to just getting the text and URL.
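For instance, the same loop could be extended with a couple of those accessors ($mech is the object from the script above; url_abs and tag are both documented WWW::Mechanize::Link methods):
foreach my $link ($mech->links) {
    # url_abs resolves the link against the page's base URI;
    # tag reports the source element ('a', 'area', 'frame', ...)
    print "* [", ($link->text || ''), "] -> ", $link->url_abs,
          " (from <", $link->tag, ">)\n";
}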
All the best,
Paul