Here's the deal. Is there a way to tokenize strings in a line based on multiple regexes?
One example:
I have to get all href tags and their corresponding text.
From perlop:
A useful idiom for lex-like scanners is /\G.../gc. You can combine several regexps like this to process a string part-by-part, doing different actions depending on which regexp matched. Each regexp tries to match where the previous one leaves off.
LOOP: {
    print(" digits"),       redo LOOP if /\G\d+\b[,.;]?\s*/gc;
    print(" lowercase"),    redo LOOP if /\G[a-z]+\b[,.;]?\s*/gc;
    print(" UPPERCASE"),    redo LOOP if /\G[A-Z]+\b[,.;]?\s*/gc;
    print(" Capitalized"),  redo LOOP if /\G[A-Z][a-z]+\b[,.;]?\s*/gc;
    print(" MiXeD"),        redo LOOP if /\G[A-Za-z]+\b[,.;]?\s*/gc;
    print(" alphanumeric"), redo LOOP if /\G[A-Za-z0-9]+\b[,.;]?\s*/gc;
    print(" line-noise"),   redo LOOP if /\G[^A-Za-z0-9]+/gc;
    print ". That's all!\n";
}
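As a concrete, runnable variant of that idiom (the input string and token classes below are illustrative, not from perlop):
use strict;
use warnings;

my $line = 'foo = 42, bar = "baz";';

# Each branch matches where the previous one left off (\G); /c keeps
# pos() in place on failure so the next alternative can try from there.
TOKEN: {
    print("ident($1) "),  redo TOKEN if $line =~ /\G([A-Za-z_]\w*)\s*/gc;
    print("number($1) "), redo TOKEN if $line =~ /\G(\d+)\s*/gc;
    print("string($1) "), redo TOKEN if $line =~ /\G"([^"]*)"\s*/gc;
    print("punct($1) "),  redo TOKEN if $line =~ /\G([=,;])\s*/gc;
    print "done\n";
}
# => ident(foo) punct(=) number(42) punct(,) ident(bar) punct(=) string(baz) punct(;) done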
Sounds like you really just want to parse HTML; I recommend looking at one of the wonderful packages for doing so, such as BeautifulSoup. Or you can use a general-purpose parser library instead.
This example is from the BeautifulSoup Documentation:
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

# 'doc' here is an HTML string defined earlier in the documentation
links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]
# [<a href="http://www.bob.com/">success</a>,
# <a href="http://www.bob.com/plasma">experiments</a>,
# <a href="http://www.boogabooga.net/">BoogaBooga</a>]
linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))
[tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
# [<a href="http://www.bob.com/">success</a>,
# <a href="http://www.bob.com/plasma">experiments</a>]
Look at the documentation for the following modules on CPAN:
HTML::TreeBuilder
HTML::TableExtract
and
Parse::RecDescent
I've used these modules to process quite large and complex web pages; see the sketch below for a taste of the HTML::TreeBuilder approach.
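As a rough idea of that approach, here is a minimal HTML::TreeBuilder sketch for pulling out each href and its text; the sample $html is made up for illustration:
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input; in practice this would be the fetched page source
my $html = '<p><a href="http://example.com/">Example</a> and '
         . '<a href="http://example.org/">Another</a></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);
for my $anchor ($tree->look_down(_tag => 'a')) {
    print $anchor->as_text, ' -> ', ($anchor->attr('href') || ''), "\n";
}
$tree->delete;    # free the parse tree when done
# => Example -> http://example.com/
# => Another -> http://example.org/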
If your problem has anything at all to do with web scraping, I recommend looking at Web::Scraper, which provides easy element selection via XPath or CSS selectors. I have a (German) talk on Web::Scraper, but if you run it through Babelfish or just look at the code samples, those can help you get a quick overview of the syntax.
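As a taste of that syntax, a minimal link-extraction sketch might look like this (the URL and the 'links' field name are illustrative):
use strict;
use warnings;
use URI;
use Web::Scraper;

# Grab the text and href attribute of every anchor on the page
my $link_scraper = scraper {
    process 'a', 'links[]' => { text => 'TEXT', href => '@href' };
};

my $res = $link_scraper->scrape( URI->new('http://www.perl.com/') );
print "$_->{text} -> $_->{href}\n" for @{ $res->{links} || [] };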
Hand-parsing HTML is onerous and won't give you much over using one of the premade HTML parsers. If your HTML is of very limited variation, you can get by with clever regular expressions, but if you're already breaking out hard-core parser tools, it sounds as if your HTML is far less regular than what is sane to parse with regular expressions.
Also check out pQuery; it's a really nice Perlish way of doing this kind of stuff....
use 5.010;    # for say
use pQuery;

pQuery( 'http://www.perl.com' )->find( 'a' )->each(
    sub {
        my $pQ = pQuery( $_ );
        say $pQ->text, ' -> ', $pQ->toHtml;
    }
);

# prints all HTML anchors on www.perl.com
# => link text -> anchor HTML
However, if your requirement goes beyond HTML/Web, then here is the earlier "Hello World!" example in Parse::RecDescent...
use strict;
use warnings;
use Parse::RecDescent;
my $grammar = q{
    alpha : /\w+/
    sep   : /,|\s/
    end   : '!'
    greet : alpha sep alpha end { shift @item; return \@item }
};
my $parse = Parse::RecDescent->new( $grammar );
my $hello = "Hello, World!";
print "$hello -> @{ $parse->greet( $hello ) }";
# => Hello, World! -> Hello , World !
Probably too large a hammer to crack this nut ;-)
If you're specifically after parsing links out of web-pages, then Perl's WWW::Mechanize module will figure things out for you in a very elegant fashion. Here's a sample program that grabs the first page of Stack Overflow and parses out all the links, printing their text and corresponding URLs:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get("http://stackoverflow.com/");
$mech->success or die "Oh no! Couldn't fetch stackoverflow.com";
foreach my $link ($mech->links) {
    print "* [", $link->text, "] points to ", $link->url, "\n";
}
In the main loop, each $link is a WWW::Mechanize::Link object, so you're not constrained to just getting the text and URL.
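For instance, the same loop could be extended with a couple of those accessors ($mech is the object from the script above; url_abs and tag are both documented WWW::Mechanize::Link methods):
foreach my $link ($mech->links) {
    # url_abs resolves the link against the page's base URI;
    # tag reports the source element ('a', 'area', 'frame', ...)
    print "* [", ($link->text || ''), "] -> ", $link->url_abs,
          " (from <", $link->tag, ">)\n";
}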
All the best,
Paul