Emulation of lex-like functionality in Perl or Python

梦毁少年i 2021-01-13 23:46

Here's the deal. Is there a way to tokenize the strings in a line based on multiple regexes?

One example:

I have to get all href tags, their corresponding

8 answers
  • 2021-01-14 00:10

    From perlop:

    A useful idiom for lex-like scanners is /\G.../gc. You can combine several regexps like this to process a string part-by-part, doing different actions depending on which regexp matched. Each regexp tries to match where the previous one leaves off.

     LOOP:
        {
          print(" digits"),       redo LOOP if /\G\d+\b[,.;]?\s*/gc;
          print(" lowercase"),    redo LOOP if /\G[a-z]+\b[,.;]?\s*/gc;
          print(" UPPERCASE"),    redo LOOP if /\G[A-Z]+\b[,.;]?\s*/gc;
          print(" Capitalized"),  redo LOOP if /\G[A-Z][a-z]+\b[,.;]?\s*/gc;
          print(" MiXeD"),        redo LOOP if /\G[A-Za-z]+\b[,.;]?\s*/gc;
          print(" alphanumeric"), redo LOOP if /\G[A-Za-z0-9]+\b[,.;]?\s*/gc;
          print(" line-noise"),   redo LOOP if /\G[^A-Za-z0-9]+/gc;
          print ". That's all!\n";
        }
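
Since the question also asks about Python, here is a minimal sketch of the same idea using only the standard `re` module. Python has no `\G` anchor, so the sketch passes the current offset to `Pattern.match(text, pos)` instead, which likewise makes each regex start where the previous match ended. The names `TOKEN_RULES` and `tokenize` are hypothetical, and the patterns are simply copied from the Perl rules above.

```python
import re

# Hypothetical rule table: (token name, pattern), tried in order like the Perl rules.
TOKEN_RULES = [
    ("digits", r"\d+\b[,.;]?\s*"),
    ("lowercase", r"[a-z]+\b[,.;]?\s*"),
    ("UPPERCASE", r"[A-Z]+\b[,.;]?\s*"),
    ("Capitalized", r"[A-Z][a-z]+\b[,.;]?\s*"),
    ("MiXeD", r"[A-Za-z]+\b[,.;]?\s*"),
    ("alphanumeric", r"[A-Za-z0-9]+\b[,.;]?\s*"),
    ("line-noise", r"[^A-Za-z0-9]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_RULES]


def tokenize(text):
    """Yield (token name, matched text) pairs, scanning left to right."""
    pos = 0
    while pos < len(text):
        for name, pattern in COMPILED:
            # match(text, pos) anchors the attempt at pos, playing the role of \G.
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                yield name, m.group()
                pos = m.end()  # the next rule starts where this one left off
                break
        else:
            # No rule matched here; the Perl loop would just stop at this point.
            raise ValueError(f"no rule matches at offset {pos}: {text[pos:pos + 10]!r}")
```

For example, `list(tokenize("aa 42, BB"))` starts with `("lowercase", "aa ")`. Unlike the Perl loop, which silently falls through to the final print, this sketch raises an error when no rule applies; that is a design choice, not part of the perlop idiom.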
    
  • 2021-01-14 00:13

    It sounds like you really just want to parse HTML. I recommend looking at one of the wonderful packages for doing so:

    • BeautifulSoup
    • lxml.html
    • html5lib

    Or! You can use a parser like one of the following:

    • PyParsing
    • DParser - A GLR parser with good Python bindings.
    • ANTLR - A recursive descent parser generator that can generate Python code.

    This example is from the BeautifulSoup Documentation:

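    # BeautifulSoup 3 API, as in the quoted documentation; `doc` is the HTML source string.
    # With the newer bs4 package, import from `bs4` and pass `parse_only=` instead.
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    import re
    
    links = SoupStrainer('a')
    [tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]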
    # [<a href="http://www.bob.com/">success</a>, 
    #  <a href="http://www.bob.com/plasma">experiments</a>, 
    #  <a href="http://www.boogabooga.net/">BoogaBooga</a>]
    
    linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))
    [tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
    # [<a href="http://www.bob.com/">success</a>, 
    #  <a href="http://www.bob.com/plasma">experiments</a>]
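
If installing a third-party package is not an option, the same "use a real parser, not regexes" advice can be followed with the standard library. The sketch below swaps in `html.parser.HTMLParser` (not mentioned in the answer above) to collect each link's `href` together with its text; `LinkCollector` and `extract_links` are hypothetical names, and the handling of nested or malformed anchors is deliberately naive.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(doc)` on an HTML string returns a plain list of tuples, so filtering for a particular host is an ordinary list comprehension rather than another regex over the markup.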
    
  • 2021-01-14 00:15

    Look at the documentation for the following modules on CPAN:

    HTML::TreeBuilder

    HTML::TableExtract

    and

    Parse::RecDescent

    I've used these modules to process quite large and complex web-pages.

  • 2021-01-14 00:15

    If your problem has anything at all to do with web scraping, I recommend looking at Web::Scraper, which provides easy element selection via XPath or CSS selectors. I have a (German) talk on Web::Scraper, but if you run it through Babelfish or just look at the code samples, it can give you a quick overview of the syntax.

    Hand-parsing HTML is onerous and won't give you much over using one of the premade HTML parsers. If your HTML varies very little, you can get by with clever regular expressions, but if you're already breaking out hard-core parser tools, it sounds as if your HTML is far less regular than what is sane to parse with regular expressions.

  • 2021-01-14 00:23

    Also check out pQuery, which is a really nice Perlish way of doing this kind of thing:

    use feature 'say';  # say() below needs this on Perl 5.10+
    use pQuery;
    
    pQuery( 'http://www.perl.com' )->find( 'a' )->each( 
        sub {
            my $pQ = pQuery( $_ ); 
            say $pQ->text, ' -> ', $pQ->toHtml;
        }
    );
    
    # prints all HTML anchors on www.perl.com
    # =>  link text -> anchor HTML
    

    However, if your requirement goes beyond HTML/Web, then here is the earlier "Hello World!" example in Parse::RecDescent:

    use strict;
    use warnings;
    use Parse::RecDescent;
    
    my $grammar = q{
        alpha : /\w+/
        sep   : /,|\s/
        end   : '!'
        greet : alpha sep alpha end { shift @item; return \@item }
    };
    
    my $parse = Parse::RecDescent->new( $grammar );
    my $hello = "Hello, World!";
    print "$hello -> @{ $parse->greet( $hello ) }";
    
    # => Hello, World! -> Hello , World !
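
For comparison on the Python side, the same four-terminal grammar can be sketched by hand with the standard `re` module. This is not a port of Parse::RecDescent: `TERMINALS`, `match_terminal` and `greet` are hypothetical names, and skipping whitespace before each terminal is an assumption made to imitate the Perl module's default behaviour.

```python
import re

# Terminals of the grammar above, keyed by rule name.
TERMINALS = {
    "alpha": re.compile(r"\w+"),
    "sep": re.compile(r",|\s"),
    "end": re.compile(r"!"),
}
# Assumption: optional whitespace is skipped before every terminal.
SKIP = re.compile(r"\s*")


def match_terminal(name, text, pos):
    """Try one terminal at pos; return (matched text, new pos) or None."""
    pos = SKIP.match(text, pos).end()
    m = TERMINALS[name].match(text, pos)
    return (m.group(), m.end()) if m else None


def greet(text):
    """greet : alpha sep alpha end. Return the matched items, or None on failure."""
    pos, items = 0, []
    for name in ("alpha", "sep", "alpha", "end"):
        result = match_terminal(name, text, pos)
        if result is None:
            return None
        token, pos = result
        items.append(token)
    return items
```

With this sketch, `greet("Hello, World!")` returns the four items `Hello`, `,`, `World` and `!`, mirroring the Perl output above. For anything larger than a toy grammar, one of the parser libraries mentioned in the other answers is likely a better fit.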
    

    Probably too large a hammer to crack this nut ;-)

  • 2021-01-14 00:30

    If you're specifically after parsing links out of web-pages, then Perl's WWW::Mechanize module will figure things out for you in a very elegant fashion. Here's a sample program that grabs the first page of Stack Overflow and parses out all the links, printing their text and corresponding URLs:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    
    my $mech = WWW::Mechanize->new;
    
    $mech->get("http://stackoverflow.com/");
    
    $mech->success or die "Oh no! Couldn't fetch stackoverflow.com";
    
    foreach my $link ($mech->links) {
        print "* [",$link->text, "] points to ", $link->url, "\n";
    }
    

    In the main loop, each $link is a WWW::Mechanize::Link object, so you're not just constrained to getting the text and URL.

    All the best,

    Paul
