perl multiline string regex

野性不改 2021-01-15 04:29

I am trying to find all the strings (delimited by " or ') in a file by reading the file line by line.

my @strings = ();
open FILE, $file or die "File operation faile


        
1 answer
  • 2021-01-15 05:13

    The following are two regexes for parsing single- and double-quoted strings. Note that I've slurped all the data so that multiline strings can be matched:

    use strict;
    use warnings;
    
    my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
    my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
    
    my $data = do {local $/; <DATA>};
    
    while ($data =~ /($squo_re|$dquo_re)/g) {
        print "<$1>\n";
    }
    
    __DATA__
    print $time . "single line \n";
    print "This is a
    multiline
    string";
    print 'single quote string';
    print "string with variable ".$time." after variable";
    

    However, because you're trying to parse Perl code, the cleanest way to do this is with PPI:

    use strict;
    use warnings;
    
    use PPI;
    
    my $src = do {local $/; <DATA>};
    
    # Load a document
    my $doc = PPI::Document->new( \$src );
    
    # Find all quoted-string tokens in the document
    # (find returns a false value when nothing matches, hence the || [])
    my $strings = $doc->find( 'PPI::Token::Quote' ) || [];
    for (@$strings) {
        print '<', $_->content, ">\n";
    }
    
    __DATA__
    print $time . "single line \n";
    print "This is a
    multiline
    string";
    print 'single quote string';
    print "string with variable ".$time." after variable";
    

    Both methods output:

    <"single line \n">
    <"This is a
    multiline
    string">
    <'single quote string'>
    <"string with variable ">
    <" after variable">
    
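The examples above read from __DATA__. For a file on disk, as in the question, the same slurp idiom can be applied to a lexical filehandle. The following is only a minimal sketch: the helper names slurp_file and extract_quoted, and the command-line handling, are assumptions made for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical helper: read a whole file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # no record separator => read everything at once
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Hypothetical helper: return every quoted string, delimiters included.
sub extract_quoted {
    my ($data) = @_;
    my @found;
    while ($data =~ /($squo_re|$dquo_re)/g) {
        push @found, $1;
    }
    return @found;
}

# Usage: perl script.pl some_file.pl
if (@ARGV) {
    print "<$_>\n" for extract_quoted(slurp_file($ARGV[0]));
}
```

Reading line by line cannot work for this task, because a string that spans several lines is never present in a single line buffer; slurping trades memory for the ability to match across newlines.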

    Update about (?> ... )

    The following is an annotated version of the double quote regular expression.

    my $dquo_re = qr{
        "
            (?:                # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
                (?>            # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
                    [^"\\]*    # All characters NOT a " or \
                )
            |
                \\.            # Backslash followed by any escaped character
            )*                 # Any number of the preceding or'd group
        "
        }x;
    

    The independent subexpression (?> ... ) is not actually required for this regex to work. It is intended to prevent backtracking, because there is only one way for a quoted string to match: either we find a closing quote using the above rules or we don't.
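To make the effect of an independent subexpression concrete, here is a small illustration based on the textbook example from the perlre documentation rather than on the quote regex itself. The helper name is_match is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the pattern matches the string, 0 otherwise.
sub is_match {
    my ($re, $str) = @_;
    return $str =~ $re ? 1 : 0;
}

# Ordinary quantifier: a* first takes all three a's, then gives one back
# so that the trailing "ab" can match.
print is_match(qr/^a*ab/, 'aaab'), "\n";       # prints 1

# Independent subexpression: (?>a*) keeps all three a's and never gives
# any back, so "ab" cannot match.
print is_match(qr/^(?>a*)ab/, 'aaab'), "\n";   # prints 0
```

In the quote regex, whatever follows the [^"\\]* run must be a closing quote or a backslash escape, neither of which the run itself can contain, so giving characters back never leads to a different match. That is why the atomic group changes only how much work the engine does, not which strings are found.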

    The subexpression is a lot more useful when dealing with a recursive regex, but I've always used it in this case. I'll have to benchmark at a later date to decide if it's actually just a premature optimization.
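One possible way to run such a benchmark is with the core Benchmark module. This is a sketch under stated assumptions: the synthetic input, the helper name count_matches, and the variant without (?> ... ) are all made up here for illustration, and the numbers will depend on the Perl version and on the input used.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The regex from the answer, and the same regex without the atomic group.
my $atomic = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $plain  = qr{"(?:[^"\\]*|\\.)*"};

# Assumed test input: 1000 lines, each holding one string with escaped quotes.
my $text = join '', map { qq{print "line $_ with \\"escapes\\"";\n} } 1 .. 1000;

# Hypothetical helper: count how many times the pattern matches.
sub count_matches {
    my ($re, $str) = @_;
    my $n = 0;
    $n++ while $str =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    atomic => sub { count_matches($atomic, $text) },
    plain  => sub { count_matches($plain,  $text) },
});
```

Well-formed input like this mostly exercises the successful path. It may be worth also timing input that contains an unterminated quote, since that is where a failed match forces the engine to explore its backtracking options.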

    Update about Comments

    To avoid matching quotes inside comments, you can just use the PPI solution that I already proposed. It's meant to parse Perl code and will already work as it is.

    However, given this is a lab assignment, a regex solution would be to set up a second capturing group in your loop for finding comments:

    while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
        my $quote   = $1;
        my $comment = $2;
    
        # Only the capture group that matched is defined
        if (defined $quote) {
            print "<$quote>\n";
        } elsif (defined $comment) {
            print "Comment - $comment\n";
        }
    }
    

    The above will match either a quoted string or a comment. Only the capture group that actually matched will be defined, so you can tell which kind was found. You will have to come up with the regular expression for finding a comment on your own though.
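As a starting point for that exercise, the loop can be tried with a deliberately naive comment pattern: a # followed by everything up to the end of the line. This pattern and the helper name scan_source are assumptions, not the answer's own solution. It will misfire on constructs such as $#array, regex literals delimited with #, and POD, which is one more reason PPI is the more robust choice.

```perl
use strict;
use warnings;

my $squo_re    = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re    = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $comment_re = qr{#[^\n]*};   # assumption: naive "# to end of line" pattern

# Hypothetical helper: return [type, text] pairs in the order found.
sub scan_source {
    my ($data) = @_;
    my @found;
    while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
        if (defined $1) {
            push @found, [ quote => $1 ];
        } else {
            push @found, [ comment => $2 ];
        }
    }
    return @found;
}

my $sample = <<'END';
print "has # inside"; # trailing "comment"
# full line
print 'x';
END

printf "%-7s <%s>\n", @$_ for scan_source($sample);
```

Because the scan moves left to right and consumes whichever token starts first, a # inside a quoted string is swallowed by the string match, and a quote inside a comment is swallowed by the comment match.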
