I am trying to find all the strings (between " or ') in a file by reading the file line by line.
my @strings = ();
open FILE, $file or die \"File operation faile
The following are two regexes for parsing single-quoted or double-quoted strings. Note that I've slurped all the data so that multiline strings can be caught:
use strict;
use warnings;
my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $data = do {local $/; <DATA>};
while ($data =~ /($squo_re|$dquo_re)/g) {
print "<$1>\n";
}
__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
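The example above reads from the __DATA__ section. To run the same regexes against a real file, something like the following sketch could work. The helper names slurp_file and extract_strings are hypothetical, chosen only for illustration, and the file path is assumed to arrive as the first command-line argument.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical helper: read a whole file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undef the record separator so <$fh> returns everything
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Hypothetical helper: return every quoted string found in $text.
sub extract_strings {
    my ($text) = @_;
    my @found;
    while ($text =~ /($squo_re|$dquo_re)/g) {
        push @found, $1;
    }
    return @found;
}

# Usage (assumed): perl script.pl some_file.pl
if (@ARGV) {
    print "<$_>\n" for extract_strings(slurp_file($ARGV[0]));
}
```

Slurping the whole file, rather than looping line by line, is what lets a string that spans several lines be matched in one piece.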
However, because you're trying to parse Perl code, the cleanest way of doing it is to use PPI:
use strict;
use warnings;
use PPI;
my $src = do {local $/; <DATA>};
# Load a document
my $doc = PPI::Document->new( \$src );
# Find all the quoted strings within the doc
my $strings = $doc->find( 'PPI::Token::Quote' );
for (@$strings) {
print '<', $_->content, ">\n";
}
__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
Both methods output:
<"single line \n">
<"This is a
multiline
string">
<'single quote string'>
<"string with variable ">
<" after variable">
Update about (?> ... )
The following is an annotated version of the double quote regular expression.
my $dquo_re = qr{
"
(?: # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
(?> # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
[^"\\]* # All characters NOT a " or \
)
|
\\. # Backslash followed by any escaped character
)* # Any number of the preceding or'd group
"
}x;
The independent subexpression (?> ... ) is not actually required for this regex to work. It is intended to prevent backtracking, because there is only one way for a quoted string to match: either we find a closing quote using the above rules or we don't.
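To see what an independent subexpression changes, here is a small sketch using the classic a*ab pattern rather than the quote regex. The helper name matches is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $re, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

my $plain  = qr/a*ab/;       # a* may give back an 'a' so that 'ab' can match
my $atomic = qr/(?>a*)ab/;   # a* keeps every 'a' it consumed, no give-back

my $text = 'aaab';
printf "plain=%d atomic=%d\n", matches($plain, $text), matches($atomic, $text);
```

The plain pattern matches 'aaab', while the atomic one does not, because a* is never allowed to release the final 'a'. In the quote regex the situation is different: whatever follows [^"\\]* must start with a quote or a backslash, which the character class excludes, so giving characters back could never help and the set of matches stays the same with or without (?> ... ).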
The subexpression is a lot more useful when dealing with a recursive regex, but I've always used it in this case. I'll have to benchmark it at a later date to decide whether it's actually just a premature optimization.
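One way such a benchmark might be set up is with the core Benchmark module. This is only a sketch: the sample text, the variable names, and the helper count_matches are assumptions made for illustration, and no timing results are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dquo_atomic = qr{"(?:(?>[^"\\]*)|\\.)*"};   # with the independent subexpression
my $dquo_plain  = qr{"(?:[^"\\]*|\\.)*"};       # same pattern without it

# Assumed sample input: 1000 lines, each holding one terminated
# double-quoted string that contains escaped quotes.
my $sample = qq{print "some \\"escaped\\" text", 'x';\n} x 1000;

# Hypothetical helper: count how many times $re matches in $text.
sub count_matches {
    my ($re, $text) = @_;
    my $n = () = $text =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    atomic => sub { count_matches($dquo_atomic, $sample) },
    plain  => sub { count_matches($dquo_plain,  $sample) },
});
```

The sample only contains terminated strings. Input with unterminated quotes, where a failed match forces the engine to backtrack, may behave quite differently and would need its own test.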
Update about Comments
To avoid comments, you can just use the PPI solution that I already proposed. It's meant to parse Perl code and will already work as it is.
However, given this is a lab assignment, a regex solution would be to set up a second capturing group in your loop for finding comments:
while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
my $quote = $1;
my $comment = $2;
if (defined $quote) {
print "<$quote>\n";
} elsif (defined $comment) {
print "Comment - $comment\n";
}
}
The above will match either a quoted string or a comment. Only the capture group that actually matched will be defined, so you can tell which one was found. You will have to come up with the regular expression for finding a comment on your own though.
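To show how the two capture groups interact, here is a runnable sketch of that loop. The $comment_re below is a deliberately naive placeholder and an assumption on my part, not a finished answer: it treats any # outside a quoted string as the start of a comment, which real Perl code can break (for example $#array, or # used as a regex delimiter). The helper name scan is also hypothetical.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Assumption: a comment is a '#' running to the end of the line.
my $comment_re = qr{#[^\n]*};

# Hypothetical helper: return [kind, text] pairs in order of appearance.
sub scan {
    my ($data) = @_;
    my @tokens;
    while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
        my $quote   = $1;
        my $comment = $2;
        if (defined $quote) {
            push @tokens, [ quote => $quote ];
        } elsif (defined $comment) {
            push @tokens, [ comment => $comment ];
        }
    }
    return @tokens;
}

my $data = <<'END';
print "not a # comment";   # but "this" is
END

for my $token (scan($data)) {
    my ($kind, $text) = @$token;
    print "$kind: <$text>\n";
}
```

Because the regex engine consumes a whole quoted string before moving on, the # inside "not a # comment" is never seen as a comment start. Likewise, once the comment has matched to the end of the line, the "this" inside it is not reported as a string.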