I know that it is easy to match anything except a given character using a regular expression.
$text = "ab ac ad";
$text =~ s/[^c]*//g; # Match anything, except c.
Update: In a comment on your question, you mentioned you want to clean wiki markup and remove balanced sequences of {{ ... }}. Section 6 of the Perl FAQ covers this: Can I use Perl regular expressions to match balanced text?
Consider the following program:
#! /usr/bin/perl
use warnings;
use strict;
use Text::Balanced qw/ extract_tagged /;

# for demo only
*ARGV = *DATA;

while (<>) {
  if (s/^(.+?)(?=\{\{)//) {
    print $1;
    my(undef,$after) = extract_tagged $_, "{{" => "}}";
    if (defined $after) {
      $_ = $after;
      redo;
    }
  }
  print;
}
__DATA__
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. {{delete me}} Sed quis
nulla ut dolor {{me too}} fringilla
mollis {{ quis {{ ac }} erat.
Its output:
Lorem ipsum dolor sit amet, consectetur
adipiscing elit.  Sed quis
nulla ut dolor  fringilla
mollis {{ quis  erat.
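If pulling in Text::Balanced is not an option, a recursive pattern is another way to approach the same job. This is a different technique from the program above, offered only as a sketch: it assumes Perl 5.10 or later (for `(?R)`), and the helper name `strip_templates` is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;   # recursive patterns such as (?R) need Perl 5.10 or later

# Hypothetical helper: remove every balanced, possibly nested, double-brace group.
sub strip_templates {
    my ($text) = @_;
    $text =~ s/
        \{\{
        (?:
            (?> [^{}]+ )    # a run of non-brace characters (atomic, limits backtracking)
          | \{ (?! \{ )     # a lone opening brace
          | \} (?! \} )     # a lone closing brace
          | (?R)            # a nested group: recurse into the whole pattern
        )*
        \}\}
    //gx;
    return $text;
}

print strip_templates("adipiscing elit. {{delete me}} Sed quis"), "\n";
print strip_templates("outer {{ a {{ nested }} b }} done"), "\n";
print strip_templates("mollis {{ quis {{ ac }} erat."), "\n";
```

As with the Text::Balanced version, an opening `{{` that never gets its closing `}}` is left in place, so the last line keeps its stray `{{ quis`.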
For your particular example, you could use
$text =~ s/[^ac]|a(?!c)|(?<!a)c//g;
That is, delete every character that is not an a or c, and delete an a or c only when it isn't part of an ac sequence.
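A quick way to convince yourself is to run the substitution on a couple of inputs. This is a minimal sketch; the helper name `keep_only_ac` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: delete everything that is not part of an "ac" pair.
sub keep_only_ac {
    my ($text) = @_;
    # [^ac]    any character other than a or c
    # a(?!c)   an a that is not followed by c
    # (?<!a)c  a c that is not preceded by a
    $text =~ s/[^ac]|a(?!c)|(?<!a)c//g;
    return $text;
}

print keep_only_ac("ab ac ad"), "\n";   # prints "ac"
print keep_only_ac("abacadac"), "\n";   # prints "acac"
```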
In general, this is tricky to do with a regular expression.
Say you don't want foo followed by optional whitespace and then bar in $str. Often, it's clearer and easier to check separately. For example:
die "invalid string ($str)"
  if $str =~ /^.*foo\s*bar/;
You might also be interested in an answer to a similar question, where I wrote:
my $nofoo = qr/
  (      [^f] |
    f  (?! o) |
    fo (?! o \s* bar)
  )*
/x;
my $pattern = qr/^ $nofoo bar /x;
To understand the complication, read How Regexes Work by Mark Dominus. The engine compiles regular expressions into state machines. When it's time to match, it feeds the input string to the state machine and checks whether the state machine finishes in an accept state. So to exclude a string, you have to specify a machine that accepts all inputs except a particular sequence.
What might help is a /v regular expression switch that creates the state machine as usual but then complements the accept-state bit for all states. It's hard to say whether this would really be useful as compared with separate checks because a /v regular expression may still surprise people, just in different ways.
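To make the accept-state idea concrete, here is a minimal hand-rolled sketch. It is not how Perl's engine is implemented; the three-state machine and the names `final_state`, `contains_ac`, and `lacks_ac` are made up for illustration. Flipping which states count as accepting works this directly only because the machine is deterministic and has a transition for every input character.

```perl
use strict;
use warnings;

# Hand-built state machine for "the input contains ac".
#   state 0: nothing useful seen yet
#   state 1: the last character was a
#   state 2: ac has been seen (the machine never leaves this state)
sub final_state {
    my ($input) = @_;
    my $state = 0;
    for my $ch (split //, $input) {
        last if $state == 2;
        if    ($ch eq 'a')                { $state = 1 }
        elsif ($ch eq 'c' && $state == 1) { $state = 2 }
        else                              { $state = 0 }
    }
    return $state;
}

# The original machine accepts only in state 2.
sub contains_ac { return final_state($_[0]) == 2 ? 1 : 0 }

# Complementing the accept states (0 and 1 accept, 2 rejects) gives a
# machine for "the input does not contain ac".
sub lacks_ac { return final_state($_[0]) != 2 ? 1 : 0 }

for my $s ("ab ac ad", "ab af ad") {
    printf "%-12s contains=%d lacks=%d\n", "'$s'", contains_ac($s), lacks_ac($s);
}
```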
If you're interested in the theoretical details, see An Introduction to Formal Languages and Automata by Peter Linz.
$text =~ s/[^c]*//g; # Match anything, except c.
@ssn, A couple of comments about your question:
How would I "match anything, except 'ac'" ? Tried [^(ac)] and [^"ac"] without success.
Please read the documentation on character classes (see "perldoc perlre" on your command line, or online at http://perldoc.perl.org/perlre.html). It states that a bracketed list of characters will "match any character from the list". In other words, order is irrelevant and there are no "strings" inside the brackets, only a list of characters. Parentheses and double quotes also have no special meaning inside square brackets.
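A small sketch shows what [^(ac)] actually does: it is a negated class of the four characters (, a, c, and ), so it matches any single character that is none of those. The helper name `outside_class` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character is matched by [^(ac)].
sub outside_class {
    my ($ch) = @_;
    return $ch =~ /^[^(ac)]$/ ? 1 : 0;
}

for my $ch ('a', 'c', '(', ')', 'b', ' ') {
    my $verdict = outside_class($ch) ? "matches" : "does not match";
    print "'$ch' $verdict [^(ac)]\n";
}
```

Only 'b' and the space are reported as matching, which is why the class cannot be used to exclude the two-character string "ac".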
Now I'm not exactly sure why you're talking about matching but then giving an example of substitution. But to see if a string does not match the sub-string "ac" you just need to negate the match:
use strict; use warnings;
my $text = "ab ac ad";
if ($text !~ m/ac/) {
    print "Yay, the text doesn't match 'ac'!\n"; # not printed here, because $text contains "ac"
}
Say you have a string of text within which are embedded multiple occurrences of a substring. If you just want the text surrounding the sub-string, just remove all occurrences of the sub-string:
$text =~ s/ac//g;
If you want the reverse - to remove all text except for all occurrences of the sub-string, I would suggest something like:
use strict; use warnings;
my $text = "ab ac ad ac ae";
my $sub_str = "ac";
my @captured = $text =~ m/($sub_str)/g;
my $num = scalar @captured;
print (($sub_str x $num) . "\n");
This counts the number of times the sub-string appears in the text and prints the sub-string that many times using the "x" operator. It's not very elegant; I'm sure a Perl guru could come up with something better.
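One possibly tidier variant, offered as a sketch rather than the definitive way: in list context, m//g without capture groups returns every matched substring, so the matches can be joined directly. The helper name `keep_occurrences` is made up for illustration, and \Q...\E is added on the assumption that the sub-string should be matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the occurrences of $sub_str found in $text.
sub keep_occurrences {
    my ($text, $sub_str) = @_;
    # \Q...\E quotes regex metacharacters so $sub_str is matched literally.
    return join '', $text =~ /\Q$sub_str\E/g;
}

print keep_occurrences("ab ac ad ac ae", "ac"), "\n";   # prints "acac"
```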
@ennuikiller:
my $text = "ab ac ad";
$text !~ s/(ac)//g; # Match anything, except ac.
This is incorrect: under "use warnings" it generates a warning ("Useless use of negative pattern binding (!~) in void context"), and all it does is remove every substring "ac" from the text, which is more simply written, as above, as:
$text =~ s/ac//g;
You can easily modify this regex for your purpose.
use Test::More 0.88;
#Match any whole text that does not contain a string
my $re=qr/^(?:(?!ac).)*$/;
my $str='ab ac ad';
ok($str !~ $re); # not !$str =~ $re, which parses as (!$str) =~ $re
$str='ab af ad';
ok($str=~$re);
done_testing();
The following solves the question as understood in the second sense described in Bart K.'s comment:
>> $text='ab ac ad';
>> $text =~ s/(ac)|./$1/g;
>> print $text;
ac
Also, 'abacadac' -> 'acac'
It should be noted though that in most practical applications negative lookaheads prove to be more useful than this approach.
If you only want to check that the string does not contain "ac", negate the match.
$text = "ab ac ad";
print "ac not found" if $text !~ /ac/;
or
print "ac not found" unless $text =~ /ac/;
Alternatively, you can use index():
$text = "ab ac ad";
print "ac not found" if ( index($text,"ac") == -1 );