I know that it is easy to match anything except a given character using a regular expression.
$text = "ab ac ad";
$text =~ s/[^c]*//g; # Match anything, except
Update: In a comment on your question, you mentioned that you want to clean wiki markup and remove balanced sequences of {{ ... }}. Section 6 of the Perl FAQ covers this: Can I use Perl regular expressions to match balanced text?
Consider the following program:
#! /usr/bin/perl

use warnings;
use strict;

use Text::Balanced qw/ extract_tagged /;

# for demo only
*ARGV = *DATA;

while (<>) {
  if (s/^(.+?)(?=\{\{)//) {
    print $1;
    my(undef,$after) = extract_tagged $_, "{{" => "}}";
    if (defined $after) {
      $_ = $after;
      redo;
    }
  }
  print;
}
__DATA__
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. {{delete me}} Sed quis
nulla ut dolor {{me too}} fringilla
mollis {{ quis {{ ac }} erat.
Its output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed quis nulla ut dolor fringilla mollis {{ quis erat.
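As an alternative to Text::Balanced, Perl 5.10 and later support recursive patterns, which can match nested delimiters directly. The following is only a sketch of that different technique, not the program above; strip_braces is a hypothetical helper name, and the pattern assumes that {{ and }} are the only delimiters that matter.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every balanced {{ ... }} group, nested ones included.
sub strip_braces {
    my ($text) = @_;
    $text =~ s/
        \{\{                        # opening delimiter
        (?:
            (?! \{\{ | \}\} ) .     # one character that does not start a delimiter
          |
            (?R)                    # or a nested balanced group, by recursion
        )*
        \}\}                        # closing delimiter
    //gxs;
    return $text;
}

print strip_braces("elit. {{delete me}} Sed quis"), "\n";   # elit.  Sed quis
print strip_braces("a {{x {{y}} z}} b"), "\n";              # a  b
print strip_braces("mollis {{ quis {{ ac }} erat."), "\n";  # mollis {{ quis  erat.
```

An unmatched {{ is left in place, as in the last call, because the pattern only matches when a closing }} exists for it.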
For your particular example, you could use
$text =~ s/[^ac]|a(?!c)|(?<!a)c//g;
That is, only delete an a or c when they aren't part of an ac sequence.
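A quick check of that substitution, written out in full with a negative lookbehind, (?<!a)c, as the third alternative. The wrapper keep_only_ac is a name assumed here for illustration.

```perl
use strict;
use warnings;

# Assumed helper name, for illustration only.
sub keep_only_ac {
    my ($text) = @_;
    # Delete anything that is not a or c, any a not followed by c,
    # and any c not preceded by a.
    $text =~ s/[^ac]|a(?!c)|(?<!a)c//g;
    return $text;
}

print keep_only_ac("ab ac ad"), "\n";     # ac
print keep_only_ac("cab acac ca"), "\n";  # acac
```

The lookarounds inspect the original string rather than the partially edited one, so an a and a c that only become adjacent after a deletion are still removed.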
In general, this is tricky to do with a regular expression.
Say you don't want foo followed by optional whitespace and then bar in $str. Often, it's clearer and easier to check separately. For example:
die "invalid string ($str)"
  if $str =~ /^.*foo\s*bar/;
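The same separate check can be wrapped in a small predicate so that it is easy to try against several inputs. This is a minimal sketch; is_acceptable is a hypothetical name, and it drops the leading ^.* on the assumption that the forbidden sequence should be rejected anywhere in the string, including after a newline.

```perl
use strict;
use warnings;

# Hypothetical validator: true when $str contains no foo, optional
# whitespace, bar sequence.
sub is_acceptable {
    my ($str) = @_;
    return $str =~ /foo\s*bar/ ? 0 : 1;
}

for my $str ("foo baz", "food bar", "foo   bar", "foobar") {
    printf "%-12s %s\n", "'$str'", is_acceptable($str) ? "ok" : "rejected";
}
```

The first two strings pass, and the last two are rejected.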
You might also be interested in an answer to a similar question, where I wrote
my $nofoo = qr/
( [^f] |
f (?! o) |
fo (?! o \s* bar)
)*
/x;
my $pattern = qr/^ $nofoo bar /x;
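To see how that pattern behaves, here is a small self-contained sketch that repeats the two definitions and tries a few strings. The wrapper bar_without_foo and the sample inputs are assumptions made for illustration.

```perl
use strict;
use warnings;

my $nofoo = qr/
  ( [^f] |
    f  (?! o) |
    fo (?! o \s* bar)
  )*
/x;

my $pattern = qr/^ $nofoo bar /x;

# Hypothetical wrapper: 1 if $pattern matches, 0 otherwise.
sub bar_without_foo { return $_[0] =~ $pattern ? 1 : 0 }

print bar_without_foo("baz bar"),  "\n";   # 1
print bar_without_foo("food bar"), "\n";   # 1
print bar_without_foo("foo bar"),  "\n";   # 0
print bar_without_foo("foobar"),   "\n";   # 0
```

The prefix can step over an f only when it does not begin foo followed by optional whitespace and bar, which is why the last two strings fail.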
To understand the complication, read How Regexes Work by Mark Dominus. The engine compiles regular expressions into state machines. When it's time to match, it feeds the input string to the state machine and checks whether the state machine finishes in an accept state. So to exclude a string, you have to specify a machine that accepts all inputs except a particular sequence.
What might help is a hypothetical /v regular expression switch that creates the state machine as usual but then complements the accept-state bit for all states. It's hard to say whether this would really be useful compared with separate checks, because a /v regular expression may still surprise people, just in different ways.
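The complement idea can be illustrated with a hand-built state machine. The sketch below is a toy, not how Perl's engine is implemented: it hard-codes a three-state machine that accepts strings containing ab, then flips the accept bit to get a machine that accepts exactly the strings without ab. All names are made up for the example.

```perl
use strict;
use warnings;

# Toy DFA that accepts any string containing "ab".
# States: 0 = nothing useful seen, 1 = just saw "a", 2 = saw "ab" (absorbing).
my %next = (
    0 => sub { $_[0] eq 'a' ? 1 : 0 },
    1 => sub { $_[0] eq 'a' ? 1 : $_[0] eq 'b' ? 2 : 0 },
    2 => sub { 2 },
);
my %accept = (0 => 0, 1 => 0, 2 => 1);

# Feed the input to the machine one character at a time; return the final state.
sub final_state {
    my ($input) = @_;
    my $state = 0;
    $state = $next{$state}->($_) for split //, $input;
    return $state;
}

# Original machine: accept if the final state is an accept state.
sub contains_ab { return $accept{ final_state($_[0]) } ? 1 : 0 }

# Complemented machine: same transitions, every accept bit flipped.
sub lacks_ab { return $accept{ final_state($_[0]) } ? 0 : 1 }

print contains_ab("cab"), " ", lacks_ab("cab"), "\n";   # 1 0
print contains_ab("ba"),  " ", lacks_ab("ba"),  "\n";   # 0 1
```

Flipping the accept bits works here because the machine is deterministic and has a transition for every input character.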
If you're interested in the theoretical details, see An Introduction to Formal Languages and Automata by Peter Linz.