Most UNIX regular expressions have, besides the usual `*`, `+`, `?` operators, a backslash operator where `\1`, `\2`, ... match whatever the corresponding parenthesized group matched, so for example `(a*)b\1` matches the (non-regular) language *L = a^n b a^n*.
On one hand, this seems to be pretty powerful, since you can write `(a*)b\1b\1` to match the language *a^n b a^n b a^n*, which can't even be recognized by a pushdown automaton. On the other hand, I'm pretty sure *a^n b^n* cannot be expressed this way.
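The two patterns above can be tried directly in any engine that supports backreferences. Here is a minimal sketch in Ruby, chosen only because it is easy to run. The variable names and the `whole` helper are made up for illustration, and the `\A`/`\z` anchors are an addition so that the whole string has to match:

```ruby
# \1 must repeat exactly the text that group 1 captured.
anban    = /\A(a*)b\1\z/      # a^n b a^n
anbanban = /\A(a*)b\1b\1\z/   # a^n b a^n b a^n

# Hypothetical helper: true if the regex matches the entire string.
whole = lambda { |re, s| !re.match(s).nil? }

p whole.call(anban, "aabaa")        # true  (n = 2)
p whole.call(anban, "aaba")         # false (2 a's, then only 1)
p whole.call(anbanban, "aabaabaa")  # true  (n = 2)
p whole.call(anbanban, "aababa")    # false
```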
I have two questions:
- Is there any literature on this family of languages (UNIX-y regular)? In particular, is there a version of the pumping lemma for these?
- Can someone prove, or disprove, that *a^n b^n* cannot be expressed this way?
You're probably looking for:
- Benjamin Carle and Paliath Narendran: On Extended Regular Expressions, LNCS 5457.
- C. Campeanu, K. Salomaa, S. Yu: A formal study of practical regular expressions, International Journal of Foundations of Computer Science, Vol. 14 (2003), 1007-1018.
And of course, follow their citations forward and backward to find more literature on this subject.
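*a^n b^n* is a context-free language (CFL). A grammar for it is
A -> aAb | e
where e denotes the empty string. You can use the pumping lemma for regular languages to prove that this language is not regular.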
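For reference, the standard pumping-lemma argument runs roughly as follows. This is the usual textbook sketch; the symbols p, x, y, z, k come from the lemma, not from the question. It only shows that the language is not regular in the classical sense, and says nothing by itself about expressions with backreferences.

```latex
\begin{align*}
&\text{Assume } L = \{a^n b^n : n \ge 0\} \text{ is regular, with pumping length } p. \\
&\text{Take } w = a^p b^p \in L \text{ and any split } w = xyz \text{ with } |xy| \le p,\ |y| \ge 1. \\
&\text{Then } y = a^k \text{ for some } k \ge 1, \text{ so } x y^2 z = a^{p+k} b^p \notin L. \\
&\text{This contradicts the lemma, hence } L \text{ is not regular.}
\end{align*}
```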
Ruby 1.9.1 supports the following regex:
regex = %r{ (?<foo> a\g<foo>a | b\g<foo>b | c) }x
p regex.match("aaacbbb")
# => #<MatchData "c" foo:"c"> (the pattern is unanchored, so only the lone "c" is matched)
"Fun with Ruby 1.9 Regular Expressions" has an example where the author arranges all the parts of a regex so that it looks like a context-free grammar, as follows:
sentence = %r{
  (?<subject>   cat   | dog    | gerbil    ){0}
  (?<verb>      eats  | drinks | generates ){0}
  (?<object>    water | bones  | PDFs      ){0}
  (?<adjective> big   | small  | smelly    ){0}
  (?<opt_adj>   (\g<adjective>\s)?         ){0}
  The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x
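To see what the grammar-style pattern does in practice, here is a small usage sketch. The pattern is repeated so that the snippet runs on its own, and the test sentences are chosen for illustration. Each `{0}` group acts purely as a named rule definition: it is never matched where it is written, only where `\g<name>` calls it.

```ruby
# Same pattern as in the quoted example, repeated so this snippet is self-contained.
sentence = %r{
  (?<subject>   cat   | dog    | gerbil    ){0}
  (?<verb>      eats  | drinks | generates ){0}
  (?<object>    water | bones  | PDFs      ){0}
  (?<adjective> big   | small  | smelly    ){0}
  (?<opt_adj>   (\g<adjective>\s)?         ){0}
  The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x

md = sentence.match("The cat drinks water")
puts md[:subject]   # cat
puts md[:verb]      # drinks

# Optional adjectives are accepted before the subject and the object.
p !sentence.match("The big dog eats smelly bones").nil?   # true

# "barks" is not one of the alternatives in the verb rule.
p sentence.match("The cat barks water")                   # nil
```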
I think this means that at least Ruby 1.9.1's regex engine, which is the Oniguruma regex engine, is effectively equivalent to a context-free grammar, though the capturing groups aren't as useful as what an actual parser generator gives you.
If so, the pumping lemma for context-free languages, rather than the one for regular languages, should be the one that describes the class of languages recognizable by Ruby 1.9.1's regex engine.
EDIT: Whoops! I messed up, and didn't do an important test which actually makes my answer above totally wrong. I won't delete the answer, because it's useful information nonetheless.
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
# I added anchors for the beginning and end of the string
regex.match("aaacbbb")
# returns nil, indicating that no match is possible with recursive capturing groups.
EDIT: Coming back to this many months later, I just discovered that my test in the last edit was incorrect. "aaacbbb" shouldn't be expected to match `regex` even if `regex` does operate like a context-free grammar: the pattern only generates strings that read the same forwards and backwards around a central c, and "aaacbbb" is not one of them.
The correct test should be on a string like "aabcbaa", and that does match the regex:
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
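Coming back to the language from the question: the same subexpression-call syntax (`\g<name>`) can be pointed at *a^n b^n*. Below is a minimal sketch, assuming an Oniguruma/Onigmo-based Ruby. The group name `s` and the variable name are made up, and the pattern covers n >= 1 only. It relies on recursion rather than the `\1`-style backreferences the question asks about, so it does not settle that question.

```ruby
# Grammar-style rule: s -> a s b | ab
anbn_re = %r{\A(?<s> a \g<s> b | ab )\z}x

p !anbn_re.match("ab").nil?       # true  (n = 1)
p !anbn_re.match("aaabbb").nil?   # true  (n = 3)
p !anbn_re.match("aab").nil?      # false (one b short)
p !anbn_re.match("abb").nil?      # false (one b too many)
```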
Source: https://stackoverflow.com/questions/2626605/generalizing-the-pumping-lemma-for-unix-style-regular-expressions