Here on SO people sometimes say something like "you cannot parse X with regular expressions, because X is not a regular language". From my understanding however, modern re
Modern regex engines can certainly recognize a larger set of languages than the regular languages. That said, none of the four classic Chomsky classes is exactly the set recognized by regexes. All regular languages are clearly recognized by regexes. There are some classic context-free languages that cannot be recognized by regexes, such as the balanced-parenthesis language a^n b^n, unless backreferences with counting are available. However, a regex with backreferences can recognize the language ww, which is context-sensitive (and not context-free).
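As a minimal sketch of the ww claim, the following uses Python's standard `re` module, whose backreferences are enough for this language. The names `SQUARE` and `is_square` are made up for illustration, and w is assumed to be non-empty.

```python
import re

# \1 must repeat exactly the text captured by group 1, so a full match
# means the input has the form ww with w non-empty.
# DOTALL lets "." match newlines too, so w may be any string.
SQUARE = re.compile(r"(.+)\1", re.DOTALL)


def is_square(s: str) -> bool:
    """Return True if s consists of some non-empty string written twice."""
    return SQUARE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["abab", "abcabc", "aba", "abba", ""]:
        print(repr(candidate), is_square(candidate))
```

The engine tries each possible length for the first half and backtracks when the backreference fails, which is cheap here but hints at why backreferences make matching expensive in general.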
Actually, regular expressions in formal language theory are only loosely related to regexes. Matching regexes with unlimited backreferences is NP-complete in the most general case, so all known matching algorithms for sufficiently powerful regexes take exponential time in the worst case. For most inputs, however, they are quite fast. Context-free languages can be recognized in about O(n^3) time or slightly better, so there are some languages matched by regexes that are not context-free (like ww), but not all context-free languages can be parsed by regexes. Type 0 languages are undecidable in general, so regexes don't get there.
So as a not very conclusive conclusion, regexes can parse a broad set of languages that include all regular languages, and some context-free and context-sensitive, but it is not exactly equal to any of those sets. There are other categories of languages, and other taxonomies, where you could find a more precise answer, but no taxonomy that includes context-free languages as a proper subset in a hierarchy of languages can provide a single language exactly recognized by regexes, because regexes only intersect in some part with context-free languages, and neither is a proper subset of the other.
You can read about regexes in An Introduction to Language and Linguistics by Ralph W. Fasold and Jeff Connor-Linton, p. 477.
Chomsky Hierarchy:
Type 0 ⊇ Type 1 ⊇ Type 2 ⊇ Type 3
Computational Linguistics mainly features Type 2 & 3 Grammars
• Type 3 grammars:
–Include regular expressions and finite state automata (aka, finite state machines)
–The focal point of the rest of this talk
• Type 2 grammars:
–Commonly used for natural language parsers
–Used to model syntactic structure in many linguistic theories (often supplemented by other mechanisms)
–Will play a key role in the next talk on parsing
Most XML dialects with inter-relational links, such as Microsoft DGML (Directed Graph Markup Language), are examples where regexes are useless.
These three answers may also be useful:
1 - does-lookaround-affect-which-languages-can-be-matched-by-regular-expressions
2 - regular-expressions-arent
3 - where-do-most-regex-implementations-fall-on-the-complexity-scale
I recently wrote a rather long article on this topic: The true power of regular expressions.
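To summarize:
• Regexes can match context-free languages (e.g. a^n b^n).
• Regexes can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
Some examples: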
Matching the context-free language {a^n b^n, n>0}:
/^(a(?1)?b)$/
# or
/^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x
Matching the context-sensitive language {a^n b^n c^n, n>0}:
/^
(?=(a(?-1)?b)c)
a+(b(?-1)?c)
$/x
# or
/^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x
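The patterns above rely on PCRE features such as recursion (`(?1)`, `(?-1)`), which, as far as I know, Python's standard `re` module does not support. If you want to check them against a PCRE-capable engine, a plain reference recognizer for the two languages is handy as ground truth. The sketch below is only that: the function names `is_anbn` and `is_anbncn` are invented for illustration and are not part of any library.

```python
def is_anbn(s: str) -> bool:
    """Reference check for {a^n b^n, n>0}, without any regex."""
    n = len(s) // 2
    # For odd lengths the rebuilt string has a different length, so this is False.
    return n > 0 and s == "a" * n + "b" * n


def is_anbncn(s: str) -> bool:
    """Reference check for {a^n b^n c^n, n>0}, without any regex."""
    n = len(s) // 3
    # Lengths not divisible by 3 can never equal the rebuilt string.
    return n > 0 and s == "a" * n + "b" * n + "c" * n


if __name__ == "__main__":
    for candidate in ["ab", "aabb", "aab", "abc", "aabbcc", "aabbc"]:
        print(candidate, is_anbn(candidate), is_anbncn(candidate))
```

Running each regex over a batch of short strings and comparing its verdict with these functions is a quick way to catch a mistyped pattern.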