POSIX defines two flavors of regular expression syntax: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
Python re module
Except for some similarity in the syntax, the re module doesn't follow the POSIX standard for regular expressions.
A POSIX regular expression engine (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost-longest match, while the re module is a backtracking engine that finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).
The difference in the matching semantics can be observed when matching (Prefix|PrefixSuffix) against PrefixSuffix.
A POSIX-compliant engine will match PrefixSuffix, the longest of the leftmost matches.
The re engine (and many other backtracking regex engines) will match Prefix only, since Prefix is specified first in the alternation.
The difference can also be seen when matching (xxx|xxxxx)* against xxxxxxxxxx (a string of 10 x's):
On Cygwin:
$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx
All 10 x's are matched.
In Python:
>>> import re
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'
Only 9 x's are matched, since the engine picks the first item in the alternation, xxx, in all 3 repetitions, and nothing forces it to backtrack and try the second item in the alternation.
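Both cases can be reproduced in a few lines of Python. The sketch below also shows two ways to make re return the longer match: listing the longer alternative first, and appending an anchor that forces backtracking. Both are workarounds for these specific patterns, not a general emulation of the POSIX leftmost-longest rule, and the variable names are only illustrative.

```python
import re

text = "PrefixSuffix"

# Backtracking engine: the first alternative that lets the match succeed wins.
first_wins = re.search(r"(Prefix|PrefixSuffix)", text).group(0)

# Listing the longer alternative first gives the POSIX result for this
# particular pattern (a workaround, not leftmost-longest semantics).
longer_first = re.search(r"(PrefixSuffix|Prefix)", text).group(0)

ten_x = "x" * 10

# Nothing after the group can fail, so the engine never revisits its choices.
no_backtrack = re.search(r"(?:xxx|xxxxx)*", ten_x).group(0)

# The trailing $ fails after 9 x's, forcing the engine to backtrack until it
# finds the xxxxx + xxxxx split that reaches the end of the string.
forced = re.search(r"(?:xxx|xxxxx)*$", ten_x).group(0)

print(first_wins)         # Prefix
print(longer_first)       # PrefixSuffix
print(len(no_backtrack))  # 9
print(len(forced))        # 10
```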
Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.
Taking the equivalence class expression as an example, from the documentation:
An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ("[=" and "=]") delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]
Since these features heavily depend on the locale settings, the same regex may behave differently in different locales. The behavior also depends on the locale data installed on the system, which defines the collation order.
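The re module has no syntax for equivalence classes, so a pattern like [[=a=]b] has to be expanded by hand. The following is only a rough sketch of one way to approximate it: it treats characters as "equivalent" when their Unicode canonical decomposition starts with the same base letter. That is an assumption made for illustration; it ignores the locale entirely and is not the collation-based definition POSIX uses. The helper name approx_equivalence_class and the sample alphabet are hypothetical.

```python
import re
import unicodedata

def approx_equivalence_class(base, alphabet):
    """Collect the characters of `alphabet` whose canonical (NFD)
    decomposition starts with `base`.

    Hypothetical helper: an approximation based on Unicode data,
    not on the locale's collation order as POSIX specifies.
    """
    return "".join(
        ch for ch in alphabet
        if unicodedata.normalize("NFD", ch)[0] == base
    )

# Sample alphabet (assumption): a, a-grave, a-circumflex, b, e-acute
alphabet = "a\u00e0\u00e2b\u00e9"

# Rough stand-in for the POSIX bracket expression [[=a=]b]
members = approx_equivalence_class("a", alphabet)
pattern = re.compile("[" + re.escape(members) + "b]")

# Matches a-circumflex and b, but not c or e-acute
found = pattern.findall("\u00e2bc\u00e9")
print(members)
print(found)
```

Unlike a real equivalence class, the expanded character class is fixed when the pattern is built and does not change with the locale.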
re regular expression features
re borrows its syntax from Perl, but not all features of Perl regex are implemented in re. Below are some regex features available in re that are unavailable in POSIX regular expressions:
Greedy/lazy quantifiers, which specify the order in which a quantifier is expanded.
While people usually call the * in POSIX greedy, in POSIX it actually only specifies the lower and upper bounds of the repetition. The so-called "greedy" behavior is due to the leftmost-longest match rule.
(?(id/name)yes-pattern|no-pattern)
\b, \s, \d, \w (some POSIX regular expression engines may implement these, since the standard leaves the behavior undefined for these cases)
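As a quick illustration of the features listed above, here is a small sketch of how each behaves in re. The sample patterns and strings are made up for the example.

```python
import re

# Greedy vs. lazy: the same quantifier, expanded in opposite orders.
greedy = re.search(r"<.+>", "<a><b>").group(0)   # tries the longest run first
lazy = re.search(r"<.+?>", "<a><b>").group(0)    # tries the shortest run first

# Conditional pattern: require ")" only if group 1 matched "(".
# The no-pattern is omitted here, so it defaults to matching nothing.
balanced = re.compile(r"(\()?\d+(?(1)\))")
samples = ("12", "(12)", "(12", "12)")
results = [bool(balanced.fullmatch(s)) for s in samples]

# Shorthand classes: whole-word runs of digits.
numbers = re.findall(r"\b\d+\b", "a1 22 b 333")

print(greedy)   # <a><b>
print(lazy)     # <a>
print(results)  # [True, True, False, False]
print(numbers)  # ['22', '333']
```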