I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to brush up on my regex.
In the second chapter, Jan has a section on "spec
From experiments, it appears that unlike `)`, the characters `]` and `}` are only interpreted as delimiters when the corresponding opening `[` or `{` has been met.
Though IMO the same rule could apply to `)`, that's the way it is.
This might be due to the way the parser was written: parentheses can be nested, so their balancing needs to be checked, whereas brackets and curly braces are just flagged. (For instance, `[[]` is a valid class definition. `[[]]` is also a valid pattern, but it is understood as `[\[]\]`.)
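This is easy to check by trying each character in isolation. Below is a minimal sketch using Python's `re` module, which is just one flavor and my own choice for illustration; other engines may behave differently. The `compiles` helper is a hypothetical name, not part of any library.

```python
import re
import warnings


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        with warnings.catch_warnings():
            # Recent Python versions warn about "[[" (possible nested set); ignore that here.
            warnings.simplefilter("ignore", FutureWarning)
            re.compile(pattern)
        return True
    except re.error:
        return False


# A lone closing parenthesis is rejected ...
print(compiles(r")"))   # False
# ... while a lone closing bracket or brace is accepted as a literal.
print(compiles(r"]"))   # True
print(compiles(r"}"))   # True

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    # "[[]" is a class containing "[", so "[[]]" is that class followed by a literal "]".
    nested = re.fullmatch(r"[[]]", "[]")
    explicit = re.fullmatch(r"[\[]\]", "[]")

print(nested is not None, explicit is not None)   # True True
```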
Everywhere in a regular expression, whatever the engine, a parenthesis must be escaped to match a literal character, and that includes the closing parenthesis. POSIX regular expressions are the exception:
)
The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.
The interesting part is that POSIX defines separately when a right parenthesis should be treated as a special character. It has no such definition for `}` or `]`.
Why don't other engines follow this rule?
Call it implementation peculiarities, or historical reasons that have something to do with Perl, as a comment in the PCRE source code suggests:
/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */
It seems that, with all the special clusters in more advanced engines, always treating a closing parenthesis as a special character costs much less than implementing the POSIX rule.
The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:
If you forget to escape a special character where its use is not allowed, such as in `+1`, then you will get an error message.
Most regular expression flavors treat the brace `{` as a literal character, unless it is part of a repetition operator like `a{1,3}`. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
`]` is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.
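To see the quoted rules in one concrete flavor, here is a small sketch with Python's `re` module. Python is my own pick for illustration, and the flavors named in the quote as exceptions (Java, Boost, std::regex, Ruby) will not all agree with it.

```python
import re

# A quantifier with nothing to repeat is a syntax error.
try:
    re.compile(r"+1")
    plus_error = None
except re.error as exc:
    plus_error = str(exc)
print(plus_error)

# Escaped, the plus sign is an ordinary literal.
print(re.fullmatch(r"\+1", "+1") is not None)        # True

# Braces act as a repetition operator only when the whole {min,max} shape is present.
print(re.fullmatch(r"a{1,3}", "aa") is not None)     # True: one to three "a"
print(re.fullmatch(r"a{1,3", "a{1,3") is not None)   # True: an unclosed brace is literal text
print(re.fullmatch(r"a]", "a]") is not None)         # True: "]" outside a class is literal
```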
It seems like he uses "needs to be escaped" as his definition for "special character", and unlike `)`, the `]` and `}` characters need not be escaped in most flavours.
That said, you wouldn't be wrong calling them special characters as well. It's definitely a best practice to always escape them, and in no flavour do `\]` and `\}` mean anything other than a literal `]` or `}`.
On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow `[` and `{` respectively. There are similar cases: `:=><!#'&,` all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.
And while we could say the same about `)`, almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore `)` is considered a special character.
The regex flavors in my book do not require `}` and `]` to be escaped (except for `]` in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.
First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.
What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, `)` is treated as a token that closes a group. It is a syntax error if used without a corresponding `(`. So `)` has a special meaning when used all on its own.
`}` does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like `{7}` or `{7,42}` literally, you only need to escape the opening `{`. If you want to argue that `}` is special because it sometimes has a special meaning, then you would have to say the same about `,` which becomes special in the same situation.
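As a rough illustration of escaping only the opening brace, here is a sketch in Python's `re` module (my choice of flavor for the example; the leading `x` is just filler text I added so the brace has something in front of it):

```python
import re

# Only the opening brace is escaped; "}" and "," are plain literals here.
literal_count = re.compile(r"x\{7}")
literal_range = re.compile(r"x\{7,42}")

print(literal_count.fullmatch("x{7}") is not None)      # True
print(literal_range.fullmatch("x{7,42}") is not None)   # True

# Without the backslash the same text is a quantifier on "x".
print(re.fullmatch(r"x{7}", "x" * 7) is not None)       # True
print(re.fullmatch(r"x{7}", "x{7}") is not None)        # False
```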
`]` does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (`\`, `]`, `^`, and `-`) discussed in a later chapter.
Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape `}`. I escape `]` in character classes when using JavaScript because that's the only way. But with other flavors I place `]` at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
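The placement trick for `]` can be sketched as follows, again in Python's `re` module as an assumed example flavor (as noted above, JavaScript does not allow this):

```python
import re

# "]" placed first in the class, or right after the negating caret, is a literal.
closing_first = re.compile(r"[]ab]")
negated = re.compile(r"[^]ab]")
# The escaped form matches the same characters as the first class.
escaped = re.compile(r"[ab\]]")

print(closing_first.findall("a]bc"))   # ['a', ']', 'b']
print(negated.findall("a]bc"))         # ['c']
print(escaped.findall("a]bc"))         # ['a', ']', 'b']
```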
I often see people new to regular expressions needlessly escape characters like `"`, `'`, or `/` because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
I even see people escape characters like `<` or `>`. This is a bad habit because in some regex flavors `\<` and `\>` are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).
But, if you find it confusing to see unescaped `}` and `]` used as literals, you are free to escape them in your regexes. Except for `<` and `>`, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
So somebody saying that `}` and `]` are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include `,` (quantifier), `:` (non-capturing group), `-` (mode modifier), `!` (negative lookaround), `<` (lookbehind), and `-` (character class range).
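To make the "special only in combination" point concrete, here is a sketch in Python's `re` module (an assumed example flavor; the sample strings are made up for illustration):

```python
import re

# On their own these characters are plain literals ...
print(re.fullmatch(r",:-!<", ",:-!<") is not None)      # True

# ... and they take on a meaning only inside a specific construct.
print(re.fullmatch(r"a{1,2}", "aa") is not None)        # "," inside a quantifier
print(re.fullmatch(r"(?:ab)+", "abab") is not None)     # ":" in a non-capturing group
print(re.fullmatch(r"(?i)a(?-i:b)", "Ab") is not None)  # "-" in a mode modifier
print(re.search(r"foo(?!bar)", "foobaz") is not None)   # "!" in a negative lookahead
print(re.search(r"(?<=\$)\d+", "cost $42").group())     # "<" in a lookbehind, prints 42
print(re.fullmatch(r"[a-c]+", "abc") is not None)       # "-" as a character class range
```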
But if "special characters" means "characters that have a special meaning on their own", then `}` and `]` are not included in the list for the flavors my book covers.