I recently noticed that, String.replaceAll(regex,replacement) behaves very weirdly when it comes to the escape-character \"\\\"(slash)
For example consider there is
One way around this problem is to replace backslash with another character, use that stand-in character for intermediate replacements, then convert it back into backslash at the end. For example, to convert "\r\n" to "\n":
String out = in.replace('\\','@').replaceAll("@r@n","@n").replace('@','\\');
Of course, that won't work very well if you choose a replacement character that can occur in the input string.
1) Let's say you want to replace a single \
using Java's replaceAll
method:
\
˪--- 1) the final backslash
2) Java's replaceAll
method takes a regex as first argument. In a regex literal, \
has a special meaning, e.g. in \d
which is a shortcut for [0-9]
(any digit). The way to escape a metachar in a regex literal is to precede it with a \
, which leads to:
\ \
| ˪--- 1) the final backslash
|
˪----- 2) the backslash needed to escape 1) in a regex literal
3) In Java, there is no regex literal: you write a regex in a string literal (unlike JavaScript for example, where you can write /\d+/
). But in a string literal, \
also has a special meaning, e.g. in \n
(a new line) or \t
(a tab). The way to escape a metachar in a string literal is to precede it with a \
, which leads to:
\\\\
|||˪--- 1) the final backslash
||˪---- 3) the backslash needed to escape 1) in a string literal
|˪----- 2) the backslash needed to escape 1) in a regex literal
˪------ 3) the backslash needed to escape 2) in a string literal
@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character in both Java string literals, and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.
But why is it like that?
It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. Awkwardness of double escaping didn't become apparent in Java until they added regex support in the form of the Pattern
class ... in Java 1.4.
So how do other languages manage to avoid this?
They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, Javascript and many other languages, there is a syntax for patterns / regexs (e.g. '/pattern/') where string literal escaping rules do not apply. In C# and Python, they provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)
Why do
text.replaceAll("\n","/")
,text.replaceAll("\\n","/")
, andtext.replaceAll("\\\n","/")
all give the same output?
The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.
The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.
The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However in the regex language, a backslash followed by any non-alphabetic character means the latter character. So, a backslash followed by a newline character ... means the same thing as a newline.
You need to esacpe twice, once for Java, once for the regex.
Java code is
"\\\\"
makes a regex string of
"\\" - two chars
but the regex needs an escape too so it turns into
\ - one symbol
This is because Java tries to give \
a special meaning in the replacement string, so that \$ will be a literal $ sign, but in the process they seem to have removed the actual special meaning of \
While text.replaceAll("\\\\","/")
, at least can be considered to be okay in some sense (though it itself is not absolutely right), all the three executions, text.replaceAll("\n","/")
, text.replaceAll("\\n","/")
, text.replaceAll("\\\n","/")
giving same output seem even more funny. It is just contradicting as to why they have restricted the functioning of text.replaceAll("\\","/")
for the same reason.
Java didn't mess up with regular expressions. It is because, Java likes to mess up with coders by trying to do something unique and different, when it is not at all required.
I think java really messed with regular expression in String.replaceAll();
Other than java I have never seen a language parse regular expression this way. You will be confused if you have used regex in some other languages.
In case of using the "\\"
in replacement string, you can use java.util.regex.Matcher.quoteReplacement(String)
String.replaceAll("/", Matcher.quoteReplacement("\\"));
By using this Matcher
class you can get the expected result.