I'm processing some Java source code using Java. I'm extracting the string literals and feeding them to a function taking a String. The problem is that I need to pass the
You can use the String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.
Here's an example snippet:
String in = "a\\tb\\n\\\"c\\\"";
System.out.println(in);
// a\tb\n\"c\"
String out = StringEscapeUtils.unescapeJava(in);
System.out.println(out);
// a b
// "c"
The utility class has methods to escape and unescape strings for Java, JavaScript, HTML, XML, and SQL. It also has overloads that write directly to a java.io.Writer.
It looks like StringEscapeUtils handles Unicode escapes with one u, but not octal escapes or Unicode escapes with extraneous u's.
/* Unicode escape test #1: PASS */
System.out.println(
"\u0030"
); // 0
System.out.println(
StringEscapeUtils.unescapeJava("\\u0030")
); // 0
System.out.println(
"\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
); // true
/* Octal escape test: FAIL */
System.out.println(
"\45"
); // %
System.out.println(
StringEscapeUtils.unescapeJava("\\45")
); // 45
System.out.println(
"\45".equals(StringEscapeUtils.unescapeJava("\\45"))
); // false
/* Unicode escape test #2: FAIL */
System.out.println(
"\uu0030"
); // 0
System.out.println(
StringEscapeUtils.unescapeJava("\\uu0030")
); // throws NestableRuntimeException:
// Unable to parse unicode value: u003
A quote from the JLS:
Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.
If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.
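As a rough sketch of that preprocessing step, the class below rewrites octal escapes as single-u Unicode escapes before the string is handed to unescapeJava. The names OctalEscapes and toUnicodeEscapes are made up for illustration and are not part of Commons Lang. It assumes the input is the body of a string literal (the text between the quotes) and follows the JLS octal grammar: up to three digits if the first is 0-3, otherwise up to two.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Commons Lang.
final class OctalEscapes {
    // A backslash followed by either an octal escape body (group 1)
    // or any other single character. Consuming escapes pairwise keeps
    // an escaped backslash followed by digits from being misread.
    private static final Pattern ESCAPE =
        Pattern.compile("\\\\(?:([0-3][0-7]{2}|[0-7]{1,2})|.)", Pattern.DOTALL);

    static String toUnicodeEscapes(String literalBody) {
        Matcher m = ESCAPE.matcher(literalBody);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String octal = m.group(1);
            // Non-octal escapes are copied through unchanged.
            String replacement = (octal == null)
                ? m.group()
                : String.format("\\u%04x", Integer.parseInt(octal, 8));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The octal escape for '%' becomes the Unicode escape for U+0025.
        System.out.println(toUnicodeEscapes("\\45"));
        // An escaped backslash followed by "45" is left alone.
        System.out.println(toUnicodeEscapes("\\\\45"));
    }
}
```

The result can then be passed to StringEscapeUtils.unescapeJava, which handles the single-u form.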
The extraneous u is also documented as follows:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each. This transformed version is equally acceptable to a compiler for the Java programming language and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
If your string can contain Unicode escapes with extraneous u's, then you may also need to preprocess it before using StringEscapeUtils.
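One possible way to do that preprocessing is to collapse every run of u's down to a single u, which leaves the denoted character unchanged. This is only a sketch: ExtraUCollapser and collapse are hypothetical names, and the input is assumed to be the body of a string literal. A backslash that is itself escaped does not start a Unicode escape, so the pattern consumes escapes pairwise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Commons Lang.
final class ExtraUCollapser {
    // A backslash followed by either one or more 'u' plus four hex digits
    // (group 1 captures the digits) or any other single character.
    private static final Pattern ESCAPE =
        Pattern.compile("\\\\(?:u+([0-9a-fA-F]{4})|.)", Pattern.DOTALL);

    static String collapse(String literalBody) {
        Matcher m = ESCAPE.matcher(literalBody);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String hex = m.group(1);
            // Unicode escapes are rewritten with exactly one 'u';
            // every other escape is copied through unchanged.
            String replacement = (hex == null) ? m.group() : "\\u" + hex;
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two u's in, one u out.
        System.out.println(collapse("\\uu0030"));
        // Unchanged: the backslash before the u's is itself escaped.
        System.out.println(collapse("\\\\uu0030"));
    }
}
```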
Alternatively, you can try to write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.
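A minimal sketch of what such an unescaper might look like is shown here. It has not been checked against every corner of the JLS; JavaLiteralUnescaper is a hypothetical name, the input is assumed to be the body of a string literal, and only the classic escapes (b, t, n, f, r, quotes, backslash, octal) are handled, so newer additions such as the space escape or text blocks are out of scope. It works in two passes, mirroring the compiler: Unicode escapes first, then escape sequences.

```java
// Hypothetical class: a sketch, not a complete JLS implementation.
final class JavaLiteralUnescaper {

    /** Unescapes the body of a string literal (the text between the quotes). */
    static String unescape(String body) {
        return translateEscapeSequences(translateUnicodeEscapes(body));
    }

    // Pass 1: Unicode escapes (backslash, one or more 'u', four hex digits).
    static String translateUnicodeEscapes(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            while (j < s.length() && s.charAt(j) == 'u') {
                j++;
            }
            if (j == i + 1) {
                // Not a Unicode escape: copy the backslash and the character
                // after it as a pair, so an escaped backslash is skipped whole.
                out.append(c);
                if (j < s.length()) {
                    out.append(s.charAt(j));
                }
                i = j + 1;
                continue;
            }
            String hex = (j + 4 <= s.length()) ? s.substring(j, j + 4) : "";
            if (!hex.matches("[0-9a-fA-F]{4}")) {
                throw new IllegalArgumentException("Malformed Unicode escape at index " + i);
            }
            out.append((char) Integer.parseInt(hex, 16));
            i = j + 4;
        }
        return out.toString();
    }

    // Pass 2: single-character and octal escape sequences.
    static String translateEscapeSequences(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i == s.length()) {
                throw new IllegalArgumentException("Dangling backslash");
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                default:
                    if (e < '0' || e > '7') {
                        throw new IllegalArgumentException("Invalid escape character: " + e);
                    }
                    // Octal: up to three digits if the first is 0-3, else up to two.
                    int max = (e <= '3') ? 3 : 2;
                    int value = e - '0';
                    int digits = 1;
                    while (digits < max && i < s.length()
                            && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                        value = value * 8 + (s.charAt(i++) - '0');
                        digits++;
                    }
                    out.append((char) value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\45"));     // prints %
        System.out.println(unescape("\\uu0030")); // prints 0
    }
}
```

Unlike the StringEscapeUtils tests above, this sketch accepts both the octal escape and the doubled-u escape, and it throws IllegalArgumentException on input it does not recognize rather than guessing.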