How to unescape a Java string literal in Java?

庸人自扰 2020-11-22 01:35

I'm processing some Java source code using Java. I'm extracting the string literals and feeding them to a function taking a String. The problem is that I need to pass the

11 Answers
  •  渐次进展
    2020-11-22 02:20

    You can use the String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.

    Here's an example snippet:

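        // Requires Apache Commons Lang 2.x on the classpath:
        //   import org.apache.commons.lang.StringEscapeUtils;
        String in = "a\\tb\\n\\\"c\\\"";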
    
        System.out.println(in);
        // a\tb\n\"c\"
    
        String out = StringEscapeUtils.unescapeJava(in);
    
        System.out.println(out);
        // a    b
        // "c"
    

    The utility class has methods to escape and unescape strings for Java, JavaScript, HTML, XML, and SQL. It also has overloads that write directly to a java.io.Writer.


    Caveats

    It looks like StringEscapeUtils handles Unicode escapes with a single u, but not octal escapes or Unicode escapes with extraneous u's.

        /* Unicode escape test #1: PASS */
        
        System.out.println(
            "\u0030"
        ); // 0
        System.out.println(
            StringEscapeUtils.unescapeJava("\\u0030")
        ); // 0
        System.out.println(
            "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
        ); // true
        
        /* Octal escape test: FAIL */
        
        System.out.println(
            "\45"
        ); // %
        System.out.println(
            StringEscapeUtils.unescapeJava("\\45")
        ); // 45
        System.out.println(
            "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
        ); // false
    
        /* Unicode escape test #2: FAIL */
        
        System.out.println(
            "\uu0030"
        ); // 0
        System.out.println(
            StringEscapeUtils.unescapeJava("\\uu0030")
        ); // throws NestableRuntimeException:
           //   Unable to parse unicode value: u003
    

    A quote from the JLS:

    Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

    If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.
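
    The octal-to-Unicode conversion suggested above could look roughly like the sketch below. The class and method names (OctalToUnicode, convert) are made up for illustration and are not part of Commons Lang. It scans escape pairs left to right so that an escaped backslash followed by digits is not mistaken for an octal escape, and it follows the JLS rule that a three-digit octal escape must start with 0 to 3.

```java
final class OctalToUnicode {

    /** Rewrites octal escapes (\0 to \377) in a literal body as four-digit Unicode escapes. */
    static String convert(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                // Ordinary character (or a trailing lone backslash): copy as-is
                out.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            if (next < '0' || next > '7') {
                // Some other escape pair, including an escaped backslash: copy both chars untouched
                out.append(c).append(next);
                i += 2;
                continue;
            }
            // Octal escape: up to 3 digits if the first digit is 0-3, otherwise up to 2
            int maxDigits = next <= '3' ? 3 : 2;
            int value = 0;
            int digits = 0;
            int j = i + 1;
            while (j < s.length() && digits < maxDigits
                    && s.charAt(j) >= '0' && s.charAt(j) <= '7') {
                value = value * 8 + (s.charAt(j) - '0');
                j++;
                digits++;
            }
            // Emit backslash, 'u', and four hex digits
            out.append(String.format("\\u%04x", value));
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal body made of backslash, 4, 5 is rewritten to the Unicode escape for '%'
        System.out.println(convert("\\45"));
    }
}
```

    The result can then be handed to StringEscapeUtils.unescapeJava, which handles the single-u Unicode form.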

    The extraneous u is also documented as follows:

    The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

    This transformed version is equally acceptable to a compiler for the Java programming language and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

    If your string can contain Unicode escapes with extraneous u's, then you may also need to preprocess it before using StringEscapeUtils.
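
    One possible preprocessing step is sketched below: it collapses any run of u's after an escaping backslash down to a single u, so that \uu0030 becomes \u0030. The names UnicodeMarkers and collapse are hypothetical. Walking the string in escape pairs keeps an escaped backslash followed by literal u's intact.

```java
final class UnicodeMarkers {

    /** Reduces backslash followed by several u's to backslash followed by one u. */
    static String collapse(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                // Ordinary character (or a trailing lone backslash): copy as-is
                out.append(c);
                i++;
                continue;
            }
            // Copy the escape pair (backslash plus the character after it)
            char next = s.charAt(i + 1);
            out.append(c).append(next);
            i += 2;
            if (next == 'u') {
                // Skip any additional u's belonging to the same Unicode escape
                while (i < s.length() && s.charAt(i) == 'u') {
                    i++;
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input is backslash, u, u, 0030; output has a single u
        System.out.println(collapse("\\uu0030"));
    }
}
```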

    Alternatively, you can write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.
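
    A from-scratch unescaper might start from something like the following sketch. The class name JavaLiteralUnescaper is invented for this example. It covers the classic escapes from JLS 3.10.6 (\b \t \n \f \r \" \' \\), octal escapes, and Unicode escapes with any number of u's. It is a simplification: it works in a single pass, whereas the compiler translates Unicode escapes before it interprets the other escapes, so unusual inputs such as a Unicode-escaped backslash followed by n will not behave exactly as in javac.

```java
final class JavaLiteralUnescaper {

    /** Unescapes the body of a Java string literal (the text between the quotes). */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i >= s.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char e = s.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': case '\'': case '\\': out.append(e); break;
                case 'u': {
                    // Skip extraneous u's, then read exactly four hex digits
                    while (i < s.length() && s.charAt(i) == 'u') i++;
                    if (i + 4 > s.length()) {
                        throw new IllegalArgumentException("truncated Unicode escape");
                    }
                    int value = 0;
                    for (int k = 0; k < 4; k++) {
                        value = value * 16 + hex(s.charAt(i + k));
                    }
                    out.append((char) value);
                    i += 4;
                    break;
                }
                default: {
                    if (e < '0' || e > '7') {
                        throw new IllegalArgumentException("unknown escape: \\" + e);
                    }
                    // Octal: up to 3 digits if the first is 0-3, otherwise up to 2
                    int max = e <= '3' ? 3 : 2;
                    int value = e - '0';
                    int digits = 1;
                    while (digits < max && i < s.length()
                            && s.charAt(i) >= '0' && s.charAt(i) <= '7') {
                        value = value * 8 + (s.charAt(i++) - '0');
                        digits++;
                    }
                    out.append((char) value);
                }
            }
        }
        return out.toString();
    }

    /** Returns the value of an ASCII hex digit, or throws if c is not one. */
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("not a hex digit: " + c);
    }

    public static void main(String[] args) {
        // Octal escape for '%' followed by a double-u Unicode escape for '0'
        System.out.println(unescape("\\45 \\uu0030"));
    }
}
```

    Malformed input is rejected with an IllegalArgumentException rather than passed through; whether to be strict or lenient here is a design choice that depends on how trustworthy the extracted literals are.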

    References

    • JLS 3.3 Unicode Escapes
    • JLS 3.10.6 Escape Sequences for Character and String Literals
