How to unescape a Java string literal in Java?

庸人自扰 2020-11-22 01:35

I'm processing some Java source code using Java. I'm extracting the string literals and feeding them to a function taking a String. The problem is that I need to pass the

11 answers
  • 2020-11-22 02:14

    org.apache.commons.lang3.StringEscapeUtils from commons-lang3 is marked deprecated now. You can use org.apache.commons.text.StringEscapeUtils#unescapeJava(String) instead. It requires an additional Maven dependency:

            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-text</artifactId>
                <version>1.4</version>
            </dependency>
    

    It also seems to handle more special cases. For example, it unescapes:

    • escaped backslashes, single and double quotes
    • escaped octal and unicode values
    • \b, \n, \t, \f, \r
  • 2020-11-22 02:15

    I know this question is old, but I wanted a solution that doesn't involve libraries outside those included with JRE 6 (i.e. Apache Commons is not acceptable), and I came up with a simple solution using the built-in java.io.StreamTokenizer:

    import java.io.*;
    
    // ...
    
    String literal = "\"Has \\\"\\\\\\\t\\\" & isn\\\'t \\\r\\\n on 1 line.\"";
    StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
    String result;
    try {
      parser.nextToken();
      if (parser.ttype == '"') {
        result = parser.sval;
      }
      else {
        result = "ERROR!";
      }
    }
    catch (IOException e) {
      result = e.toString();
    }
    System.out.println(result);
    

    Output:

    Has "\  " & isn't
     on 1 line.
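
    The snippet above can be wrapped in a small helper. The following is a minimal sketch rather than part of the original answer: the class name TokenizerUnescape and the method unescape are hypothetical, and it assumes the input is a complete double-quoted literal. StreamTokenizer does not appear to decode Unicode escapes, so those would need separate handling.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

// Hypothetical helper; the names are chosen for illustration only.
final class TokenizerUnescape {

    // Expects the literal text including its surrounding double quotes.
    static String unescape(String quotedLiteral) {
        StreamTokenizer parser = new StreamTokenizer(new StringReader(quotedLiteral));
        try {
            parser.nextToken();
        } catch (IOException e) {
            // A StringReader should not fail, but the API declares IOException.
            throw new IllegalStateException(e);
        }
        if (parser.ttype != '"') {
            throw new IllegalArgumentException("not a double-quoted literal: " + quotedLiteral);
        }
        return parser.sval;
    }

    public static void main(String[] args) {
        // The argument is the 12-character text "a\tb\"c\"" with literal backslashes.
        System.out.println(unescape("\"a\\tb\\\"c\\\"\""));
    }
}
```

    Throwing on a non-string token, instead of returning "ERROR!", keeps a malformed input from being mistaken for a valid result.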
    
  • 2020-11-22 02:16

    The Problem

    The org.apache.commons.lang.StringEscapeUtils.unescapeJava() given here as another answer is really very little help at all.

    • It forgets about \0 for null.
    • It doesn’t handle octal at all.
    • It can’t handle the sorts of escapes admitted by java.util.regex.Pattern.compile() and everything that uses it, including \a, \e, and especially \cX.
    • It has no support for logical Unicode code points by number, only for UTF-16.
    • This looks like UCS-2 code, not UTF-16 code: they use the depreciated charAt interface instead of the codePoint interface, thus promulgating the delusion that a Java char is guaranteed to hold a Unicode character. It’s not. They only get away with this because no UTF-16 surrogate will wind up looking like anything they’re looking for.
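
    The last point can be seen with any character outside the Basic Multilingual Plane. A minimal sketch (the class name CodePointDemo and its methods are hypothetical) showing that such a character occupies two char values, and that String.codePoints() (Java 8 and later) is one way to iterate by code point:

```java
final class CodePointDemo {

    // Counts logical code points rather than UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns the code points of s, one int per logical character.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef), built from its code point so that
        // no surrogate pair has to be embedded in the source file.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                            // 2
        System.out.println(codePointLength(clef));                    // 1
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e
    }
}
```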

    The Solution

    I wrote a string unescaper which solves the OP’s question without all the irritations of the Apache code.

    /*
     *
     * unescape_perl_string()
     *
     *      Tom Christiansen <tchrist@perl.com>
     *      Sun Nov 28 12:55:24 MST 2010
     *
     * It's completely ridiculous that there's no standard
     * unescape_java_string function.  Since I have to do the
     * damn thing myself, I might as well make it halfway useful
     * by supporting things Java was too stupid to consider in
     * strings:
     * 
     *   => "?" items  are additions to Java string escapes
     *                 but normal in Java regexes
     *
     *   => "!" items  are also additions to Java regex escapes
     *   
     * Standard singletons: ?\a ?\e \f \n \r \t
     * 
     *      NB: \b is unsupported as backspace so it can pass-through
     *          to the regex translator untouched; I refuse to make anyone
     *          doublebackslash it as doublebackslashing is a Java idiocy
     *          I desperately wish would die out.  There are plenty of
     *          other ways to write it:
     *
     *              \cH, \12, \012, \x08 \x{8}, \u0008, \U00000008
     *
     * Octal escapes: \0 \0N \0NN \N \NN \NNN
     *    Can range up to !\777 not \377
     *    
     *      TODO: add !\o{NNNNN}
     *          last Unicode is 4177777
     *          maxint is 37777777777
     *
     * Control chars: ?\cX
     *      Means: ord(X) ^ ord('@')
     *
     * Old hex escapes: \xXX
     *      unbraced must be 2 xdigits
     *
     * Perl hex escapes: !\x{XXX} braced may be 1-8 xdigits
     *       NB: proper Unicode never needs more than 6, as highest
     *           valid codepoint is 0x10FFFF, not maxint 0xFFFFFFFF
     *
     * Lame Java escape: \[IDIOT JAVA PREPROCESSOR]uXXXX must be
     *                   exactly 4 xdigits;
     *
     *       I can't write XXXX in this comment where it belongs
     *       because the damned Java Preprocessor can't mind its
     *       own business.  Idiots!
     *
     * Lame Python escape: !\UXXXXXXXX must be exactly 8 xdigits
     * 
     * TODO: Perl translation escapes: \Q \U \L \E \[IDIOT JAVA PREPROCESSOR]u \l
     *       These are not so important to cover if you're passing the
     *       result to Pattern.compile(), since it handles them for you
     *       further downstream.  Hm, what about \[IDIOT JAVA PREPROCESSOR]u?
     *
     */
    
    public final static
    String unescape_perl_string(String oldstr) {
    
        /*
         * In contrast to fixing Java's broken regex charclasses,
         * this one need be no bigger, as unescaping shrinks the string
         * here, where in the other one, it grows it.
         */
    
        StringBuffer newstr = new StringBuffer(oldstr.length());
    
        boolean saw_backslash = false;
    
        for (int i = 0; i < oldstr.length(); i++) {
            int cp = oldstr.codePointAt(i);
            if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
                i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
            }
    
            if (!saw_backslash) {
                if (cp == '\\') {
                    saw_backslash = true;
                } else {
                    newstr.append(Character.toChars(cp));
                }
                continue; /* switch */
            }
    
            if (cp == '\\') {
                saw_backslash = false;
                newstr.append('\\');
                newstr.append('\\');
                continue; /* switch */
            }
    
            switch (cp) {
    
                case 'r':  newstr.append('\r');
                           break; /* switch */
    
                case 'n':  newstr.append('\n');
                           break; /* switch */
    
                case 'f':  newstr.append('\f');
                           break; /* switch */
    
                /* PASS a \b THROUGH!! */
                case 'b':  newstr.append("\\b");
                           break; /* switch */
    
                case 't':  newstr.append('\t');
                           break; /* switch */
    
                case 'a':  newstr.append('\007');
                           break; /* switch */
    
                case 'e':  newstr.append('\033');
                           break; /* switch */
    
                /*
                 * A "control" character is what you get when you xor its
                 * codepoint with '@'==64.  This only makes sense for ASCII,
                 * and may not yield a "control" character after all.
                 *
                 * Strange but true: "\c{" is ";", "\c}" is "=", etc.
                 */
                case 'c':   {
                    if (++i == oldstr.length()) { die("trailing \\c"); }
                    cp = oldstr.codePointAt(i);
                    /*
                     * don't need to grok surrogates, as next line blows them up
                     */
                    if (cp > 0x7f) { die("expected ASCII after \\c"); }
                    newstr.append(Character.toChars(cp ^ 64));
                    break; /* switch */
                }
    
                case '8':
                case '9': die("illegal octal digit");
                          /* NOTREACHED */
    
        /*
         * may be 0 to 2 octal digits following this one
         * so back up one for fallthrough to next case;
         * unread this digit and fall through to next case.
         */
                case '1':
                case '2':
                case '3':
                case '4':
                case '5':
                case '6':
                case '7': --i;
                          /* FALLTHROUGH */
    
                /*
                 * Can have 0, 1, or 2 octal digits following a 0
                 * this permits larger values than octal 377, up to
                 * octal 777.
                 */
                case '0': {
                    if (i+1 == oldstr.length()) {
                        /* found \0 at end of string */
                        newstr.append(Character.toChars(0));
                        break; /* switch */
                    }
                    i++;
                    int digits = 0;
                    int j;
                    for (j = 0; j <= 2; j++) {
                        if (i+j == oldstr.length()) {
                            break; /* for */
                        }
                        /* safe because will unread surrogate */
                        int ch = oldstr.charAt(i+j);
                        if (ch < '0' || ch > '7') {
                            break; /* for */
                        }
                        digits++;
                    }
                    if (digits == 0) {
                        --i;
                        newstr.append('\0');
                        break; /* switch */
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt(
                                    oldstr.substring(i, i+digits), 8);
                    } catch (NumberFormatException nfe) {
                        die("invalid octal value for \\0 escape");
                    }
                    newstr.append(Character.toChars(value));
                    i += digits-1;
                    break; /* switch */
                } /* end case '0' */
    
                case 'x':  {
                    if (i+2 >= oldstr.length()) {
                        die("string too short for \\x escape");
                    }
                    i++;
                    boolean saw_brace = false;
                    if (oldstr.charAt(i) == '{') {
                            /* ^^^^^^ ok to ignore surrogates here */
                        i++;
                        saw_brace = true;
                    }
                    int j;
                    for (j = 0; j < 8; j++) {
    
                        if (!saw_brace && j == 2) {
                            break;  /* for */
                        }
    
                        /*
                         * ASCII test also catches surrogates
                         */
                        int ch = oldstr.charAt(i+j);
                        if (ch > 127) {
                            die("illegal non-ASCII hex digit in \\x escape");
                        }
    
                        if (saw_brace && ch == '}') { break; /* for */ }
    
                        if (! ( (ch >= '0' && ch <= '9')
                                    ||
                                (ch >= 'a' && ch <= 'f')
                                    ||
                                (ch >= 'A' && ch <= 'F')
                              )
                           )
                        {
                            die(String.format(
                                "illegal hex digit #%d '%c' in \\x", ch, ch));
                        }
    
                    }
                    if (j == 0) { die("empty braces in \\x{} escape"); }
                    int value = 0;
                    try {
                        value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                    } catch (NumberFormatException nfe) {
                        die("invalid hex value for \\x escape");
                    }
                    newstr.append(Character.toChars(value));
                    if (saw_brace) { j++; }
                    i += j-1;
                    break; /* switch */
                }
    
                case 'u': {
                    if (i+4 >= oldstr.length()) {
                        die("string too short for \\u escape");
                    }
                    i++;
                    int j;
                    for (j = 0; j < 4; j++) {
                        /* this also handles the surrogate issue */
                        if (oldstr.charAt(i+j) > 127) {
                            die("illegal non-ASCII hex digit in \\u escape");
                        }
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt( oldstr.substring(i, i+j), 16);
                    } catch (NumberFormatException nfe) {
                        die("invalid hex value for \\u escape");
                    }
                    newstr.append(Character.toChars(value));
                    i += j-1;
                    break; /* switch */
                }
    
                case 'U': {
                    if (i+8 >= oldstr.length()) {
                        die("string too short for \\U escape");
                    }
                    i++;
                    int j;
                    for (j = 0; j < 8; j++) {
                        /* this also handles the surrogate issue */
                        if (oldstr.charAt(i+j) > 127) {
                            die("illegal non-ASCII hex digit in \\U escape");
                        }
                    }
                    int value = 0;
                    try {
                        value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                    } catch (NumberFormatException nfe) {
                        die("invalid hex value for \\U escape");
                    }
                    newstr.append(Character.toChars(value));
                    i += j-1;
                    break; /* switch */
                }
    
                default:   newstr.append('\\');
                           newstr.append(Character.toChars(cp));
               /*
                * say(String.format(
                *       "DEFAULT unrecognized escape %c passed through",
                *       cp));
                */
                           break; /* switch */
    
            }
            saw_backslash = false;
        }
    
        /* weird to leave one at the end */
        if (saw_backslash) {
            newstr.append('\\');
        }
    
        return newstr.toString();
    }
    
    /*
     * Return a string "U+XX.XXX.XXXX" etc, where each XX set is the
     * xdigits of the logical Unicode code point. No bloody brain-damaged
     * UTF-16 surrogate crap, just true logical characters.
     */
     public final static
     String uniplus(String s) {
         if (s.length() == 0) {
             return "";
         }
         /* This is just the minimum; sb will grow as needed. */
         StringBuffer sb = new StringBuffer(2 + 3 * s.length());
         sb.append("U+");
         for (int i = 0; i < s.length(); i++) {
             sb.append(String.format("%X", s.codePointAt(i)));
             if (s.codePointAt(i) > Character.MAX_VALUE) {
                 i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
             }
             if (i+1 < s.length()) {
                 sb.append(".");
             }
         }
         return sb.toString();
     }
    
    private static final
    void die(String foa) {
        throw new IllegalArgumentException(foa);
    }
    
    private static final
    void say(String what) {
        System.out.println(what);
    }
    

    If it helps others, you’re welcome to it — no strings attached. If you improve it, I’d love for you to mail me your enhancements, but you certainly don’t have to.
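
    The \cX rule from the comments above (XOR the code point with '@' == 64) can be checked in isolation. This is a minimal sketch; the class name ControlEscape and the method control are hypothetical and not part of the code above:

```java
final class ControlEscape {

    // Maps the X of a \cX escape to its character: X ^ '@' (64), ASCII only.
    static char control(char x) {
        if (x > 0x7f) {
            throw new IllegalArgumentException("expected ASCII after \\c");
        }
        return (char) (x ^ 64);
    }

    public static void main(String[] args) {
        System.out.println((int) control('H')); // 8, backspace
        System.out.println((int) control('I')); // 9, tab
        System.out.println(control('{'));       // ;
        System.out.println(control('}'));       // =
    }
}
```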

  • 2020-11-22 02:18

    See this from http://commons.apache.org/lang/:

    StringEscapeUtils

    StringEscapeUtils.unescapeJava(String str)

  • 2020-11-22 02:20

    You can use the String unescapeJava(String) method of StringEscapeUtils from Apache Commons Lang.

    Here's an example snippet:

        String in = "a\\tb\\n\\\"c\\\"";
    
        System.out.println(in);
        // a\tb\n\"c\"
    
        String out = StringEscapeUtils.unescapeJava(in);
    
        System.out.println(out);
        // a    b
        // "c"
    

    The utility class has methods to escape and unescape strings for Java, JavaScript, HTML, XML, and SQL. It also has overloads that write directly to a java.io.Writer.


    Caveats

    It looks like StringEscapeUtils handles Unicode escapes with one u, but not octal escapes, or Unicode escapes with extraneous u characters.

        /* Unicode escape test #1: PASS */
        
        System.out.println(
            "\u0030"
        ); // 0
        System.out.println(
            StringEscapeUtils.unescapeJava("\\u0030")
        ); // 0
        System.out.println(
            "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
        ); // true
        
        /* Octal escape test: FAIL */
        
        System.out.println(
            "\45"
        ); // %
        System.out.println(
            StringEscapeUtils.unescapeJava("\\45")
        ); // 45
        System.out.println(
            "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
        ); // false
    
        /* Unicode escape test #2: FAIL */
        
        System.out.println(
            "\uu0030"
        ); // 0
        System.out.println(
            StringEscapeUtils.unescapeJava("\\uu0030")
        ); // throws NestableRuntimeException:
           //   Unable to parse unicode value: u003
    

    A quote from the JLS:

    Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.

    If your string can contain octal escapes, you may want to convert them to Unicode escapes first, or use another approach.
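
    One way to do that conversion is a regex pass that consumes escapes pairwise, so an escaped backslash followed by digits is left alone. This is a minimal sketch under that assumption; the class name OctalToUnicode and the method convert are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OctalToUnicode {

    // A backslash followed either by an octal escape as defined in the JLS
    // (group 1: up to three digits, the first at most 3 when there are three)
    // or by any other single character.
    private static final Pattern ESCAPE =
        Pattern.compile("(?s)\\\\(?:([0-3][0-7]{2}|[0-7]{1,2})|.)");

    // Rewrites each octal escape as the equivalent four-digit Unicode escape.
    static String convert(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            String replacement = m.group(); // non-octal escapes pass through
            if (m.group(1) != null) {
                int value = Integer.parseInt(m.group(1), 8);
                replacement = String.format("\\u%04x", value);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument is the three-character text: backslash, 4, 5.
        System.out.println(convert("\\45"));
    }
}
```

    The result can then be passed to StringEscapeUtils.unescapeJava as in the example earlier in this answer.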

    The extraneous u is also documented as follows:

    The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u (for example, \uxxxx becomes \uuxxxx) while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.

    This transformed version is equally acceptable to a compiler for the Java programming language and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

    If your string can contain Unicode escapes with extraneous u, then you may also need to preprocess this before using StringEscapeUtils.
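
    Such a preprocessing step could collapse every run of u markers to a single u before unescaping. This is a minimal sketch; the class name UnicodeMarkerCollapser and the method collapse are hypothetical, and it only rewrites runs that are followed by four hex digits:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeMarkerCollapser {

    // A backslash followed either by two or more 'u' that introduce four hex
    // digits (group 1), or by any other single character. Consuming escapes
    // pairwise keeps an escaped backslash from being misread as an escape start.
    private static final Pattern ESCAPE =
        Pattern.compile("(?s)\\\\(?:(u{2,})(?=\\p{XDigit}{4})|.)");

    // Reduces each run of extra 'u' markers to a single 'u'.
    static String collapse(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            String replacement = (m.group(1) != null) ? "\\u" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument is the text: backslash, u, u, 0, 0, 3, 0.
        System.out.println(collapse("\\uu0030"));
    }
}
```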

    Alternatively you can try to write your own Java string literal unescaper from scratch, making sure to follow the exact JLS specifications.

    References

    • JLS 3.3 Unicode Escapes
    • JLS 3.10.6 Escape Sequences for Character and String Literals
  • 2020-11-22 02:24

    I came across the same problem, but I wasn't enamoured by any of the solutions I found here. So, I wrote one that iterates over the characters of the string using a matcher to find and replace the escape sequences. This solution assumes properly formatted input. That is, it happily skips over nonsensical escapes, and it decodes Unicode escapes for line feed and carriage return (which otherwise cannot appear in a character literal or a string literal, due to the definition of such literals and the order of translation phases for Java source). Apologies, the code is a bit packed for brevity.

    import java.util.Arrays;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class Decoder {
    
        // The encoded character of each character escape.
        // This array functions as the keys of a sorted map, from encoded characters to decoded characters.
        static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\',  'b',  'f',  'n',  'r',  't' };
    
        // The decoded character of each character escape.
        // This array functions as the values of a sorted map, from encoded characters to decoded characters.
        static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };
    
        // A pattern that matches an escape.
        // What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
        static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");
    
        public static CharSequence decodeString(CharSequence encodedString) {
            Matcher matcher = PATTERN.matcher(encodedString);
            StringBuffer decodedString = new StringBuffer();
            // Find each escape of the encoded string in succession.
            while (matcher.find()) {
                char ch;
                if (matcher.start(1) >= 0) {
                    // Decode a character escape.
                    ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
                } else if (matcher.start(2) >= 0) {
                    // Decode an octal escape.
                    ch = (char)(Integer.parseInt(matcher.group(2), 8));
                } else /* if (matcher.start(3) >= 0) */ {
                    // Decode a Unicode escape.
                    ch = (char)(Integer.parseInt(matcher.group(3), 16));
                }
                // Replace the escape with the decoded character.
                matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
            }
            // Append the remainder of the encoded string to the decoded string.
            // The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
            matcher.appendTail(decodedString);
            return decodedString;
        }
    
        public static void main(String... args) {
            System.out.println(decodeString(args[0]));
        }
    }
    

    I should note that Apache Commons Lang3 doesn't seem to suffer the weaknesses indicated in the accepted solution. That is, StringEscapeUtils seems to handle octal escapes and multiple u characters of Unicode escapes. That means unless you have some burning reason to avoid Apache Commons, you should probably use it rather than my solution (or any other solution here).
