How to unescape a Java string literal in Java?

庸人自扰 2020-11-22 01:35

I'm processing some Java source code using Java. I'm extracting the string literals and feeding them to a function taking a String. The problem is that I need to pass the

11 answers
  •  挽巷 (original poster)
    2020-11-22 02:24

    I came across the same problem, but I wasn't enamoured of any of the solutions I found here. So I wrote one that uses a Matcher to find each escape sequence in the string and replace it with the character it encodes. This solution assumes properly formatted input: it happily skips over nonsensical escapes (leaving them unchanged), and it decodes Unicode escapes for line feed and carriage return (which otherwise cannot appear in a character literal or a string literal, due to the definition of such literals and the order of translation phases for Java source). Apologies, the code is a bit packed for brevity.

    import java.util.Arrays;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class Decoder {
    
        // The encoded character of each character escape.
        // This array functions as the keys of a sorted map, from encoded characters to decoded characters.
        // It must stay in ascending order, because decodeString looks keys up with Arrays.binarySearch.
        static final char[] ENCODED_ESCAPES = { '\"', '\'', '\\',  'b',  'f',  'n',  'r',  't' };
    
        // The decoded character of each character escape.
        // This array functions as the values of a sorted map, from encoded characters to decoded characters.
        static final char[] DECODED_ESCAPES = { '\"', '\'', '\\', '\b', '\f', '\n', '\r', '\t' };
    
        // A pattern that matches an escape.
        // What follows the escape indicator is captured by group 1=character 2=octal 3=Unicode.
        static final Pattern PATTERN = Pattern.compile("\\\\(?:(b|t|n|f|r|\\\"|\\\'|\\\\)|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");
    
        public static CharSequence decodeString(CharSequence encodedString) {
            Matcher matcher = PATTERN.matcher(encodedString);
            StringBuffer decodedString = new StringBuffer();
            // Find each escape of the encoded string in succession.
            while (matcher.find()) {
                char ch;
                if (matcher.start(1) >= 0) {
                    // Decode a character escape.
                    ch = DECODED_ESCAPES[Arrays.binarySearch(ENCODED_ESCAPES, matcher.group(1).charAt(0))];
                } else if (matcher.start(2) >= 0) {
                    // Decode an octal escape.
                    ch = (char)(Integer.parseInt(matcher.group(2), 8));
                } else /* if (matcher.start(3) >= 0) */ {
                    // Decode a Unicode escape.
                    ch = (char)(Integer.parseInt(matcher.group(3), 16));
                }
                // Replace the escape with the decoded character.
                matcher.appendReplacement(decodedString, Matcher.quoteReplacement(String.valueOf(ch)));
            }
            // Append the remainder of the encoded string to the decoded string.
            // The remainder is the longest suffix of the encoded string such that the suffix contains no escapes.
            matcher.appendTail(decodedString);
            return decodedString;
        }
    
        public static void main(String... args) {
            System.out.println(decodeString(args[0]));
        }
    }
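
    To see the behaviour described above on concrete inputs, here is a minimal, self-contained sketch. It is not the Decoder class itself but a compact variant of the same regex approach, assuming Java 9 or later for Matcher.replaceAll(Function). The names DecoderDemo, ESCAPE, decode and decodeCharacterEscape are made up for this illustration, and a switch stands in for the two lookup arrays.

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class DecoderDemo {

        // Same escape grammar as the pattern above:
        // group 1 = character escape, group 2 = octal digits, group 3 = four hex digits.
        static final Pattern ESCAPE = Pattern.compile(
                "\\\\(?:([btnfr\"'\\\\])|((?:[0-3]?[0-7])?[0-7])|u+(\\p{XDigit}{4}))");

        static String decode(CharSequence encoded) {
            // Matcher.replaceAll(Function) needs Java 9 or later.
            return ESCAPE.matcher(encoded).replaceAll(m -> {
                char ch;
                if (m.group(1) != null) {
                    ch = decodeCharacterEscape(m.group(1).charAt(0));
                } else if (m.group(2) != null) {
                    ch = (char) Integer.parseInt(m.group(2), 8);
                } else {
                    ch = (char) Integer.parseInt(m.group(3), 16);
                }
                // Quote the result so a decoded backslash or dollar sign is taken literally.
                return Matcher.quoteReplacement(String.valueOf(ch));
            });
        }

        // Hypothetical helper: maps the character after the backslash to what it encodes.
        static char decodeCharacterEscape(char c) {
            switch (c) {
                case 'b': return '\b';
                case 'f': return '\f';
                case 'n': return '\n';
                case 'r': return '\r';
                case 't': return '\t';
                default:  return c; // quote, apostrophe and backslash stand for themselves
            }
        }

        public static void main(String[] args) {
            // Octal 101 and a Unicode escape with two u characters both decode;
            // the unknown escape at the end is left untouched.
            System.out.println(decode("\\101\\uu0042\\q")); // prints AB\q
        }
    }
    ```

    On Java 8 and earlier, the appendReplacement/appendTail loop shown in decodeString does the same job; replaceAll with a function is only a shorter spelling of it.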
    

    I should note that Apache Commons Lang3 doesn't seem to suffer from the weaknesses indicated in the accepted solution: its StringEscapeUtils.unescapeJava seems to handle both octal escapes and Unicode escapes with multiple u characters. So unless you have some burning reason to avoid Apache Commons, you should probably use it rather than my solution (or any other solution here).
