Java regex escaped characters

小鲜肉 2021-01-22 11:45

When matching certain characters (such as line feed), you can use the regex "\\n" or indeed just "\n". For example, the following splits a string into an array of lines:
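A minimal sketch of such a split, trying both spellings. The class name, method names, and sample string are hypothetical and not taken from the original post:

```java
import java.util.Arrays;

class SplitLinesDemo {
    // Splits on LF using the one-character string literal "\n":
    // the regex engine receives a raw line feed character.
    static String[] splitWithRawNewline(String text) {
        return text.split("\n");
    }

    // Splits on LF using the two-character string "\\n" (backslash, n):
    // the regex engine interprets the escape itself.
    static String[] splitWithRegexEscape(String text) {
        return text.split("\\n");
    }

    public static void main(String[] args) {
        String text = "first\nsecond\nthird"; // hypothetical sample input
        System.out.println(Arrays.toString(splitWithRawNewline(text)));  // [first, second, third]
        System.out.println(Arrays.toString(splitWithRegexEscape(text))); // [first, second, third]
    }
}
```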

2 Answers
  • 2021-01-22 11:52

    Yes, they are different. The Java compiler treats Unicode escapes specially, as described in section 3.3 of The Java Language Specification:

    The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
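    To make the compiler's role concrete, here is a small sketch (the class and method names are hypothetical) contrasting a Unicode escape that the compiler translates with one that is left for the regex engine. It assumes only that java.util.regex supports four-digit hexadecimal escapes, which the Pattern documentation lists:

    ```java
    import java.util.regex.Pattern;

    class UnicodeEscapeDemo {
        // The compiler translates this Unicode escape while reading the source,
        // so the literal is the one-character string "A".
        static String compilerTranslated() {
            return "\u0041";
        }

        // With a doubled backslash the compiler leaves the text alone: the string
        // holds six characters, and the regex engine interprets the escape.
        static String regexEscape() {
            return "\\u0041";
        }

        public static void main(String[] args) {
            System.out.println(compilerTranslated().length());               // 1
            System.out.println(regexEscape().length());                      // 6
            System.out.println(Pattern.matches(compilerTranslated(), "A"));  // true
            System.out.println(Pattern.matches(regexEscape(), "A"));         // true
        }
    }
    ```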

    So how does this affect \n vs \\n? From the Javadoc for java.util.regex.Pattern:

    It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler.

    An example from the same doc:

    The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
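    The quoted example can be checked with a short sketch. The class and method names below are hypothetical; the patterns are the ones from the quote:

    ```java
    import java.util.regex.Pattern;

    class BackslashBDemo {
        // "\b" is a one-character string (backspace, U+0008);
        // as a regex it matches that character literally.
        static boolean rawBackspaceFound(String input) {
            return Pattern.compile("\b").matcher(input).find();
        }

        // "\\b" is a backslash followed by b; as a regex it is the word-boundary assertion.
        static boolean wordBoundaryFound(String input) {
            return Pattern.compile("\\b").matcher(input).find();
        }

        // "\\(hello\\)" reaches the regex engine as \(hello\) and matches literal parentheses.
        static boolean isParenthesizedHello(String input) {
            return Pattern.matches("\\(hello\\)", input);
        }

        public static void main(String[] args) {
            System.out.println(rawBackspaceFound("a\bc"));       // true
            System.out.println(rawBackspaceFound("abc"));        // false
            System.out.println(wordBoundaryFound("abc"));        // true
            System.out.println(isParenthesizedHello("(hello)")); // true
            System.out.println(isParenthesizedHello("hello"));   // false
        }
    }
    ```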

  • 2021-01-22 12:08

    There is no difference in the current scenario. The usual string escape sequences are formed with the help of a single backslash and then a valid escape char ("\n", "\r", etc.) and regex escape sequences are formed with the help of a literal backslash (that is, a double backslash in the Java string literal) and a valid regex escape char ("\\n", "\\d", etc.).

    "\n" (an escape sequence) is a literal LF (newline) and "\\n" is a regex escape sequence that matches an LF symbol.

    "\r" (an escape sequence) is a literal CR (carriage return) and "\\r" is a regex escape sequence that matches a CR symbol.

    "\t" (an escape sequence) is a literal tab symbol and "\\t" is a regex escape sequence that matches a tab symbol.

    See the list in the Java regex docs for the supported list of regex escapes.
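    The two layers can be seen by looking at the strings the regex engine actually receives. This is a minimal sketch with hypothetical class and method names:

    ```java
    import java.util.regex.Pattern;

    class EscapeLayersDemo {
        // String escape: the compiler produces a single LF character.
        static int rawLength() {
            return "\n".length();
        }

        // Regex escape: the compiler produces two characters, backslash and n.
        static int escapedLength() {
            return "\\n".length();
        }

        // True when the given regex matches a string consisting of one LF.
        static boolean matchesLineFeed(String regex) {
            return Pattern.matches(regex, "\n");
        }

        public static void main(String[] args) {
            System.out.println(rawLength());             // 1
            System.out.println(escapedLength());         // 2
            System.out.println(matchesLineFeed("\n"));   // true
            System.out.println(matchesLineFeed("\\n"));  // true
            System.out.println(matchesLineFeed("\\t"));  // false
        }
    }
    ```

    Note that this equivalence only holds for escapes that exist at both levels. "\\d" is a valid regex escape, but "\d" is not a valid string escape and does not compile.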

    However, if you use the Pattern.COMMENTS flag (which lets you add comments and format a pattern nicely by making the regex engine ignore all unescaped whitespace in the pattern), you will need to use either "\\n" or "\\\n" to define a newline (LF) in the Java string literal, and "\\r" or "\\\r" to define a carriage return (CR).

    See a Java test:

    String s = "\n";
    System.out.println(s.replaceAll("\n", "LF")); // => LF
    System.out.println(s.replaceAll("\\n", "LF")); // => LF
    System.out.println(s.replaceAll("(?x)\\n", "LF")); // => LF
    System.out.println(s.replaceAll("(?x)\\\n", "LF")); // => LF
    System.out.println(s.replaceAll("(?x)\n", "<LF>")); 
    // => <LF>
    //<LF>
    

    Why does the last one produce <LF>+newline+<LF>? Because "(?x)\n" is equal to "", an empty pattern: the unescaped newline is ignored as whitespace, and the empty pattern matches the empty string before the newline and after it.
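    The same behavior can be observed by counting matches with the flag passed to Pattern.compile instead of the inline (?x). This is only a sketch; the class and method names are hypothetical:

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class CommentsFlagDemo {
        // Counts how often the regex, compiled with Pattern.COMMENTS, matches in the input.
        static int countMatches(String regex, String input) {
            Matcher m = Pattern.compile(regex, Pattern.COMMENTS).matcher(input);
            int count = 0;
            while (m.find()) {
                count++;
            }
            return count;
        }

        public static void main(String[] args) {
            // Regex escape: survives comments mode, matches the single LF.
            System.out.println(countMatches("\\n", "\n"));  // 1
            // Backslash followed by a raw LF: the escaped whitespace is kept.
            System.out.println(countMatches("\\\n", "\n")); // 1
            // Raw LF only: ignored as whitespace, leaving an empty pattern that
            // matches the empty string before and after the LF.
            System.out.println(countMatches("\n", "\n"));   // 2
        }
    }
    ```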
