Java-8 regex negative lookbehind with `\R`

While answering another question, I wrote a regex to match all whitespace up to and including at most one newline. I did this using a negative lookbehind for \R

2 Answers

陌清茗, 2020-12-31 23:43
                                                                       
The construct \R is a macro that wraps its subexpressions in an atomic group, (?> parts ).

That's why it won't break them apart.

Note: if Java accepts fixed-length alternations in a lookbehind, using \R there is fine; if the engine does not, this would throw an exception.
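
To make the "won't break them apart" point concrete, here is a minimal sketch of atomic versus ordinary grouping. It uses hand-written groups rather than \R itself, so the result does not depend on how a particular JDK implements \R. The class name AtomicGroupDemo and its method names are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and methods are illustrative only.
class AtomicGroupDemo {
    // Atomic group: once "\r\n" has matched inside (?>...), the engine
    // may not re-enter the group to try the shorter "\r" alternative.
    static boolean atomicMatches(String input) {
        return Pattern.compile("(?>\\r\\n|\\r)\\n").matcher(input).matches();
    }

    // Ordinary non-capturing group: the engine may backtrack into the
    // group and fall back to the "\r" alternative.
    static boolean plainMatches(String input) {
        return Pattern.compile("(?:\\r\\n|\\r)\\n").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(atomicMatches("\r\n")); // false
        System.out.println(plainMatches("\r\n"));  // true
    }
}
```

With the atomic group, "\r\n" is consumed as a unit and nothing is left for the trailing \n, so the match fails. The plain group gives back the \n and succeeds.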
                                                                        
南方客, 2020-12-31 23:48
                                                                       
Realization #1. The documentation is wrong
Source: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
Here it says:

Linebreak matcher
...is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

However, on Java 8, substituting the "equivalent" pattern for \R changes the result from true to false:
String _R_ = "\\R";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true

// using "equivalent" pattern
_R_ = "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // false

// now make it atomic, as per sln's answer
_R_ = "(?>"+_R_+")";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true

So the Javadoc should really say:

...is equivalent to (?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

Update March 9, 2017 per Sherman at Oracle JDK-8176029:

"api doc is NOT wrong, the implementation is wrong (which fails to backtracking "0x0d+next.match()" when "0x0d+0x0a + next.match()" fails)"

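Since the behaviour of \R is attributed to the implementation rather than the specification, it may differ between runtimes. The following sketch is one way to check which spelled-out form the running JVM's \R agrees with. The class name LinebreakLookbehindCheck and its members are hypothetical, and no expected value is given for the \R case because it depends on the runtime.

```java
// Hypothetical check class; names are illustrative, not from the original answer.
class LinebreakLookbehindCheck {
    // The expansion quoted from the Pattern Javadoc.
    static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Runs the answer's test: each whitespace character in "\r\n" must not
    // be immediately preceded by something the given sub-pattern matches.
    static boolean test(String linebreak) {
        return "\r\n".matches("((?<!" + linebreak + ")\\s)*");
    }

    public static void main(String[] args) {
        boolean documented = test(DOCUMENTED);           // false
        boolean atomic = test("(?>" + DOCUMENTED + ")"); // true
        boolean builtin = test("\\R");                   // depends on the runtime

        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("documented expansion: " + documented);
        System.out.println("atomic expansion:     " + atomic);
        System.out.println("\\R:                   " + builtin);
        System.out.println(builtin == documented
            ? "\\R behaves like the documented expansion here"
            : "\\R behaves like the atomic expansion here");
    }
}
```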

Realization #2. Lookbehinds don't only look backwards
Despite the name, a lookbehind is not limited to looking backwards: it can contain a lookahead that inspects, and even reads past, the current position.
Consider the following example (from rexegg.com):
"_12_".replaceAll("(?<=_(?=\\d{2}_))\\d+", "##"); // _##_


"This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore. That's acrobatic."

What this means for our example of \R is that even though our current position may be \n, that will not stop the lookbehind from recognizing that its \r is followed by \n, then binding the two together as an atomic group, and consequently refusing to recognize the \r part behind the current position as a separate match.
Note: for simplicity's sake I have used terms such as "our current position is \n"; this is not an exact representation of what occurs internally.
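
A small sketch of the rexegg example, extended with one contrasting input to show that the lookahead nested in the lookbehind really does constrain what follows the current position. The class name LookaheadInLookbehindDemo and the helper mask are invented for illustration.

```java
// Hypothetical demo class; the name and helper are illustrative only.
class LookaheadInLookbehindDemo {
    // The lookbehind requires "_" just before the current position; the
    // lookahead nested inside it then requires exactly two digits and "_"
    // ahead of the current position.
    static final String REGEX = "(?<=_(?=\\d{2}_))\\d+";

    static String mask(String input) {
        return input.replaceAll(REGEX, "##");
    }

    public static void main(String[] args) {
        System.out.println(mask("_12_"));  // _##_
        System.out.println(mask("_123_")); // _123_ (three digits, so the nested lookahead fails)
    }
}
```

The second input is left unchanged even though "_" precedes the digits, because the condition that fails lies ahead of the current position, not behind it.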