Regular expression that both includes and excludes certain strings in R


I am trying to use R to parse through a number of entries. I have two requirements for the entries I want back. I want all the entries that contain the word apple<


                      
3 Answers
花落未央 (2020-12-10 14:45)

You could combine two grepl calls, one for the word to include and a negated one for the word to exclude:

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
temp[grepl("apple", temp) & !grepl("orange", temp)]

## [1] "I like apples"      "I really like apples"
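
The same idea extends to several required and forbidden words. The following is only a sketch: the helper name `filter_entries` and its arguments are hypothetical (not part of the original answer), and it assumes the words should be matched literally, hence `fixed = TRUE`.

```r
# Hypothetical helper: keep entries that contain every `include` term
# and none of the `exclude` terms. Terms are matched as literal substrings.
filter_entries <- function(x, include, exclude) {
  # One logical vector per term, combined with AND (start value: all TRUE)
  has_all <- Reduce(`&`, lapply(include, grepl, x = x, fixed = TRUE),
                    rep(TRUE, length(x)))
  # One logical vector per term, combined with OR (start value: all FALSE)
  has_any <- Reduce(`|`, lapply(exclude, grepl, x = x, fixed = TRUE),
                    rep(FALSE, length(x)))
  x[has_all & !has_any]
}

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
filter_entries(temp, include = "apple", exclude = "orange")
## [1] "I like apples"        "I really like apples"
```

Because each term gets its own grepl call, no lookahead is needed and the pattern stays readable as the word lists grow.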

囚心锁ツ (2020-12-10 14:57)

Using a regular expression, you could do the following.

x <- c('I like apples', 'I really like apples', 
       'I like apples and oranges', 'I like oranges and apples',
       'I really like oranges and apples but oranges more')

x[grepl('^((?!.*orange).)*apple.*$', x, perl=TRUE)]
# [1] "I like apples"        "I really like apples"


How it works: at each position before apple, the negative lookahead (?!.*orange) checks that orange does not occur anywhere in the rest of the line; only then does the dot consume one character (any character except a line break). That group is repeated zero or more times. The pattern then matches apple, followed by any remaining characters (.*). Finally, the ^ and $ anchors make sure the whole line is consumed.
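
To see how the lookahead behaves, it can help to run the pattern against a few more inputs. This is a small sketch; the test strings are made up for illustration. Note the last input: the lookahead is only evaluated at positions before apple, so a string that begins with apple still matches even though it contains orange. A single lookahead anchored at the start does not have this gap.

```r
tests <- c('I like apples',
           'I like apples and oranges',
           'oranges and apples',
           'apples and oranges')

# Lookahead repeated before every consumed character
per_char <- grepl('^((?!.*orange).)*apple.*$', tests, perl = TRUE)
per_char
# [1]  TRUE FALSE FALSE  TRUE

# Single lookahead at the start of the string
anchored <- grepl('^(?!.*orange).*apple.*$', tests, perl = TRUE)
anchored
# [1]  TRUE FALSE FALSE FALSE
```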



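UPDATE: If performance is an issue, a single lookahead at the start of the string is cheaper than repeating it before every character. Keep apple in the pattern so that it is still required:

x[grepl('^(?!.*orange).*apple.*$', x, perl=TRUE)]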

感情败类 (2020-12-10 14:59)

This regex is a bit shorter and much faster than the other regex versions (see the comparison below). I don't have the tools to compare it with David's double grepl, so if someone can benchmark the single grep below against the double grepl, we'll know which is faster. The comparison should cover both a success case and a failure case.

^(?!.*orange).*apple.*$



The negative lookahead (?!.*orange) at the start ensures the string does not contain orange.
The rest of the pattern, .*apple.*, simply matches the string as long as it contains apple. No lookahead is needed there.


Code Sample

grep("^(?!.*orange).*apple.*$", subject, perl=TRUE, value=TRUE)
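
Here subject stands for whatever character vector is being filtered. A minimal runnable sketch with made-up sample data: value = TRUE returns the matching strings, while the default returns their indices.

```r
# Sample data, invented for illustration
subject <- c("I like apples", "I like apples and oranges", "I like pears")
pattern <- "^(?!.*orange).*apple.*$"

# Matching strings
matches <- grep(pattern, subject, perl = TRUE, value = TRUE)
matches
# [1] "I like apples"

# Positions of the matching strings
idx <- grep(pattern, subject, perl = TRUE)
idx
# [1] 1
```

perl = TRUE is required because lookaheads are not supported by R's default regex engine.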


Speed Comparison

@hwnd has now removed that double lookahead version, but according to RegexBuddy the speed difference remains:


1. Against I like apples and oranges, the engine takes 22 steps to fail, vs. 143 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 22 steps for ^((?!.*orange).)*apple.*$ (equal there, but wait for point 2).
2. Against I really like apples, the engine takes 64 steps to succeed, vs. 104 for the double lookahead version ^(?=.*apple)((?!orange).)*$ and 538 steps for ^((?!.*orange).)*apple.*$.


These numbers were provided by the RegexBuddy debugger.
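
Step counts from a regex debugger are not the same thing as run time in R. One rough way to compare the single grep with the double grepl is to time both on a larger vector. The sketch below uses only base R; the vector size and repetition count are arbitrary choices, and timings vary by machine, so no figures are shown.

```r
# Test data: both success cases and a failure case, repeated many times
x <- rep(c("I like apples", "I really like apples", "I like apples and oranges"),
         times = 10000)

double_grepl <- function(x) x[grepl("apple", x) & !grepl("orange", x)]
single_grep  <- function(x) grep("^(?!.*orange).*apple.*$", x, perl = TRUE, value = TRUE)

# Both approaches should return the same entries before timing them
stopifnot(identical(double_grepl(x), single_grep(x)))

t_double <- system.time(for (i in 1:20) double_grepl(x))
t_single <- system.time(for (i in 1:20) single_grep(x))

# Elapsed seconds for each approach
print(c(double_grepl = t_double[["elapsed"]], single_grep = t_single[["elapsed"]]))
```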