Finding all of the matching substrings, not only the “most extended” one

后端未结

关注

 5  1864

The code

String s = \"y z a a a b c c z\";
Pattern p = Pattern.compile(\"(a )+(b )+(c *)c\");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.pri


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  有刺的猬        
                
              
                            
                2020-12-06 06:08
              
            
            
                                                                       
You will need a lazy quantifier.

Please try the following:

Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");


Please  also notice, that I grouped "c" once again, since I think that's what you want. Otherwise you would find arbitrarily many spaces, but not "c". 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤独总比滥情好        
                
              
                            
                2020-12-06 06:13
              
            
            
                                                                       
I don't know of any regex engines that can give back all valid matches.

But we can apply a bit of logic to generate all candidates string and present it to the regex.

A candidate is constructed by enumerating all possible substring of a given input.

var str = "y z a a a b c c z y z a a a b c c z";
var regex = new Regex("(a )+(b )+(c *)c");

var length = str.Length;

for (int start = 1; start <= length;start++){

    for (int groupLength = 1;  start + groupLength - 1 <= length ;groupLength++){

        var candidate = str.Substring(start-1,groupLength); //.Dump();

        //("\"" + candidate + "\"").Dump();

        var match = regex.Match(candidate);

        if (match.Value == candidate )
        {
            candidate.Dump();
        }

    }
}


This gives      

a a a b c c 
a a b c c 
a b c c 


which seems the correct answer but contradicts your result :

a a a b c => I state that this is not a match
a a b c c ok
a a b c => I state that this is not a match
a b c c ok
a b c => I state that this is not a match


For example, the regex that you give

(a )+(b )+(c *)c


does not match the first entry in your result

a a a b c 


The logic above can generate identical matches if you consider starting position not important. For example if you just repeat the given input another time :

"y z a a a b c c z y z a a a b c c z"


It will give :

a a a b c c
a a b c c
a b c c
a a a b c c
a a b c c
a b c c


If you consider position not important you should do a distinct on this result

The trivial case where the input is the empty string should als be added if considered a potential match.

FYI, this are all the candidates that the regex examines

"y"
"y "
"y z"
"y z "
"y z a"
"y z a "
"y z a a"
"y z a a "
"y z a a a"
"y z a a a "
"y z a a a b"
"y z a a a b "
"y z a a a b c"
"y z a a a b c "
"y z a a a b c c"
"y z a a a b c c "
"y z a a a b c c z"
" "
" z"
" z "
" z a"
" z a "
" z a a"
" z a a "
" z a a a"
" z a a a "
" z a a a b"
" z a a a b "
" z a a a b c"
" z a a a b c "
" z a a a b c c"
" z a a a b c c "
" z a a a b c c z"
"z"
"z "
"z a"
"z a "
"z a a"
"z a a "
"z a a a"
"z a a a "
"z a a a b"
"z a a a b "
"z a a a b c"
"z a a a b c "
"z a a a b c c"
"z a a a b c c "
"z a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a a"
" a a a "
" a a a b"
" a a a b "
" a a a b c"
" a a a b c "
" a a a b c c"
" a a a b c c "
" a a a b c c z"
"a"
"a "
"a a"
"a a "
"a a a"
"a a a "
"a a a b"
"a a a b "
"a a a b c"
"a a a b c "
"a a a b c c"
"a a a b c c "
"a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a b"
" a a b "
" a a b c"
" a a b c "
" a a b c c"
" a a b c c "
" a a b c c z"
"a"
"a "
"a a"
"a a "
"a a b"
"a a b "
"a a b c"
"a a b c "
"a a b c c"
"a a b c c "
"a a b c c z"
" "
" a"
" a "
" a b"
" a b "
" a b c"
" a b c "
" a b c c"
" a b c c "
" a b c c z"
"a"
"a "
"a b"
"a b "
"a b c"
"a b c "
"a b c c"
"a b c c "
"a b c c z"
" "
" b"
" b "
" b c"
" b c "
" b c c"
" b c c "
" b c c z"
"b"
"b "
"b c"
"b c "
"b c c"
"b c c "
"b c c z"
" "
" c"
" c "
" c c"
" c c "
" c c z"
"c"
"c "
"c c"
"c c "
"c c z"
" "
" c"
" c "
" c z"
"c"
"c "
"c z"
" "
" z"
"z"


Also it's good to know how the 2 main types of regexes (NFA and DFA) do their work

from http://msdn.microsoft.com/en-us/library/e347654k.aspx


  .NET (and JAVA too I think) are NFA regex engines (as opposed to DFA)
  and as it processes a particular language element, the engine uses
  greedy matching; that is, it matches as much of the input string as it
  possibly can. But it also saves its state after successfully matching
  a subexpression. If a match eventually fails, the engine can return to
  a saved state so it can try additional matches. This process of
  abandoning a successful subexpression match so that later language
  elements in the regular expression can also match is known as
  backtracking.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  太阳男子        
                
              
                            
                2020-12-06 06:20
              
            
            
                                                                       
The only way I can think of here would be to generate a list of all possible substrings of your original string and match the regex against each of them, retaining those items where it matched.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2020-12-06 06:27
              
            
            
                                                                       
Given these very specific constraints (i.e. this is not a general case solution), this will work:

import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {

    public static void main(String[] args) {

        String s = "y z a a a b c c z";

        Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
        Set<String> set = recurse(s, p, 0);
    }

    public static Set<String> recurse(String s, Pattern p, int depth) {
        int temp = depth;
        while(temp>0) {
            System.out.print("  ");
            temp--;
        }
        System.out.println("-> " +s);

        Matcher matcher = p.matcher(s);
        Set<String> set = new TreeSet<String>();

        if(matcher.find()) {
            String found = matcher.group().trim();
            set.add(found);
            set.addAll(recurse(found.substring(1), p, depth+1));
            set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
        }

        while(depth>0) {
            System.out.print("  ");
            depth--;
        }
        System.out.println("<- " +s);
        return set;
    }
}


I'm reasonably sure you can adapt it to work on other cases, but the recursion into the matched string means that overlapping matches (like the one pointed out by @ahenderson) won't work.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦毁少年i        
                
              
                            
                2020-12-06 06:32
              
            
            
                                                                       
You can use the reluctant qualifiers such as *? and +?. These match as little as possible, in contrast to the standard * and + which are greedy, i.e. match as much as possible. Still, this only allows you to find particular "sub-matches", not all of them. Some more control can be achieved using lookahead controlling non-capturing groups, also described in the docs. But in order to really find all sub-matches, you would probably have to do stuff yourself, i.e. build the automaton to which the regex corresponds and navigate it using custom code.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复