Finding all of the matching substrings, not only the “most extended” one

后端 未结 5 1864
甜味超标
甜味超标 2020-12-06 06:02

The code

String s = \"y z a a a b c c z\";
Pattern p = Pattern.compile(\"(a )+(b )+(c *)c\");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.pri         


        
相关标签:
5条回答
  • 2020-12-06 06:08

    You will need a lazy quantifier.

    Please try the following:

    Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");
    

    Please also notice, that I grouped "c" once again, since I think that's what you want. Otherwise you would find arbitrarily many spaces, but not "c".

    0 讨论(0)
  • 2020-12-06 06:13

    I don't know of any regex engines that can give back all valid matches.

    But we can apply a bit of logic to generate all candidates string and present it to the regex.

    A candidate is constructed by enumerating all possible substring of a given input.

    var str = "y z a a a b c c z y z a a a b c c z";
    var regex = new Regex("(a )+(b )+(c *)c");
    
    var length = str.Length;
    
    for (int start = 1; start <= length;start++){
    
        for (int groupLength = 1;  start + groupLength - 1 <= length ;groupLength++){
    
            var candidate = str.Substring(start-1,groupLength); //.Dump();
    
            //("\"" + candidate + "\"").Dump();
    
            var match = regex.Match(candidate);
    
            if (match.Value == candidate )
            {
                candidate.Dump();
            }
    
        }
    }
    

    This gives

    a a a b c c 
    a a b c c 
    a b c c 
    

    which seems the correct answer but contradicts your result :

    a a a b c => I state that this is not a match
    a a b c c ok
    a a b c => I state that this is not a match
    a b c c ok
    a b c => I state that this is not a match
    

    For example, the regex that you give

    (a )+(b )+(c *)c
    

    does not match the first entry in your result

    a a a b c 
    

    The logic above can generate identical matches if you consider starting position not important. For example if you just repeat the given input another time :

    "y z a a a b c c z y z a a a b c c z"
    

    It will give :

    a a a b c c
    a a b c c
    a b c c
    a a a b c c
    a a b c c
    a b c c
    

    If you consider position not important you should do a distinct on this result

    The trivial case where the input is the empty string should als be added if considered a potential match.

    FYI, this are all the candidates that the regex examines

    "y"
    "y "
    "y z"
    "y z "
    "y z a"
    "y z a "
    "y z a a"
    "y z a a "
    "y z a a a"
    "y z a a a "
    "y z a a a b"
    "y z a a a b "
    "y z a a a b c"
    "y z a a a b c "
    "y z a a a b c c"
    "y z a a a b c c "
    "y z a a a b c c z"
    " "
    " z"
    " z "
    " z a"
    " z a "
    " z a a"
    " z a a "
    " z a a a"
    " z a a a "
    " z a a a b"
    " z a a a b "
    " z a a a b c"
    " z a a a b c "
    " z a a a b c c"
    " z a a a b c c "
    " z a a a b c c z"
    "z"
    "z "
    "z a"
    "z a "
    "z a a"
    "z a a "
    "z a a a"
    "z a a a "
    "z a a a b"
    "z a a a b "
    "z a a a b c"
    "z a a a b c "
    "z a a a b c c"
    "z a a a b c c "
    "z a a a b c c z"
    " "
    " a"
    " a "
    " a a"
    " a a "
    " a a a"
    " a a a "
    " a a a b"
    " a a a b "
    " a a a b c"
    " a a a b c "
    " a a a b c c"
    " a a a b c c "
    " a a a b c c z"
    "a"
    "a "
    "a a"
    "a a "
    "a a a"
    "a a a "
    "a a a b"
    "a a a b "
    "a a a b c"
    "a a a b c "
    "a a a b c c"
    "a a a b c c "
    "a a a b c c z"
    " "
    " a"
    " a "
    " a a"
    " a a "
    " a a b"
    " a a b "
    " a a b c"
    " a a b c "
    " a a b c c"
    " a a b c c "
    " a a b c c z"
    "a"
    "a "
    "a a"
    "a a "
    "a a b"
    "a a b "
    "a a b c"
    "a a b c "
    "a a b c c"
    "a a b c c "
    "a a b c c z"
    " "
    " a"
    " a "
    " a b"
    " a b "
    " a b c"
    " a b c "
    " a b c c"
    " a b c c "
    " a b c c z"
    "a"
    "a "
    "a b"
    "a b "
    "a b c"
    "a b c "
    "a b c c"
    "a b c c "
    "a b c c z"
    " "
    " b"
    " b "
    " b c"
    " b c "
    " b c c"
    " b c c "
    " b c c z"
    "b"
    "b "
    "b c"
    "b c "
    "b c c"
    "b c c "
    "b c c z"
    " "
    " c"
    " c "
    " c c"
    " c c "
    " c c z"
    "c"
    "c "
    "c c"
    "c c "
    "c c z"
    " "
    " c"
    " c "
    " c z"
    "c"
    "c "
    "c z"
    " "
    " z"
    "z"
    

    Also it's good to know how the 2 main types of regexes (NFA and DFA) do their work

    from http://msdn.microsoft.com/en-us/library/e347654k.aspx

    .NET (and JAVA too I think) are NFA regex engines (as opposed to DFA) and as it processes a particular language element, the engine uses greedy matching; that is, it matches as much of the input string as it possibly can. But it also saves its state after successfully matching a subexpression. If a match eventually fails, the engine can return to a saved state so it can try additional matches. This process of abandoning a successful subexpression match so that later language elements in the regular expression can also match is known as backtracking.

    0 讨论(0)
  • 2020-12-06 06:20

    The only way I can think of here would be to generate a list of all possible substrings of your original string and match the regex against each of them, retaining those items where it matched.

    0 讨论(0)
  • 2020-12-06 06:27

    Given these very specific constraints (i.e. this is not a general case solution), this will work:

    import java.util.Set;
    import java.util.TreeSet;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class test {
    
        public static void main(String[] args) {
    
            String s = "y z a a a b c c z";
    
            Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
            Set<String> set = recurse(s, p, 0);
        }
    
        public static Set<String> recurse(String s, Pattern p, int depth) {
            int temp = depth;
            while(temp>0) {
                System.out.print("  ");
                temp--;
            }
            System.out.println("-> " +s);
    
            Matcher matcher = p.matcher(s);
            Set<String> set = new TreeSet<String>();
    
            if(matcher.find()) {
                String found = matcher.group().trim();
                set.add(found);
                set.addAll(recurse(found.substring(1), p, depth+1));
                set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
            }
    
            while(depth>0) {
                System.out.print("  ");
                depth--;
            }
            System.out.println("<- " +s);
            return set;
        }
    }
    

    I'm reasonably sure you can adapt it to work on other cases, but the recursion into the matched string means that overlapping matches (like the one pointed out by @ahenderson) won't work.

    0 讨论(0)
  • 2020-12-06 06:32

    You can use the reluctant qualifiers such as *? and +?. These match as little as possible, in contrast to the standard * and + which are greedy, i.e. match as much as possible. Still, this only allows you to find particular "sub-matches", not all of them. Some more control can be achieved using lookahead controlling non-capturing groups, also described in the docs. But in order to really find all sub-matches, you would probably have to do stuff yourself, i.e. build the automaton to which the regex corresponds and navigate it using custom code.

    0 讨论(0)
提交回复
热议问题