REGEX - Matching any character which repeats n times

前端 未结 4 1597
有刺的猬
有刺的猬 2021-01-01 19:45

How to match any character which repeats n times?

Example:

for input: abcdbcdcdd
for n=1:   ..........
for n=2:    .........
for n=3:            


        
相关标签:
4条回答
  • 2021-01-01 20:07

    I would not use regular expressions for this. I would use a scripting language such as python. Try out this python function:

    alpha = 'abcdefghijklmnopqrstuvwxyz'
    def get_matched_chars(n, s):
        s = s.lower()
        return [char for char in alpha if s.count(char) == n]
    

    The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.

    0 讨论(0)
  • 2021-01-01 20:09

    let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.

    And obviously, without regex, this is a very simple string manipulation problem. I'm trying to do this with and only with regex.

    As with any regex implementation, the answer depends on the regex flavour. You could create a solution with .net regex engine, because it allows variable width lookbehinds.

    Also, I'll provide a more generalized solution below for perl-compatible/like regex flavours.


    .net Solution

    As @PetSerAl pointed out in his answer, with variable width lookbehinds, we can assert back to the beggining of the string, and check there are n occurrences.
    ideone demo

    regex module in Python
    You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.

    >>> import regex
    >>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
    ['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
    >>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
    ['c', 'd', 'c', 'd', 'c', 'd', 'd']
    >>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
    ['d', 'd', 'd', 'd']
    >>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
    []
    


    Generalized Solution

    In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.

    Strategy

    For any given n, the logic involves:

    1. Early matches: Match and capture every character followed by at least n more occurences.
    2. Final captures:
      • Match and capture a character followed by exactly n-1 occurences, and
      • also capture every one of the following occurrences.

    Example

    for n = 3
    input = abcdbcdcdd
    

    The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:

    abcdbcdcdd
      M  C C
    

    and the character d is (early) Matched once:

    abcdbcdcdd
       M
    

    and (finally) Matched one more time, Capturing the rest:

    abcdbcdcdd
          M CC
    

    Regex

    /(\w)                        # match 1 character
    (?:
        (?=(?:.*?\1){≪N≫})     # [1] followed by other ≪N≫ occurrences
      |                          #   OR
        (?=                      # [2] followed by:
            (?:(?!\1).)*(\1)     #      2nd occurence <captured>
            (?:(?!\1).)*(\1)     #      3rd occurence <captured>
            ≪repeat previous≫  #      repeat subpattern (n-1) times
                                 #     *exactly (n-1) times*
            (?!.*?\1)            #     not followed by another occurence
        )
    )/xg
    

    For n =

    1. /(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
      demo
    2. /(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
      demo
    3. /(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
      demo
    4. ... etc.

    Pseudocode to generate the pattern

    // Variables: N (int)
    
    character = "(\w)"
    early_match = "(?=(?:.*?\1){" + N + "})"
    
    final_match = "(?="
    for i = 1; i < N; i++
        final_match += "(?:(?!\1).)*(\1)"
    final_match += "(?!.*?\1))"
    
    pattern = character + "(?:" + early_match + "|" + final_match + ")"
    

    JavaScript Code

    I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).

    var str = 'abcdbcdcdd';
    var pattern, re, match, N, i;
    var output = "";
    
    // We'll show the results for N = 2, 3 and 4
    for (N = 2; N <= 4; N++) {
        // Generate pattern
        pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
        for (i = 1; i < N; i++) {
            pattern += "(?:(?!\\1).)*(\\1)";
        }
        pattern += "(?!.*?\\1)))";
        re = new RegExp(pattern, "g");
        
        output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
        
        // Loop all matches
        while ((match = re.exec(str)) !== null) {
            output += "\nPos: " + match.index + "\tMatch:";
            // Loop all captures
            x = 1;
            while (match[x] != null) {
                output += " " + match[x];
                x++;
            }
        }
        
        output += "</pre>";
    }
    
    document.write(output);

    Python3 code

    As requested by the OP, I'm linking to a Python3 implementation in ideone.com

    0 讨论(0)
  • 2021-01-01 20:09

    Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer and fortunately this is your case.

    Solving this problem is much easier if we first construct a nondeterministic finite automata (NFA) ad then convert it to regular expression.

    So the following automata for n=2 and input alphabet = {a,b,c,d}

    will match any string that has exactly 2 repetitions of any char. If no character has 2 repetitions (all chars appear less or more that two times) the string will not match.

    Converting it to regex should look like

    "^([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^b]*c[^c]*c[^C]*)|([^d]*d[^d]*d[^d]*)$"
    

    This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.

    0 讨论(0)
  • 2021-01-01 20:11

    With .NET regular expressions you can do following:

    (\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
    

    Where:

    • (\w) — any character, captured in first group.
    • (?<=^.*) — lookbehind assertion, which return us to the start of the string.
    • (?=(?:.*\1){n}) — lookahead assertion, to see if string have n instances of that character.

    Demo

    0 讨论(0)
提交回复
热议问题