Regular Expression For Duplicate Words

终归单人心 2020-11-22 11:13

I'm a regular expression newbie, and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words.

13 Answers
  • 2020-11-22 11:51

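    Try the regex below:

    • \b : word boundary at the start of a word
    • (\w+) : one or more word characters, captured as group 1
    • \W+ : one or more non-word characters (the separator)
    • \1 : the same word that group 1 already matched
    • \b : word boundary at the end of the word
    • ( )* : repeats the separator-plus-word group zero or more times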

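      import java.util.Scanner;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class DuplicateWords {
          public static void main(String[] args) {
              // A word, then zero or more repeats of it separated by non-word characters.
              String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
              // CASE_INSENSITIVE lets "Goodbye" and "goodbye" count as the same word.
              Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

              Scanner in = new Scanner(System.in);

              // The first input line holds the number of sentences that follow.
              int numSentences = Integer.parseInt(in.nextLine());

              while (numSentences-- > 0) {
                  String input = in.nextLine();

                  Matcher m = p.matcher(input);

                  // Collapse each run of repeated words to its first occurrence.
                  // Matcher.replaceAll avoids re-interpreting the matched text as a new regex,
                  // which String.replaceAll(m.group(0), ...) would do.
                  input = m.replaceAll("$1");

                  // Prints the modified sentence.
                  System.out.println(input);
              }

              in.close();
          }
      }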
      
  • 2020-11-22 11:51

    This is the regex I use to remove repeated phrases in my Twitch bot:

    (\S+\s*)\1{2,}
    

    (\S+\s*) captures any run of non-whitespace characters, followed by optional whitespace.

    \1{2,} then requires at least two more copies of the captured text, so the pattern matches only when the phrase appears three or more times in a row.
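
    A minimal Java sketch of how this pattern could be applied, assuming the goal is to collapse each tripled-or-more run down to one copy. The class name SpamCollapser and the method collapse are hypothetical names used only for illustration; they are not from the original bot.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions for illustration.
class SpamCollapser {
    // One phrase (a non-whitespace run plus optional trailing whitespace),
    // followed by at least two more identical copies of it.
    private static final Pattern TRIPLED = Pattern.compile("(\\S+\\s*)\\1{2,}");

    // Replaces each run of three or more identical phrases with a single copy.
    static String collapse(String message) {
        return TRIPLED.matcher(message).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("spam spam spam spam ok")); // prints "spam ok"
        System.out.println(collapse("hi hi there"));            // prints "hi hi there"
    }
}
```

    Because the pattern has no word boundaries, it also collapses tripled characters inside a word: "hmmm" becomes "hm". Whether that is acceptable depends on the use case.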

  • Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.

    Pattern: /(\b\S+)(?:\s+\1\b)+/
    Replace: $1 (replaces the fullstring match with capture group #1)

    This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).

    Specifically:

    • The \b (word boundary) assertions are vital to ensure partial words are not matched.
    • The second parenthetical is a non-capturing group, because this variable-width substring does not need to be captured -- only matched/absorbed.
    • The + (one or more) quantifier on the non-capturing group is more appropriate than * because * will "bother" the regex engine to match and replace singleton occurrences -- this is wasteful pattern design.

    Note: if you are dealing with sentences or input strings that contain punctuation, the pattern will need to be further refined.
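
    The pattern and replacement above could be sketched in Java roughly as follows. The class name RepeatCollapser and the method collapse are hypothetical, chosen only for this example; the regex itself is the one given in this answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions for illustration.
class RepeatCollapser {
    // A whole non-whitespace token, then one or more whitespace-separated copies of it.
    private static final Pattern REPEATS = Pattern.compile("(\\b\\S+)(?:\\s+\\1\\b)+");

    // Replaces the full match (the token plus all its repeats) with capture group 1.
    static String collapse(String text) {
        return REPEATS.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints "one two three four"
        System.out.println(collapse("one two two three three three four"));
        // prints "the theme": the trailing \b rejects the partial word
        System.out.println(collapse("the theme"));
        // prints "wait, wait": the comma is part of the first token, so nothing repeats
        System.out.println(collapse("wait, wait"));
    }
}
```

    The last call shows the punctuation limitation mentioned in the note: "wait," and "wait" are different tokens to \S+, so the pair is left untouched.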

  • 2020-11-22 11:52

    The expression below should find any number of consecutive repeated words. The matching can be case insensitive.

    String regex = "\\b(\\w+)(\\s+\\1\\b)*";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    
    Matcher m = p.matcher(input);
    
    // Replace every run of repeated words with its first word (group 1).
    // Matcher.replaceAll keeps the word boundaries of the pattern; calling
    // String.replaceAll on the matched text could also hit partial words.
    input = m.replaceAll("$1");
    

    Sample Input : Goodbye goodbye GooDbYe

    Sample Output : Goodbye

    Explanation:

    The regular expression:

    \b : Word boundary at the start of a word

    (\w+) : One or more word characters, captured as group 1

    (\s+\1\b)* : One or more whitespace characters followed by a repeat of the captured word, ending on a word boundary. The trailing * lets the group match any number of repetitions, including none.

    Grouping:

    m.group(0) : Contains the whole match; for the sample input above, Goodbye goodbye GooDbYe

    m.group(1) : Contains the first word of the match; for the sample input above, Goodbye

    The replacement substitutes each run of consecutive repeated words with the first instance of the word.

  • 2020-11-22 11:52

    No. Matching a repeated word is not something a regular grammar can describe. There may be engine- or language-specific regular expression extensions (such as backreferences) that you can use, but there is no universal regular expression that can do that.

  • 2020-11-22 11:56

    Try this regular expression:

    \b(\w+)\s+\1\b
    

    Here \b is a word boundary and \1 references the captured match of the first group.
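
    As a rough illustration, the expression could be used in Java to list the words that appear twice in a row. DuplicateFinder and findDoubledWords are hypothetical names introduced only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are assumptions for illustration.
class DuplicateFinder {
    // A whole word, whitespace, then the same word again, ending on a word boundary.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns the captured word (group 1) of every doubled-word match, in order.
    static List<String> findDoubledWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = DOUBLED.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [the, and]
        System.out.println(findDoubledWords("Paris in the the spring and and so on"));
    }
}
```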
