Regular Expression For Duplicate Words

I'm a regular expression newbie, and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as

13 Answers

温柔的废话

2020-11-22 11:51

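Try the regular expression below:

\b : word boundary at the start of a word
\w+ : one or more word characters (captured as group 1)
\W+ : one or more non-word characters between the words
\1 : the same word that group 1 already matched
\b : word boundary at the end of the word

(...)* : the repeated part may occur zero or more times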

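import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary.
public class DuplicateWords {
    public static void main(String[] args) {
        // A word, then zero or more repeats of it separated by non-word characters.
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        // CASE_INSENSITIVE lets "Goodbye" and "goodbye" count as the same word.
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

        Scanner in = new Scanner(System.in);

        // The first line holds the number of sentences that follow.
        int numSentences = Integer.parseInt(in.nextLine());

        while (numSentences-- > 0) {
            String input = in.nextLine();

            Matcher m = p.matcher(input);

            // Replace every run of repeated words with its first word (group 1).
            // Matcher.replaceAll avoids re-interpreting the matched text as a regex,
            // which String.replaceAll(m.group(0), ...) would do.
            input = m.replaceAll("$1");

            // Print the modified sentence.
            System.out.println(input);
        }

        in.close();
    }
}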

挽巷

2020-11-22 11:51
This is the regex I use to remove duplicate phrases in my Twitch bot:
```
(\S+\s*)\1{2,}
```
(\S+\s*) looks for any run of non-whitespace characters, followed by optional whitespace.

\1{2,} then requires at least two more consecutive copies of that phrase. In other words, the pattern matches only when the same phrase (including its trailing whitespace) appears three or more times in a row.
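
As a rough sketch of how that pattern behaves in Java: the class and method names below are made up for illustration, and using `find()` to flag a message is an assumption about how a bot might apply the pattern, not something taken from the original bot.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the original bot.
final class RepeatedPhraseCheck {
    // A chunk of non-whitespace plus optional trailing whitespace,
    // followed by at least two more identical copies of that chunk.
    private static final Pattern REPEATED = Pattern.compile("(\\S+\\s*)\\1{2,}");

    // Returns true if the message contains a chunk repeated three or more times in a row.
    static boolean hasRepeatedPhrase(String message) {
        return REPEATED.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedPhrase("spam spam spam spam")); // true
        System.out.println(hasRepeatedPhrase("spam spam"));           // false
    }
}
```

Because the whitespace is part of the captured chunk, every copy has to be followed by the same whitespace. Repeated characters inside a single word (for example `aaa`) also count as a match, which may or may not be what you want.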
不要未来只要你来

2020-11-22 11:51
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.

Pattern: /(\b\S+)(?:\s+\1\b)+/
Replace: $1 (replaces the full-string match with capture group #1)

This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).

Specifically:
- \b (word boundary) characters are vital to ensure partial words are not matched.
- The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
- The + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
Note: if you are dealing with sentences or input strings that contain punctuation, the pattern will need to be refined further.
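
The pattern and `$1` replacement above could be tried out in Java roughly as follows. This is only a sketch: the class and method names are hypothetical, and the pattern is written as a Java string literal, so every backslash is doubled.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the pattern and the "$1" replacement.
final class ConsecutiveDuplicates {
    // A whole non-whitespace token, then one or more whitespace-separated copies of it.
    private static final Pattern DUPLICATES =
            Pattern.compile("(\\b\\S+)(?:\\s+\\1\\b)+");

    // Collapses each run of repeated tokens to its first occurrence.
    static String collapse(String text) {
        return DUPLICATES.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello hello hello world world")); // hello world
        System.out.println(collapse("the then"));                      // the then
    }
}
```

The second call shows the effect of the trailing `\b`: "the" followed by "then" is left alone because the copy would end in the middle of a word.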
有刺的猬

2020-11-22 11:52
The expression below should find any number of consecutive duplicate words. The matching can be case-insensitive.
```
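// Requires java.util.regex.Matcher and java.util.regex.Pattern.
String input = "Goodbye goodbye GooDbYe";

String regex = "\\b(\\w+)(\\s+\\1\\b)*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(input);

// Replace every matched run (group 0) with its first word (group 1).
// Matcher.replaceAll only touches the spans the pattern actually matched,
// unlike String.replaceAll(m.group(0), m.group(1)), which re-searches the
// whole string for the matched text.
input = m.replaceAll("$1");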
```
Sample Input : Goodbye goodbye GooDbYe

Sample Output : Goodbye

Explanation:

The regular expression:

\b : Word boundary at the start of a word

\w+ : One or more word characters

(\s+\1\b)* : One or more whitespace characters, followed by a word that matches the previous word and ends at a word boundary. The * around the whole group allows any number of repetitions.

Grouping:

m.group(0) : Contains the entire match; in the case above, Goodbye goodbye GooDbYe

m.group(1) : Contains the first word of the match; in the case above, Goodbye

The replacement swaps each run of consecutive matching words for the first instance of the word.
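
To see what the two groups hold for the sample input, a small sketch like the following can help. The class and method names are hypothetical; only the pattern and the flag come from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting the groups of the first match.
final class GroupInspection {
    private static final Pattern REPEATS =
            Pattern.compile("\\b(\\w+)(\\s+\\1\\b)*", Pattern.CASE_INSENSITIVE);

    // Returns {group 0, group 1} for the first match, or null if nothing matches.
    static String[] firstMatchGroups(String input) {
        Matcher m = REPEATS.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("Goodbye goodbye GooDbYe");
        System.out.println(groups[0]); // Goodbye goodbye GooDbYe
        System.out.println(groups[1]); // Goodbye
    }
}
```

With `CASE_INSENSITIVE` set, the backreference `\1` also compares without regard to case, which is why the differently cased copies are all absorbed into group 0.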
滥情空心

2020-11-22 11:52

No. Strictly speaking, that is not a regular language. There may be engine- or language-specific regular expression features that you can use, but there is no universal regular expression that can do that.

星月不相逢

2020-11-22 11:56
Try this regular expression:
```
\b(\w+)\s+\1\b
```
Here \b is a word boundary and \1 references the captured match of the first group.
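
If the goal is to locate doubled words rather than remove them, one possible Java sketch is shown below. The class and method names are invented for this example, and the pattern is case-sensitive as written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that lists every word that is immediately repeated.
final class DoubledWordFinder {
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    static List<String> findDoubledWords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DOUBLED.matcher(text);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the word that was repeated
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findDoubledWords("Paris in the the spring")); // [the]
    }
}
```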