问题
I'm a regular expression newbie, and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
回答1:
Try this regular expression:
\b(\w+)\s+\1\b
Here \b
is a word boundary and \1
references the captured match of the first group.
回答2:
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
回答3:
Try this with below RE
- \b start of word word boundary
- \W+ any word character
- \1 same word matched already
- \b end of word
()* Repeating again
public static void main(String[] args) { String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";// "/* Write a RegEx matching repeated words here. */"; Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/); Scanner in = new Scanner(System.in); int numSentences = Integer.parseInt(in.nextLine()); while (numSentences-- > 0) { String input = in.nextLine(); Matcher m = p.matcher(input); // Check for subsequences of input that match the compiled pattern while (m.find()) { input = input.replaceAll(m.group(0),m.group(1)); } // Prints the modified sentence. System.out.println(input); } in.close(); }
回答4:
The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
回答5:
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
回答6:
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*)
looks for any string of characters that isn't whitespace, followed whitespace.
\1{2,}
then looks for more than 2 instances of that phrase in the string to match. If there are 3 phrases that are identical, it matches.
回答7:
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
回答8:
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
回答9:
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicates words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b
is used for Word Boundary, ?=
is used for positive lookahead, and \1
is used for back-referencing.
Example Source
回答10:
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+)
to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)
), followed by an end of string ($
) or a number of spaces (\s+
), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
回答11:
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/
(Pattern Demo)
Replace: $1
(replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b
(word boundary) characters are vital to ensure partial words are not matched.- The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
- the
+
(one or more quantifier) on the non-capturing group is more appropriate than*
because*
will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
回答12:
The below expression should work correctly to find any number of consecutive words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
回答13:
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b
来源:https://stackoverflow.com/questions/2823016/regular-expression-for-consecutive-duplicate-words