Suppose I have a string such as
'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
I want to remov
Thanks everyone for your attempts and comments. I have finally found a solution:
import re

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)
# 'I hate *some* kinds of duplicate. This string has a duplicate phrase.'
The regular expression
r'((\b\w+\b.{1,2}\w+\b)+).+\1'
matches a phrase made up of runs of word characters (`\w`: letters, digits, and underscore) separated by one or two arbitrary characters (to cover the case where words are separated not just by a space, but perhaps by a period or comma and a space), followed by one or more arbitrary characters and then the same phrase again. Then
re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)
replaces each such match with just the first occurrence of the repeated text (capture group 1), ignoring case because the duplicate phrase could sometimes occur at the beginning of a sentence.
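To make it easier to see what the pattern actually captures, here is a small sketch that inspects the match before substituting. The names `DUPLICATE_PHRASE` and `remove_duplicate_phrase`, and the `capitalized` sample string, are made up for illustration. Only the pattern and the string `s` come from the solution above.

```python
import re

# Same pattern as in the solution above, compiled once for reuse.
# re.I also makes the \1 backreference compare case-insensitively.
DUPLICATE_PHRASE = re.compile(r'((\b\w+\b.{1,2}\w+\b)+).+\1', flags=re.I)

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

# Look at the first match before replacing anything.
m = DUPLICATE_PHRASE.search(s)
phrase = m.group(1)  # the captured phrase
whole = m.group(0)   # everything the pattern consumed
print(repr(phrase))  # 'duplicate phrase'
print(repr(whole))   # 'duplicate phrase, duplicate phrase'


def remove_duplicate_phrase(text):
    """Hypothetical helper: collapse 'phrase ... phrase' to the first copy."""
    return DUPLICATE_PHRASE.sub(r'\1', text)


print(remove_duplicate_phrase(s))
# I hate *some* kinds of duplicate. This string has a duplicate phrase.

# The first copy is capitalized here, so the match depends on re.I.
capitalized = 'Duplicate phrase, duplicate phrase is annoying.'
print(remove_duplicate_phrase(capitalized))
# Duplicate phrase is annoying.

# Without re.I the backreference is case-sensitive and nothing is replaced.
case_sensitive = re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', capitalized)
print(case_sensitive == capitalized)  # True
```

One thing to keep in mind: because `.+` matches any text between the two copies, whatever lies between them is removed along with the second copy. The approach is therefore best suited to strings where the repeat follows the original closely, as in the example.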