How to remove duplicate phrases in Python?

一个人的身影 2021-01-18 11:42

Suppose I have a string such as

'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

I want to remov

1 Answer
  • 2021-01-18 12:06

    Thanks everyone for your attempts and comments. I have finally found a solution:

    import re

    s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
    re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)
    # 'I hate *some* kinds of duplicate. This string has a duplicate phrase.'
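
To check what the substitution is acting on, the match can be inspected directly with re.search. This is only a minimal sketch using the same string and pattern; the variable names (pattern, matched_text, kept_phrase) are illustrative and not part of the original solution.

```python
import re

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
pattern = r'((\b\w+\b.{1,2}\w+\b)+).+\1'

# re.search returns the first (leftmost) match, which is the span re.sub replaces.
m = re.search(pattern, s, flags=re.I)

matched_text = m.group(0)  # everything that gets replaced
kept_phrase = m.group(1)   # what the r'\1' replacement writes back

print(repr(matched_text))  # 'duplicate phrase, duplicate phrase'
print(repr(kept_phrase))   # 'duplicate phrase'
```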
    

    Explanation

    The regular expression

    r'((\b\w+\b.{1,2}\w+\b)+).+\1'
    

    matches a phrase made of runs of word characters (\w: letters, digits and underscore) separated by one or two arbitrary characters (to cover the case where words are separated not just by a space, but perhaps by a period or comma and a space), followed by at least one arbitrary character and then the same phrase again. Then

    re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)
    

    replaces each such match with the captured phrase (group 1), so the first copy is kept while the text in between and the second copy are dropped. The re.I flag makes the match case-insensitive (since the duplicate phrase could sometimes occur at the beginning of a sentence).
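
The explanation above can be sketched as a small helper with the pattern spelled out piece by piece. This is a hedged illustration, not a general de-duplication tool: the names DUPLICATE_PHRASE and remove_duplicate_phrase are hypothetical, and the extra test strings are made up to show how the pattern behaves.

```python
import re

# The same pattern, written with re.VERBOSE so each piece can carry a comment.
DUPLICATE_PHRASE = re.compile(
    r"""
    (                 # group 1: the phrase that is kept
      (\b\w+\b        #   a whole word
       .{1,2}         #   one or two separator characters
       \w+\b          #   up to the end of the next word
      )+              #   in practice this matches once: a repeat would need a
                      #   word character immediately after the end of a word
    )
    .+                # at least one character between the two copies (greedy)
    \1                # the same phrase again
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)


def remove_duplicate_phrase(text: str) -> str:
    """Replace 'phrase ... phrase' with a single 'phrase' (hypothetical helper)."""
    return DUPLICATE_PHRASE.sub(r"\1", text)


s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
print(remove_duplicate_phrase(s))
# I hate *some* kinds of duplicate. This string has a duplicate phrase.

# Case is ignored when comparing the two copies; the first copy's casing is kept.
print(remove_duplicate_phrase('Duplicate phrase, duplicate phrase.'))
# Duplicate phrase.

# Caveat: everything between the two copies is removed as well.
print(remove_duplicate_phrase('red car, blue bike, red car'))
# red car
```

The last call shows why this approach fits best when the two copies sit right next to each other, as in the example string: any text between them is part of the match and disappears with the second copy.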
