How to remove duplicate phrases in Python?

一个人的身影

Suppose I have a string such as

'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'

I want to remov

1 answer

旧时难觅i

2021-01-18 12:06
Thanks everyone for your attempts and comments. I have finally found a solution:
```
import re

s = 'I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase.'
result = re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags=re.I)
print(result)
# I hate *some* kinds of duplicate. This string has a duplicate phrase.
```
Explanation

The regular expression
```
r'((\b\w+\b.{1,2}\w+\b)+).+\1'
```
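matches a phrase made of two words (runs of `\w` characters, i.e. letters, digits, and underscores) separated by one or two arbitrary characters, which covers words separated by a period or comma plus a space as well as by a single space. The phrase must then appear again later on the same line, after at least one arbitrary character. Although the inner group is followed by `+`, a second repetition cannot begin immediately after a word boundary, so in practice the phrase is a single pair of words. Then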
```
re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', s, flags = re.I)
```
replaces each such match with the first copy of the phrase (group 1), ignoring case because the duplicate phrase could sometimes occur at the beginning of a sentence.
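For reuse, the same pattern can be compiled once with `re.VERBOSE`, which lets each part carry an inline comment. The following is a minimal sketch rather than part of the original answer: the names `DUPLICATE_PHRASE` and `remove_duplicate_phrases` are hypothetical, and the second sample string is made up for illustration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
# DUPLICATE_PHRASE and remove_duplicate_phrases are illustrative names only.
DUPLICATE_PHRASE = re.compile(
    r"""
    (                # group 1: the phrase to keep
      (              # group 2: one pair of words
        \b\w+\b      #   first word
        .{1,2}       #   one or two separator characters
        \w+\b        #   second word
      )+
    )
    .+               # at least one character between the two copies
    \1               # the same phrase again (case-insensitive via re.IGNORECASE)
    """,
    re.IGNORECASE | re.VERBOSE,
)


def remove_duplicate_phrases(text: str) -> str:
    """Replace 'phrase ... phrase' with a single copy of the phrase."""
    return DUPLICATE_PHRASE.sub(r"\1", text)


s = "I hate *some* kinds of duplicate. This string has a duplicate phrase, duplicate phrase."
print(remove_duplicate_phrases(s))
# I hate *some* kinds of duplicate. This string has a duplicate phrase.

# Made-up example: the copies differ in case, and the text between them is dropped too.
print(remove_duplicate_phrases("Big dog here. big dog there."))
# Big dog there.
```

Because `.+` is part of the match but not of the replacement, whatever sits between the two copies is removed along with the second copy, so this approach probably fits best when the duplicates sit close together.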