Is there a way to remove duplicate and continuous words/phrases in a string?

广开言路 2021-01-13 11:27

Is there a way to remove duplicate and continuous words/phrases in a string? E.g.

[in]: foo foo bar bar foo bar

6 Answers
  • 2021-01-13 11:56

    You can use the re module for that.

    >>> import re
    >>> s = 'foo foo bar bar'
    >>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
    'foo bar'
    
    >>> s = 'foo bar foo bar foo bar'
    >>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
    'foo bar foo bar'
    

    If you want to match any number of consecutive occurrences:

    >>> s = 'foo bar foo bar foo bar'
    >>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
    'foo bar'    
    

    Edit: an addition for your last example. To handle that case you have to keep calling re.sub as long as duplicate phrases remain. So:

    >>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
    >>> while re.search(r'\b(.+)(\s+\1\b)+', s):
    ...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
    ...
    >>> s
    'this is a sentence where phrases duplicate'
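
    A small refinement (my own sketch, not part of the original answer): compiling the pattern once avoids re-parsing the regex on every pass of the loop; the behaviour and output are the same as above.

    >>> import re
    >>> dup_pattern = re.compile(r'\b(.+)(\s+\1\b)+')
    >>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
    >>> while dup_pattern.search(s):
    ...   s = dup_pattern.sub(r'\1', s)
    ...
    >>> s
    'this is a sentence where phrases duplicate'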
    
  • 2021-01-13 12:00

    Personally, I do not think we need any other modules for this (although I admit some of them are GREAT). I managed it with a simple loop, after first converting the string into a list. I tried it on all the examples listed above and it works fine.

    sentence = input("Please enter your sentence:\n")
    
    word_list = sentence.split()
    
    def check_if_same(i, j):  # checks whether two adjacent slices of word_list are identical
        end = (2 * j) - i  # end point of the second slice (essentially j + phrase_len)
        is_same = word_list[i:j] == word_list[j:end]
        # Debugging aid: show the two slices being compared and the result
        # print("Comparing:", word_list[i:j], word_list[j:end], is_same)
        return is_same
    
    phrase_len = 1
    
    while phrase_len <= len(word_list) // 2:  # check the sentence for increasing phrase lengths
    
        curr_word_index = 0
    
        while curr_word_index < len(word_list):  # scan every position for the current phrase length
    
            if check_if_same(curr_word_index, curr_word_index + phrase_len):  # repeated phrase found
                del word_list[curr_word_index:curr_word_index + phrase_len]  # delete the repeated phrase
            else:
                curr_word_index += 1
    
        phrase_len += 1
    
    print("Answer: " + ' '.join(word_list))
    
  • 2021-01-13 12:01

    I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes an iterable and groups runs of repeated, consecutive items into pairs of (item_value, iterator_of_those_values). Use it here like:

    >>> from itertools import groupby
    >>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
    >>> ' '.join(item[0] for item in groupby(s.split()))
    'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
    

    So let's extend that a little with a function that returns a list with its repeated, consecutive values removed:

    from itertools import chain, groupby
    
    def dedupe(lst):
        return list(chain(*[item[0] for item in groupby(lst)]))
    

    That's great for one-word phrases, but not helpful for longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over our original list of words:

    def stride(lst, offset, length):
        if offset:
            yield lst[:offset]
    
        while True:
            yield lst[offset:offset + length]
            offset += length
            if offset >= len(lst):
                return
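
    For instance (my own illustration), striding over a six-word list with offset 1 and phrase length 2 yields the leading remainder first, then two-word chunks:

    >>> list(stride('a b c d e f'.split(), 1, 2))
    [['a'], ['b', 'c'], ['d', 'e'], ['f']]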
    

    Now we're cooking! OK. So our strategy here is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've hit five-word duplicates:

    def cleanse(list_of_words, max_phrase_length):
        for length in range(1, max_phrase_length + 1):
            for offset in range(length):
                list_of_words = dedupe(stride(list_of_words, offset, length))
    
        return list_of_words
    

    Putting it all together:

    from itertools import chain, groupby
    
    def stride(lst, offset, length):
        if offset:
            yield lst[:offset]
    
        while True:
            yield lst[offset:offset + length]
            offset += length
            if offset >= len(lst):
                return
    
    def dedupe(lst):
        return list(chain(*[item[0] for item in groupby(lst)]))
    
    def cleanse(list_of_words, max_phrase_length):
        for length in range(1, max_phrase_length + 1):
            for offset in range(length):
                list_of_words = dedupe(stride(list_of_words, offset, length))
    
        return list_of_words
    
    a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'
    
    b = 'this is a sentence where phrases duplicate . sentence are not prhases .'
    
    print(' '.join(cleanse(a.split(), 5)) == b)
    
  • 2021-01-13 12:04

    With a pattern similar to sharcashmo's, you can use subn, which returns the number of replacements made, inside a while loop:

    import re
    
    txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'
    
    pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
    repl = r'\1'
    
    res = txt
    
    while True:
        res, nbr = pattern.subn(repl, res)
        if nbr == 0:
            break
    
    print(res)
    

    When there are no more replacements, the while loop stops.

    With this method you can catch overlapping matches (which is impossible to do in a single replacement pass), without testing the same pattern twice.
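
    For reference, subn returns the new string together with the number of substitutions made, which is what drives the loop's exit test:

    >>> import re
    >>> re.subn(r'a', 'b', 'banana')
    ('bbnbnb', 3)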

  • 2021-01-13 12:18
    txt1 = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
    txt2 =  'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
    
    def remove_duplicate_words(txt):
        result = []
        for word in txt.split():
            if word not in result:
                result.append(word)
        return ' '.join(result)
    

    Output:

    In [7]: remove_duplicate_words(txt1)                                                                                                                                  
    Out[7]: 'this is a foo bar black sheep , have you any wool woo yes sir three bag wu'                                                                                  
    
    In [8]: remove_duplicate_words(txt2)                                                                                                                                 
    Out[8]: 'this is a sentence where phrases duplicate' 
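
    The same "first occurrence wins" behaviour can also be written more compactly with dict.fromkeys, which preserves insertion order in Python 3.7+ (a sketch of my own, not part of the original answer):

    def remove_duplicate_words_compact(txt):
        # dict.fromkeys keeps only the first occurrence of each word, in order
        return ' '.join(dict.fromkeys(txt.split()))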
    
  • 2021-01-13 12:18

    This should fix any number of adjacent duplicates, and works with both of your examples. I convert the string to a list, fix it, then convert back to a string for output:

    mywords = "foo foo bar bar foo bar"
    words = mywords.split()
    
    def remove_adjacent_dups(alist):
        result = []
        most_recent_elem = None
        for e in alist:
            if e != most_recent_elem:  # keep the word only if it differs from the one before it
                result.append(e)
                most_recent_elem = e
        to_string = ' '.join(result)
        return to_string
    
    print(remove_adjacent_dups(words))
    

    Output:

    foo bar foo bar
    