Remove substrings inside a list with better than O(n^2) complexity

清酒与你 2021-02-02 18:21

I have a list with many words (100,000+), and what I'd like to do is remove every word that is a substring of another word in the list.

So for simplicity, let's imagine that I have

4 Answers
  • 2021-02-02 18:39

    Build the set of all (unique) substrings first, then filter the words with it:

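    def substrings(s):
        """Return every proper substring of s (s itself is excluded)."""
        length = len(s)
        return {s[i:j + 1] for i in range(length) for j in range(i, length)} - {s}


    def remove_substrings(words):
        # Collect the proper substrings of every word in one set.
        subs = set()
        for word in words:
            subs |= substrings(word)

        # Keep only the words that never appear as a substring of another word.
        return {w for w in words if w not in subs}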
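The set of all substrings grows quadratically with the length of each word, which may matter for 100,000+ words. One possible refinement, offered only as a sketch and not part of the answer above, is to generate only substrings whose length matches the length of some word in the list, since no other substring can ever match a word. The function name remove_substrings_by_length is made up for this illustration.

```python
def remove_substrings_by_length(words):
    """Keep only the words that are not a proper substring of another word.

    Hypothetical variant: substrings are generated only for lengths that
    actually occur among the input words.
    """
    unique = set(words)
    lengths = sorted({len(w) for w in unique})
    subs = set()
    for word in unique:
        n = len(word)
        for size in lengths:
            if size >= n:
                break  # only strictly shorter substrings are relevant
            for start in range(n - size + 1):
                subs.add(word[start:start + size])
    return {w for w in unique if w not in subs}


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
result = remove_substrings_by_length(words)
print(sorted(result))
```

The result is a set, so the original order of the words is not preserved.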
    
  • 2021-02-02 18:52

    You can sort your data by length, and then use a list comprehension:

    words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
    new_words = sorted(words, key=len, reverse=True)
    final_results = [a for i, a in enumerate(new_words) if not any(a in c for c in new_words[:i])]
    

    Output:

    ['Banana', 'Hello', 'Apple', 'Peter']
    
  • 2021-02-02 18:58

    Note that explicit for loops are generally slow in Python (you could use NumPy arrays or an NLP package instead). Aside from that, how about this:

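    words = list(set(words))  # eliminate duplicates
    str_words = str(words)  # one string that contains every word
    r = []
    for x in words:
        # If x is found at two different positions, it also occurs inside another word.
        if str_words.find(x) == str_words.rfind(x):
            r.append(x)
    print(r)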
    

    As an aside while I am answering here: I don't see a reason why C++ wouldn't be an option.

  • 2021-02-02 19:01

    @wim is correct.

    Given an alphabet of fixed size, the following algorithm is linear in the overall length of the text. If the alphabet is of unbounded size, it is O(n log(n)) instead. Either way, it is better than O(n^2).

    Create an empty suffix tree T.
    Create an empty list filtered_words.
    For word in words (longest first, so that any word containing it is already in T):
        if word not in T:
            Build suffix tree S for word (using Ukkonen's algorithm)
            Merge S into T
            Append word to filtered_words
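
The pseudocode can be made concrete roughly as follows. This is only a sketch under a big simplification: instead of Ukkonen's suffix tree it uses a naive suffix trie built from nested dicts, so time and memory are quadratic in the length of each word rather than linear. It also visits the words longest first, which the containment check relies on. The names remove_substrings_trie and contains are made up for this illustration.

```python
def contains(trie, word):
    """Return True if `word` can be spelled along a path from the trie root."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return True


def remove_substrings_trie(words):
    """Keep the words that are not substrings of a longer (or equal) kept word."""
    trie = {}  # nested dicts; every root path spells a substring of a kept word
    filtered_words = []
    # Longest first, so any word that could contain `word` is already in the trie.
    for word in sorted(words, key=len, reverse=True):
        if contains(trie, word):
            continue
        # Insert every suffix of the word; all of its substrings become root paths.
        for start in range(len(word)):
            node = trie
            for ch in word[start:]:
                node = node.setdefault(ch, {})
        filtered_words.append(word)
    return filtered_words


words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']
result = remove_substrings_trie(words)
print(result)
```

Because the number of lookups and insertions depends only on the total length of the words, the work no longer grows with the square of the number of words, which is the property the answer is after. A real suffix tree would additionally remove the quadratic dependence on individual word length.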
    