Most efficient way to remove multiple substrings from string?

[愿得一人] 2021-01-01 18:42

What's the most efficient method to remove a list of substrings from a string?

I'd like a cleaner, quicker way to do the following:

words = 'word1


        
1 Answer
  • 2021-01-01 19:03

    Regex:

    >>> import re
    >>> re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
    ' word2  word4, '
    

    The above one-liner is actually not as fast as your string.replace version, but definitely shorter:

    >>> import hashlib, random
    >>> words = ' '.join([hashlib.sha1(str(random.random()).encode()).hexdigest()[:10] for _ in range(10000)])
    >>> replace_list = words.split()[:1000]
    >>> random.shuffle(replace_list)
    >>> %timeit remove_multiple_strings(words, replace_list)
    10 loops, best of 3: 49.4 ms per loop
    >>> %timeit re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
    1 loops, best of 3: 623 ms per loop
    

    Gosh! More than 12x slower (623 ms vs. 49.4 ms).
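
To try the comparison on Python 3, here is a minimal sketch of both approaches. The question's `remove_multiple_strings` is cut off in this copy, so the version below is a hypothetical stand-in (a plain loop over `str.replace`), and `remove_with_alternation` is just an illustrative name for the one-liner. The sample `words` and `replace_list` are made up, chosen so that the result matches the output shown earlier.

```python
import re


def remove_multiple_strings(text, to_remove):
    # Hypothetical stand-in for the question's function: one str.replace
    # pass over the whole text per substring.
    for item in to_remove:
        text = text.replace(item, '')
    return text


def remove_with_alternation(text, to_remove):
    # The regex one-liner: a single pattern that matches any of the
    # (escaped) substrings, each match replaced with the empty string.
    pattern = '|'.join(map(re.escape, to_remove))
    return re.sub(pattern, '', text)


# Made-up sample data, not the question's original input.
words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']

print(repr(remove_multiple_strings(words, replace_list)))
print(repr(remove_with_alternation(words, replace_list)))
```

Both calls return the same string here. They can differ when one entry in the list is a substring of another, because the loop applies the replacements one after another while the regex picks a single alternative at each position.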

    But can we improve it? Yes.

    Since we are only concerned with whole words, we can match each word in the words string with \w+ and check it against a set built from replace_list (yes, an actual set: set(replace_list)):

    >>> def sub(m):
    ...     return '' if m.group() in s else m.group()
    ...
    >>> %%timeit
    ... s = set(replace_list)
    ... re.sub(r'\w+', sub, words)
    ...
    100 loops, best of 3: 7.8 ms per loop
    

    For even larger strings and word lists, the string.replace approach and my first solution will end up taking quadratic time, but the set-based solution should run in linear time.
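
The set-based idea can be wrapped into a self-contained Python 3 function. This is only a sketch: the name `remove_words` and the sample inputs are illustrative assumptions, not part of the original answer. Note that it removes whole `\w+` tokens only, so unlike the `str.replace` loop it leaves a listed string alone when that string occurs inside a longer word.

```python
import re


def remove_words(text, to_remove):
    # Build the set once so each membership test is O(1) on average.
    blocked = set(to_remove)

    def sub(match):
        # Drop the token if it is listed, otherwise keep it unchanged.
        token = match.group()
        return '' if token in blocked else token

    # Single pass over the text: every \w+ token is looked up in the set.
    return re.sub(r'\w+', sub, text)


# Made-up sample inputs for illustration.
print(repr(remove_words('word1 word2 word3 word4, word5',
                        ['word1', 'word3', 'word5'])))
print(repr(remove_words('word10 word1', ['word1'])))
```

The second call shows the whole-word behaviour: `word10` is kept because the token as a whole is not in the set, even though it starts with `word1`.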
