Speed up millions of regex replacements in Python 3


I'm using Python 3.5.2

I have two lists

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my sentences
9 Answers

    How about this:

    #!/usr/bin/env python3
    
    from __future__ import unicode_literals, print_function
    import re
    import time
    import io
    
    def replace_sentences_1(sentences, banned_words):
        # faster on CPython, but does not use \b as the word separator
        # so result is slightly different than replace_sentences_2()
        def filter_sentence(sentence):
            words = WORD_SPLITTER.split(sentence)
            words_iter = iter(words)
            for word in words_iter:
                norm_word = word.lower()
                if norm_word not in banned_words:
                    yield word
                yield next(words_iter, '')  # yield the word separator ('' after the last word; avoids StopIteration on Python 3.7+)
    
        WORD_SPLITTER = re.compile(r'(\W+)')
        banned_words = set(banned_words)
        for sentence in sentences:
            yield ''.join(filter_sentence(sentence))
    
    
    def replace_sentences_2(sentences, banned_words):
        # slower on CPython, uses \b as separator
        def filter_sentence(sentence):
            boundaries = WORD_BOUNDARY.finditer(sentence)
            current_boundary = 0
            try:
                while True:
                    last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                    yield sentence[last_word_boundary:current_boundary] # yield the separators
                    last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                    word = sentence[last_word_boundary:current_boundary]
                    norm_word = word.lower()
                    if norm_word not in banned_words:
                        yield word
            except StopIteration:
                # no more word boundaries: yield the trailing separator (if any)
                # and stop; also avoids RuntimeError under PEP 479 on Python 3.7+
                yield sentence[current_boundary:]
    
        WORD_BOUNDARY = re.compile(r'\b')
        banned_words = set(banned_words)
        for sentence in sentences:
            yield ''.join(filter_sentence(sentence))
    
    
    corpus = io.open('corpus2.txt').read()
    banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
    sentences = corpus.split('. ')
    output = io.open('output.txt', 'wb')
    print('number of sentences:', len(sentences))
    start = time.time()
    for sentence in replace_sentences_1(sentences, banned_words):
        output.write(sentence.encode('utf-8'))
        output.write(b' .')
    print('time:', time.time() - start)
    

    These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with an alternation of words (Liteyes' solution): these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookups, whereas regex alternation forces the regex engine to check for word matches at every character rather than only at word boundaries; a rough sketch of that alternation approach is shown below for comparison.

    My solutions also take extra care to preserve the whitespace used in the original text (i.e. they don't compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide you don't care about that, it should be fairly straightforward to strip it from the output.
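
    The alternation-based approach that the comparison refers to would look roughly like the sketch below. This is only my reconstruction of the idea, not Liteyes' exact code; the function name is made up for illustration.

    # Rough sketch (assumed, not Liteyes' exact code): one big
    # \b(?:word1|word2|...)\b pattern that re.sub applies to each sentence.
    import re

    def replace_sentences_alternation(sentences, banned_words):
        pattern = re.compile(
            r'\b(?:' + '|'.join(re.escape(w) for w in banned_words) + r')\b',
            re.IGNORECASE)
        for sentence in sentences:
            yield pattern.sub('', sentence)

    With 20,000 alternates the compiled pattern is huge, and the engine has to consider a possible match starting at every position in every sentence, which is where the slowdown comes from.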

    I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from Project Gutenberg, and banned_words.txt is 20,000 words randomly picked from Ubuntu's word list (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".
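
    If you want to reproduce that setup, the banned word list can be generated along these lines; the dictionary path and the sample size are assumptions based on the description above.

    # Assumed way to build banned_words.txt from the Ubuntu word list:
    import random

    with open('/usr/share/dict/american-english') as f:
        dictionary_words = [line.strip() for line in f if line.strip()]

    with open('banned_words.txt', 'w') as f:
        f.write('\n'.join(random.sample(dictionary_words, 20000)))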

    $ # replace_sentences_1()
    $ python3 filter_words.py 
    number of sentences: 862462
    time: 24.46173644065857
    $ pypy filter_words.py 
    number of sentences: 862462
    time: 15.9370770454
    
    $ # replace_sentences_2()
    $ python3 filter_words.py 
    number of sentences: 862462
    time: 40.2742919921875
    $ pypy filter_words.py 
    number of sentences: 862462
    time: 13.1190629005
    

    PyPy in particular benefits more from the second approach, while CPython fares better on the first. The above code should work on both Python 2 and 3.
