Speed up millions of regex replacements in Python 3

醉酒成梦 2020-11-22 05:44

I'm using Python 3.5.2

I have two lists

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from my 750,000 sentences
9 Answers
  • 2020-11-22 06:09

    Perhaps Python is not the right tool here. Here is one approach using the Unix toolchain:

    sed G file         |
    tr ' ' '\n'        |
    grep -vf blacklist |
    awk -v RS= -v OFS=' ' '{$1=$1}1'
    

    assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence into one word per line, mass-delete the blacklisted words from the file, and merge the lines back.
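
    For example (hypothetical data): with \bquick\b in the blacklist, the input line "the quick brown fox" comes out of the pipeline as "the brown fox".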

    This should run at least an order of magnitude faster.

    To preprocess the blacklist file from the words (one word per line):

    sed 's/.*/\\b&\\b/' words > blacklist
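
    With hypothetical words foo and bar in the words file, the generated blacklist would contain:

    \bfoo\b
    \bbar\b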
    
  • 2020-11-22 06:12

    How about this:

    #!/usr/bin/env python3
    
    from __future__ import unicode_literals, print_function
    import re
    import time
    import io
    
    def replace_sentences_1(sentences, banned_words):
        # faster on CPython, but does not use \b as the word separator
        # so result is slightly different than replace_sentences_2()
        def filter_sentence(sentence):
            words = WORD_SPLITTER.split(sentence)
            words_iter = iter(words)
            for word in words_iter:
                norm_word = word.lower()
                if norm_word not in banned_words:
                    yield word
                # default '' so StopIteration cannot escape the generator (PEP 479)
                yield next(words_iter, '') # yield the word separator
    
        WORD_SPLITTER = re.compile(r'(\W+)')
        banned_words = set(banned_words)
        for sentence in sentences:
            yield ''.join(filter_sentence(sentence))
    
    
    def replace_sentences_2(sentences, banned_words):
        # slower on CPython, uses \b as separator
        def filter_sentence(sentence):
            boundaries = WORD_BOUNDARY.finditer(sentence)
            current_boundary = 0
            try:
                while True:
                    last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                    yield sentence[last_word_boundary:current_boundary] # yield the separators
                    last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                    word = sentence[last_word_boundary:current_boundary]
                    norm_word = word.lower()
                    if norm_word not in banned_words:
                        yield word
            except StopIteration:
                # PEP 479: StopIteration must not escape a generator
                return
    
        WORD_BOUNDARY = re.compile(r'\b')
        banned_words = set(banned_words)
        for sentence in sentences:
            yield ''.join(filter_sentence(sentence))
    
    
    corpus = io.open('corpus2.txt').read()
    banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
    sentences = corpus.split('. ')
    output = io.open('output.txt', 'wb')
    print('number of sentences:', len(sentences))
    start = time.time()
    for sentence in replace_sentences_1(sentences, banned_words):
        output.write(sentence.encode('utf-8'))
        output.write(b' .')
    print('time:', time.time() - start)
    

    These solutions split the text on word boundaries and look up each word in a set. They should be faster than re.sub with a union of word alternates (Liteye's solution): these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while a regex of alternates forces the regex engine to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace used in the original text (i.e. they don't compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide that you don't care about that, it should be fairly straightforward to strip it from the output.

    I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from Project Gutenberg, and banned_words.txt contains 20,000 words randomly picked from Ubuntu's word list (/usr/share/dict/american-english). It takes around 30 seconds to process 862,462 sentences (and half of that on PyPy). I've defined sentences as anything separated by ". ".

    $ # replace_sentences_1()
    $ python3 filter_words.py 
    number of sentences: 862462
    time: 24.46173644065857
    $ pypy filter_words.py 
    number of sentences: 862462
    time: 15.9370770454
    
    $ # replace_sentences_2()
    $ python3 filter_words.py 
    number of sentences: 862462
    time: 40.2742919921875
    $ pypy filter_words.py 
    number of sentences: 862462
    time: 13.1190629005
    

    PyPy benefits more from the second approach, while CPython fares better with the first. The above code should work on both Python 2 and 3.

  • 2020-11-22 06:13

    TLDR

    Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP's, it's approximately 2000 times faster than the accepted answer.

    If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

    Theory

    If your sentences aren't humongous strings, it's probably feasible to process many more than 50 per second.

    If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

    Pack the logic into a function, pass this function as the repl argument to re.sub, and you're done!

    Code

    import re
    with open('/usr/share/dict/american-english') as wordbook:
        banned_words = set(word.strip().lower() for word in wordbook)
    
    
    def delete_banned_words(matchobj):
        word = matchobj.group(0)
        if word.lower() in banned_words:
            return ""
        else:
            return word
    
    sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
                 "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000
    
    word_pattern = re.compile(r'\w+')
    
    for sentence in sentences:
        sentence = word_pattern.sub(delete_banned_words, sentence)
    

    Converted sentences are:

    ' .  !
      .
    GiraffeElephantBoat
    sfgsdg sdwerha aswertwe
    

    Note that:

    • the search is case-insensitive (thanks to lower())
    • replacing a word with "" might leave two spaces (as in your code)
    • With Python 3, \w+ also matches accented characters (e.g. "ångström").
    • Any non-word character (tab, space, newline, marks, ...) will stay untouched.

    Performance

    There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

    In comparison, Liteye's answer needed 160s for 10 thousand sentences.

    With n being the total number of words and m the number of banned words, the OP's and Liteye's code are O(n*m).

    In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

    Regex union test

    What's the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

    It's pretty hard to grasp the way the regex engine works, so let's write a simple test.

    This code extracts 10**i random English words into a list, creates the corresponding regex union, and tests it against different words:

    • one is clearly not a word (it begins with #)
    • one is the first word in the list
    • one is the last word in the list
    • one looks like a word but isn't


    import re
    import timeit
    import random
    
    with open('/usr/share/dict/american-english') as wordbook:
        english_words = [word.strip().lower() for word in wordbook]
        random.shuffle(english_words)
    
    print("First 10 words :")
    print(english_words[:10])
    
    test_words = [
        ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
        ("First word", english_words[0]),
        ("Last word", english_words[-1]),
        ("Almost a word", "couldbeaword")
    ]
    
    
    def find(word):
        def fun():
            return union.match(word)
        return fun
    
    for exp in range(1, 6):
        print("\nUnion of %d words" % 10**exp)
        union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
        for description, test_word in test_words:
            time = timeit.timeit(find(test_word), number=1000) * 1000
            print("  %-17s : %.1fms" % (description, time))
    

    It outputs:

    First 10 words :
    ["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
    
    Union of 10 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 0.7ms
      Almost a word     : 0.7ms
    
    Union of 100 words
      Surely not a word : 0.7ms
      First word        : 1.1ms
      Last word         : 1.2ms
      Almost a word     : 1.2ms
    
    Union of 1000 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 9.6ms
      Almost a word     : 10.1ms
    
    Union of 10000 words
      Surely not a word : 1.4ms
      First word        : 1.8ms
      Last word         : 96.3ms
      Almost a word     : 116.6ms
    
    Union of 100000 words
      Surely not a word : 0.7ms
      First word        : 0.8ms
      Last word         : 1227.1ms
      Almost a word     : 1404.1ms
    

    So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

    • O(1) best case
    • O(n/2) average case, which is still O(n)
    • O(n) worst case

    These results are consistent with a simple loop search.

    A much faster alternative to a regex union is to create the regex pattern from a trie.
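
    That trie-based pattern is not reproduced here, but the idea can be sketched roughly as follows (an illustration of the technique under my own assumptions, not the exact code referenced above): build a character trie from the banned words, then serialize it into nested (?:...) groups so common prefixes are shared and the engine never scans one long flat alternation.

    import re

    def trie_regex(words):
        # Build a character trie; the empty-string key marks "end of word".
        trie = {}
        for word in words:
            node = trie
            for char in word:
                node = node.setdefault(char, {})
            node[''] = True

        def to_pattern(node):
            # Leaf with no continuation: nothing more to match.
            if node.get('') and len(node) == 1:
                return ''
            alternatives = []
            optional = False
            for char, child in sorted(node.items()):
                if char == '':
                    optional = True  # a word ends here, but longer words continue
                else:
                    alternatives.append(re.escape(char) + to_pattern(child))
            if len(alternatives) == 1 and not optional:
                return alternatives[0]
            pattern = '(?:' + '|'.join(alternatives) + ')'
            return pattern + '?' if optional else pattern

        return re.compile(r'\b' + to_pattern(trie) + r'\b', re.IGNORECASE)

    # Usage:
    # banned = trie_regex(['foo', 'foobar', 'baz'])   # -> \b(?:baz|foo(?:bar)?)\b
    # banned.sub('', 'foo foobar qux baz')            # -> '  qux '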

  • 2020-11-22 06:13

    Well, here's a quick and easy solution, with a test set.

    Winning strategy:

    re.sub("\w+",repl,sentence) searches for words.

    "repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

    This is the simplest and fastest solution (see function replace4 in example code below).

    Second best

    The idea is to split the sentences into words, using re.split, while keeping the separators so the sentences can be reconstructed later. Then, replacements are done with a simple dict lookup.

    (see function replace3 in example code below).

    Timings for example functions:

    replace1: 0.62 sentences/s
    replace2: 7.43 sentences/s
    replace3: 48498.03 sentences/s
    replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)
    

    ...and code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    import time, random, re
    
    def replace1( sentences ):
        for n, sentence in enumerate( sentences ):
            for search, repl in patterns:
                sentence = re.sub( "\\b"+search+"\\b", repl, sentence )
    
    def replace2( sentences ):
        for n, sentence in enumerate( sentences ):
            for search, repl in patterns_comp:
                sentence = re.sub( search, repl, sentence )
    
    def replace3( sentences ):
        pd = patterns_dict.get
        for n, sentence in enumerate( sentences ):
            #~ print( n, sentence )
            # Split the sentence on non-word characters.
            # Note: () in split patterns ensure the non-word characters ARE kept
            # and returned in the result list, so we don't mangle the sentence.
            # If ALL separators are spaces, use string.split instead or something.
            # Example:
            #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
            #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
            words = re.split(r"([^\w]+)", sentence)
    
            # and... done.
            sentence = "".join( pd(w,w) for w in words )
    
            #~ print( n, sentence )
    
    def replace4( sentences ):
        pd = patterns_dict.get
        def repl(m):
            w = m.group()
            return pd(w,w)
    
        for n, sentence in enumerate( sentences ):
            sentence = re.sub(r"\w+", repl, sentence)
    
    
    
    # Build test set
    test_words = [ ("word%d" % _) for _ in range(50000) ]
    test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]
    
    # Create search and replace patterns
    patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
    patterns_dict = dict( patterns )
    patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]
    
    
    def test( func, num ):
        t = time.time()
        func( test_sentences[:num] )
        print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))
    
    print( "Sentences", len(test_sentences) )
    print( "Words    ", len(test_words) )
    
    test( replace1, 1 )
    test( replace2, 10 )
    test( replace3, 1000 )
    test( replace4, 1000 )
    

    Edit: you can also make the check case-insensitive by keeping the pattern keys lowercase and lowercasing the matched word in repl:

    def replace4( sentences ):
        pd = patterns_dict.get
        def repl(m):
            w = m.group()
            return pd(w.lower(),w)
    
  • 2020-11-22 06:23

    Practical approach

    The solution described below uses a lot of memory, since it stores all the text in a single string, in order to reduce the complexity. If RAM is an issue, think twice before using it.

    With join/split tricks you can avoid explicit loops entirely, which should speed up the algorithm:

  • Concatenate the sentences with a special delimiter that is not contained in any of the sentences:

    merged_sentences = ' * '.join(sentences)

  • Compile a single regex for all the words you need to remove from the sentences, using the | ("or") regex operator:

    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • Substitute the words with the compiled regex and split the result on the special delimiter back into separate sentences (the three steps are combined in the sketch after this list):

    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
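
    Putting the three steps together (a minimal sketch; the sentences and words lists below are placeholders, and re.escape is added in case a banned word contains regex metacharacters):

    import re

    sentences = ["I like bananas", "the quick brown fox", "hello world"]  # placeholder data
    words = ["bananas", "quick", "hello"]                                 # placeholder banned words

    merged_sentences = ' * '.join(sentences)
    regex = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, words))), re.I)
    clean_sentences = regex.sub("", merged_sentences).split(' * ')
    print(clean_sentences)  # ['I like ', 'the  brown fox', ' world']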
    

    Performance

    "".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);
    

    Therefore, with join/split you have O(words) + 2*O(sentences), which is still linear complexity, versus the 2*O(N^2) of the initial approach.


    By the way, don't use multithreading. The GIL will serialize everything because your task is strictly CPU-bound, so the GIL has no chance to be released; each thread just contends for it, which adds overhead and can slow the operation down dramatically.

  • 2020-11-22 06:24

    One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

    Because re relies on C code to do the actual matching, the savings can be dramatic.

    As @pvg pointed out in the comments, it also benefits from single pass matching.

    If your words are not regexes, Eric's answer is faster.
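
    A minimal sketch of the single-pattern idea (the word lists below are placeholders; re.escape guards against words that happen to contain regex metacharacters):

    import re

    banned_words = ["word1", "word2", "word3"]  # placeholder list
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, banned_words)) + r')\b')

    sentences = ["word1 stays, word2 goes", "nothing to remove here"]
    cleaned = [pattern.sub("", s) for s in sentences]
    # cleaned == [' stays,  goes', 'nothing to remove here']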
