Efficiently make many multiple substitutions in a string

闹比i 2021-01-07 01:32

People have addressed before how to make multiple substitutions in a string based on a dictionary (see, for example). There seems to be a group of options based on str

1 answer
  • 2021-01-07 01:56

    As stated before, there are different approaches, each with different advantages. I am using three different situations for comparison.

    1. Short dictionary (847 substitution pairs)
    2. Medium dictionary (2528 pairs)
    3. Long dictionary (80430 pairs)

    For dictionaries 1 and 2 (the shorter ones) I repeat each method 50 times in a loop to get more consistent timings. With the longer one, a single pass over one document already takes long enough (sadly). I tested 1 and 2 using the online service tio with Python 3.8. The long one was tested on my laptop with Python 3.6. Only the relative performance between methods is relevant, so these minor specifics are not important.

    My string is between 28k and 29k characters.

    All times given in seconds.
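
The timing setup described above might look roughly like the following sketch. The helper names `time_method` and `replace_loop` are hypothetical, and copying the dictionary on every repetition is an assumption of this sketch, not something the original benchmark states. The copy only matters for methods that modify the dictionary they receive.

```python
import time


def time_method(method, text, mapping, repeats=50):
    """Return the total seconds taken by `repeats` runs of method(text, mapping).

    Hypothetical helper: each run receives a fresh copy of the dictionary,
    so a method that consumes its input cannot make later runs trivially fast.
    The (small) cost of the copy is included in the measurement.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        method(text, dict(mapping))
    return time.perf_counter() - start


def replace_loop(text, mapping):
    # Plain sequential str.replace, one call per dictionary entry.
    for original, replacement in mapping.items():
        text = text.replace(original, replacement)
    return text


elapsed = time_method(replace_loop, "a cat and a dog " * 100,
                      {"cat": "bird", "dog": "fish"})
print(f"{elapsed:.4f} s")
```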


    UPDATE: Flashtext

    A colleague found Flashtext, a Python library that specializes in precisely this. It supports both searching for keywords and applying substitutions. It is about two orders of magnitude faster than the other alternatives: in experiment 3 my best time so far was 1.8 seconds, while Flashtext takes 0.015 seconds.
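
A minimal sketch of how a substitution dictionary could be fed to Flashtext, assuming the third-party package is installed (`pip install flashtext`). The function name `replace_with_flashtext` is hypothetical; `KeywordProcessor`, `add_keyword` and `replace_keywords` are the library calls as I understand them.

```python
def replace_with_flashtext(text, mapping):
    """Apply every original -> replacement pair in `mapping` using Flashtext."""
    # Imported inside the function so the sketch can be defined even on a
    # machine where flashtext is not installed.
    from flashtext import KeywordProcessor

    # case_sensitive=True is an assumption of this sketch; the library
    # default ignores case.
    processor = KeywordProcessor(case_sensitive=True)
    for original, replacement in mapping.items():
        processor.add_keyword(original, replacement)
    return processor.replace_keywords(text)
```

As far as I know, Flashtext only matches keywords that are delimited by non-word characters, so it should behave more like the word-boundary variants discussed further down than like a plain `str.replace`.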


    Regular Expressions

    There are many variations, but the best tend to be very similar to this:

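    import re
    # Escape every key so it is matched literally, not as a regex.
    rep = dict((re.escape(k), v) for k, v in my_dict.items())
    # One alternation of all keys; sub() then makes a single pass over the string.
    pattern = re.compile("|".join(rep.keys()))
    new_string = pattern.sub(lambda m: rep[re.escape(m.group(0))], string)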
    

    Execution times were:

    1. 1.63
    2. 5.03
    3. 7.7
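
One detail the regular expression snippet leaves open is the order of the alternatives. Python's `re` tries them from left to right, so when one key is a prefix of another, the shorter key can win. A sketch that sorts the keys longest first (the function name `regex_replace` is hypothetical, and the sort is an addition that was not part of the timed code):

```python
import re


def regex_replace(text, mapping):
    """Single-pass replacement; longer keys take priority over their prefixes."""
    if not mapping:
        return text  # an empty alternation would match the empty string
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up in the original, unescaped dictionary.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(regex_replace("New York, New Jersey", {"New": "Old", "New York": "NYC"}))
# NYC, Old Jersey
```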


    Replace

    This method simply applies string.replace in a loop. (Later I talk about problems with this.)

    for original, replacement in my_dict.items():
        string = string.replace(original, replacement)
    

    Another answer proposes a variation using reduce, which applies a lambda expression cumulatively to the items of a sequence. This is best understood with an example from the official documentation. The expression

    reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
    

    equals ((((1+2)+3)+4)+5)

    import functools
    new_string = functools.reduce(lambda a, k: a.replace(*k), 
                                  my_dict.items(), string)
    

    Python 3.8 introduced assignment expressions, which allow the following variant. At its core it also relies on string.replace.

    [string := string.replace(f' {a} ', f' {b} ') for a, b in my_dict.items()]
    

    Execution times were (in parentheses, the results for the reduce and assignment expression variants):

    1. 1.37 (1.39) (1.50)
    2. 4.10 (4.12) (4.07)
    3. 1.9 (1.8) (no Python 3.8 on that machine)


    Recursive Lambda

    This proposal uses a recursive lambda.

    mrep = lambda s, d: s if not d else mrep(s.replace(*d.popitem()), d)
    new_string = mrep(string, my_dict)
    

    Execution times were:

    1. 0.07
    2. RecursionError
    3. RecursionError
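
Two properties of the recursive version are worth seeing in isolation: it needs one stack frame per dictionary entry, and `popitem()` empties the dictionary it is given. The sketch below rewrites the lambda as a named function with the same logic. If the same dictionary object was reused across the 50 repetitions, only the first one would have done any work, which could account for part of the 0.07 figure; the text does not say either way.

```python
import sys


def mrep(s, d):
    # Same logic as the recursive lambda: one stack frame per dictionary entry.
    return s if not d else mrep(s.replace(*d.popitem()), d)


small = {"cat": "bird", "dog": "fish"}
print(mrep("cat and dog", small))  # bird and fish
print(small)                       # {} because popitem() consumed every entry

# A dictionary with more entries than the recursion limit cannot be processed.
big = {f"key{i}": f"value{i}" for i in range(sys.getrecursionlimit() + 100)}
try:
    mrep("key1 key2", big)
    hit_limit = False
except RecursionError:
    hit_limit = True
print(hit_limit)  # True
```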


    Practical remarks

    See the update above: Flashtext is much faster than the other alternatives.

    The execution times show the recursive approach as clearly the fastest, but it only works with small dictionaries. Raising the recursion limit much in Python is not recommended, so I rule this approach out for longer dictionaries.

    Regular expressions offer more control over your substitutions. For example, you may use \b before or after an element to ensure that there are no word characters on that side of the target substring (to prevent {'a': '1'} from being applied to 'apple'). The cost is that performance drops sharply for longer dictionaries: in experiment 3 it took about four times as long as the other options (7.7 s versus 1.8 to 1.9 s).
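
The \b idea can be sketched as follows. The function name `regex_replace_words` is hypothetical, and the pattern assumes every key starts and ends with a word character; for keys that begin or end with punctuation, \b would not behave as intended.

```python
import re


def regex_replace_words(text, mapping):
    """Single-pass replacement that only matches keys as whole words."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    # \b on both sides: no word character may touch the match.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(regex_replace_words("a apple a day", {"a": "1"}))
# 1 apple 1 day
```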

    Assignment expressions, reduce and a plain replace loop offer similar performance (assignment expressions could not be tested with the longer dictionary). Taking readability into account, string.replace seems like the best option. Its drawback, compared to regular expressions, is that substitutions happen sequentially, not in a single pass, so the output of one substitution can be rewritten by a later one: {'a': 'b', 'b': 'c'} turns the string 'a' into 'c'. Dictionaries preserve insertion order since Python 3.7 (though you may prefer to keep using OrderedDict), so you can order the substitutions carefully to avoid problems. Of course, with 80k substitutions you cannot rely on this.
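
The difference between sequential and single-pass substitution is easy to reproduce with the two-entry dictionary from the paragraph above. This is only an illustration of the effect, not the code used for the timings:

```python
import re

mapping = {"a": "b", "b": "c"}

# Sequential: the 'b' produced by the first replace is rewritten by the second.
sequential = "a"
for original, replacement in mapping.items():
    sequential = sequential.replace(original, replacement)

# Single pass: each position of the input is matched at most once.
pattern = re.compile("|".join(re.escape(k) for k in mapping))
single_pass = pattern.sub(lambda m: mapping[m.group(0)], "a")

print(sequential)   # c
print(single_pass)  # b
```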

    I am currently using a loop with replace, and doing some preprocessing to minimize trouble. I add spaces on both sides of punctuation (also in the dictionary, for items containing punctuation). Then I can search for substrings surrounded by spaces, and insert substitutions surrounded by spaces as well. This also works when your targets are multiple words:

    string = 'This is: an island'
    my_dict = {'is': 'is not', 'an island': 'a museum'}
    

    Using replace and regular expressions I get string = ' This is : an island ' so that my replace loop

    for original, replacement in my_dict.items():
        string = string.replace(f' {original} ', f' {replacement} ')
    

    returns ' This is not : a museum ' as intended. Note that 'is' in 'This' and 'island' were left alone. Regular expressions could be used to fix punctuation back, although I don't require this step.
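
The preprocessing step is not shown in the answer, so the following is only one way it might be done. The function names `pad_punctuation` and `replace_whole_words` and the exact regular expression are assumptions of this sketch.

```python
import re


def pad_punctuation(text):
    # Put a space on each side of every character that is neither a word
    # character nor whitespace (note: underscore counts as a word character).
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse runs of whitespace and pad both ends with a single space.
    return " " + " ".join(text.split()) + " "


def replace_whole_words(text, mapping):
    # Targets and replacements are both wrapped in spaces, as in the loop above.
    for original, replacement in mapping.items():
        text = text.replace(f" {original} ", f" {replacement} ")
    return text


string = pad_punctuation("This is: an island")
print(repr(string))
# ' This is : an island '
print(repr(replace_whole_words(string, {"is": "is not", "an island": "a museum"})))
# ' This is not : a museum '
```

One caveat of the space-delimited approach: two adjacent occurrences share a single space, and `str.replace` does not match overlapping substrings, so `replace_whole_words(" is is ", {"is": "X"})` gives `' X is '` and leaves the second occurrence alone.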
