Is it worth using Python's re.compile?

旧时难觅i 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')
26 answers
  • 2020-11-22 13:10

    For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

    Even a simple expression such as 0|[1-9][0-9]* (an integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check it for typos, and then recheck it for typos again when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

    It's certainly possible to store strings and pass them to re.match; however, that's less readable:

    num = "..."
    # then, much later:
    m = re.match(num, input)
    

    Versus compiling:

    num = re.compile("...")
    # then, much later:
    m = num.match(input)
    

    Though the two are fairly close, the last line of the second version feels more natural and simpler when used repeatedly.
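
    For instance, a minimal sketch of that style (the name NUM_B10 and the helper is_int are illustrative, not from the original):

    import re

    # Compile once, at module level, under a descriptive name.
    NUM_B10 = re.compile(r'0|[1-9][0-9]*')  # base-10 integer, no leading zeros

    def is_int(s):
        # fullmatch() requires the whole string to match, not just a prefix.
        return NUM_B10.fullmatch(s) is not None

    print(is_int('42'))   # True
    print(is_int('042'))  # False: leading zero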

  • 2020-11-22 13:14

    Here is an example where using re.compile is over 50 times faster, as requested.

    The point is the same one I made in the comment above: using re.compile can be a significant advantage when your usage pattern doesn't benefit much from the compilation cache. This happens in at least one particular case (that I ran into in practice), namely when all of the following are true:

    • You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
    • you use these regexes a lot of times, and
    • your consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes, so that each one gets flushed from the cache between uses.

    import re
    import time
    
    def setup(N=1000):
        # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
        patterns = [chr(i) + '.*' + chr(j)
                        for i in range(ord('a'), ord('z') + 1)
                        for j in range(ord('a'), ord('z') + 1)]
        # If this assertion fails, just add more (distinct) patterns.
        assert re._MAXCACHE < len(patterns)
        # N strings. Increase N for larger effect.
        strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
        return (patterns, strings)
    
    def without_compile():
        print('Without re.compile:')
        patterns, strings = setup()
        print('searching')
        count = 0
        for s in strings:
            for pat in patterns:
                count += bool(re.search(pat, s))
        return count
    
    def without_compile_cache_friendly():
        print('Without re.compile, cache-friendly order:')
        patterns, strings = setup()
        print('searching')
        count = 0
        for pat in patterns:
            for s in strings:
                count += bool(re.search(pat, s))
        return count
    
    def with_compile():
        print('With re.compile:')
        patterns, strings = setup()
        print('compiling')
        compiled = [re.compile(pattern) for pattern in patterns]
        print('searching')
        count = 0
        for s in strings:
            for regex in compiled:
                count += bool(regex.search(s))
        return count
    
    start = time.time()
    print(with_compile())
    d1 = time.time() - start
    print(f'-- That took {d1:.2f} seconds.\n')
    
    start = time.time()
    print(without_compile_cache_friendly())
    d2 = time.time() - start
    print(f'-- That took {d2:.2f} seconds.\n')
    
    start = time.time()
    print(without_compile())
    d3 = time.time() - start
    print(f'-- That took {d3:.2f} seconds.\n')
    
    print(f'Ratio: {d3/d1:.2f}')
    

    Example output I get on my laptop (Python 3.7.7):

    With re.compile:
    compiling
    searching
    676000
    -- That took 0.33 seconds.
    
    Without re.compile, cache-friendly order:
    searching
    676000
    -- That took 0.67 seconds.
    
    Without re.compile:
    searching
    676000
    -- That took 23.54 seconds.
    
    Ratio: 70.89
    

    I didn't bother with timeit, as the difference is so stark, but I get qualitatively similar numbers each time.

    Note that even without re.compile, using the same regex multiple times before moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected.

    Also, increasing the cache size works: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production, as names with underscores are conventionally "private") drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.
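
    For reference, a minimal sketch of the cache machinery involved (re._MAXCACHE is an undocumented internal, so this is illustrative; re.purge() is the public, documented way to clear the cache):

    import re

    print(re._MAXCACHE)        # 512 in current CPython; an internal detail

    re.search('a.*b', 'text')  # compiles 'a.*b' and stores it in the cache
    re.purge()                 # clears the cache; the next re.search('a.*b', ...)
                               # has to recompile the pattern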

  • 2020-11-22 13:14

    According to the Python documentation:

    The sequence

    prog = re.compile(pattern)
    result = prog.match(string)
    

    is equivalent to

    result = re.match(pattern, string)
    

    but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

    So my conclusion is: if you are going to match the same pattern against many different texts, you are better off precompiling it.
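
    A minimal sketch of that usage (the pattern and data here are illustrative):

    import re

    # Compile once...
    phone = re.compile(r'[0-9]{3}-[0-9]{4}')

    # ...then reuse the compiled object against many different texts.
    lines = ['call 555-1234', 'no number here', 'fax 555-9876']
    for line in lines:
        m = phone.search(line)
        if m:
            print(m.group())  # 555-1234, then 555-9876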

  • 2020-11-22 13:15

    As an alternative answer, since I see that it hasn't been mentioned before, I'll quote the Python 3 docs:

    Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.
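
    In loop terms, the difference the docs describe looks roughly like this (a sketch; the pattern and data are illustrative):

    import re

    words = ['spam', 'spamspam', 'eggs']

    # Module-level function: every call repeats the cache lookup.
    hits = sum(bool(re.match(r'sp(am)+', w)) for w in words)

    # Pre-compiled: lookup and compilation happen once, outside the loop.
    pat = re.compile(r'sp(am)+')
    hits = sum(bool(pat.match(w)) for w in words)  # same result, fewer calls
    print(hits)  # 2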

  • 2020-11-22 13:15

    I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference.

    The votes on the accepted answer lead to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference arises when you have to decide whether to accept a regex string or a compiled regex object as a parameter to a function:

    >>> timeit.timeit(setup="""
    ... import re
    ... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
    ... h=re.compile('hello')
    ... """, stmt="f(h, 'hello world')")
    0.32881879806518555
    >>> timeit.timeit(setup="""
    ... import re
    ... f=lambda x, y: re.compile(x).match(y)   # compiles when called
    ... """, stmt="f('hello', 'hello world')")
    0.809190034866333
    

    It is always better to compile your regexes if you are going to reuse them.

    Note that the timeit example above simulates creating the compiled regex object once at import time versus compiling it "on-the-fly" each time a match is required.
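
    One way to get the best of both is to normalize the argument inside the function (a sketch; find_matching is illustrative, and it relies on the CPython behaviour that re.compile() passes an already-compiled Pattern through unchanged when no flags are given):

    import re

    def find_matching(pattern, texts):
        # Accept either a string or a compiled Pattern: re.compile()
        # returns a Pattern unchanged (in CPython, if no flags are given),
        # so the compile cost is paid at most once here.
        regex = re.compile(pattern)
        return [t for t in texts if regex.search(t)]

    texts = ['hello world', 'goodbye']
    print(find_matching('hello', texts))              # string argument
    print(find_matching(re.compile('hello'), texts))  # precompiled argument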

  • 2020-11-22 13:16

    Here's a simple test case:

    ~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
    1 loops, best of 3: 3.1 usec per loop
    10 loops, best of 3: 2.41 usec per loop
    100 loops, best of 3: 2.24 usec per loop
    1000 loops, best of 3: 2.21 usec per loop
    10000 loops, best of 3: 2.23 usec per loop
    100000 loops, best of 3: 2.24 usec per loop
    1000000 loops, best of 3: 2.31 usec per loop
    

    with re.compile:

    ~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
    1 loops, best of 3: 1.91 usec per loop
    10 loops, best of 3: 0.691 usec per loop
    100 loops, best of 3: 0.701 usec per loop
    1000 loops, best of 3: 0.684 usec per loop
    10000 loops, best of 3: 0.682 usec per loop
    100000 loops, best of 3: 0.694 usec per loop
    1000000 loops, best of 3: 0.702 usec per loop
    

    So it would seem that compiling is faster in this simple case, even if you only match once.
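
    The same comparison can also be run from inside Python (a sketch using timeit.timeit; the exact numbers will vary by machine):

    import re
    import timeit

    uncompiled = timeit.timeit(
        "re.match('[0-9]{3}-[0-9]{3}-[0-9]{4}', '123-123-1234')",
        setup='import re', number=1_000_000)
    compiled = timeit.timeit(
        "r.match('123-123-1234')",
        setup="import re; r = re.compile('[0-9]{3}-[0-9]{3}-[0-9]{4}')",
        number=1_000_000)
    print(f'uncompiled: {uncompiled:.2f}s  compiled: {compiled:.2f}s')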
