Is it worth using Python's re.compile?

旧时难觅i 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')
26 answers
  • 2020-11-22 13:03

    FWIW:

    $ python -m timeit -s "import re" "re.match('hello', 'hello world')"
    100000 loops, best of 3: 3.82 usec per loop
    
    $ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
    1000000 loops, best of 3: 1.26 usec per loop
    

    so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

    The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

    Update:

    Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

    % python -m timeit -s "import re" "re.match('hello', 'hello world')"
    1000000 loops, best of 3: 0.661 usec per loop
    
    % python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
    1000000 loops, best of 3: 0.285 usec per loop
    
    % python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
    1000000 loops, best of 3: 0.65 usec per loop
    
    % python --version
    Python 3.6.5 :: Anaconda, Inc.
    

    I also added a case (note the quotation-mark differences between the last two runs) showing that re.match(x, ...) costs about the same as compiling and matching inside the timed statement; the module-level call still pays a per-call overhead to look up (or rebuild) the compiled pattern, so the saving only appears when you compile once outside the timed statement.

  • 2020-11-22 13:04

    With the second version, the regular expression is compiled once before it is used. If you are going to execute it many times, it is definitely better to compile it first. For one-off matches, not compiling and calling the module-level function directly is fine.
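    As a minimal sketch of the compile-once pattern (the pattern and data here are invented purely for illustration):

    ```python
    import re

    # Compile once, outside the loop...
    word = re.compile(r'\bhello\b')

    lines = ['hello world', 'goodbye', 'hello again']
    # ...then reuse the compiled object for every match.
    hits = [line for line in lines if word.search(line)]
    print(hits)  # ['hello world', 'hello again']
    ```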

  • 2020-11-22 13:05

    Using the given examples:

    h = re.compile('hello')
    h.match('hello world')
    

    The match method in the example above is not the same as the one used below:

    re.match('hello', 'hello world')
    

    re.compile() returns a regular expression object, which means h is a regex object.

    The regex object has its own match method with the optional pos and endpos parameters:

    regex.match(string[, pos[, endpos]])

    pos

    The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

    endpos

    The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

    The regex object's search, findall, and finditer methods also support these parameters.

    re.match(pattern, string, flags=0) does not support them, as you can see,
    nor do its search, findall, and finditer counterparts.

    A match object has attributes that complement these parameters:

    match.pos

    The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

    match.endpos

    The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.


    A regex object has two unique, possibly useful, attributes:

    regex.groups

    The number of capturing groups in the pattern.

    regex.groupindex

    A dictionary mapping any symbolic group names defined by (?P<name>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.


    And finally, a match object has this attribute:

    match.re

    The regular expression object whose match() or search() method produced this match instance.
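
    A short sketch tying these pieces together (the pattern and string are made up for illustration):

    ```python
    import re

    # pos/endpos are only available on compiled pattern objects.
    h = re.compile(r'(?P<word>world)')

    # Start matching at index 6 of the string instead of index 0.
    m = h.match('hello world', 6)
    print(m.group('word'))       # world
    print(m.pos, m.endpos)       # 6 11

    # Pattern-level attributes described above:
    print(h.groups)              # 1
    print(dict(h.groupindex))    # {'word': 1}

    # And the match object points back at the pattern that produced it:
    print(m.re is h)             # True
    ```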

  • 2020-11-22 13:06

    I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

    EDIT: After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).

    From module re.py (comments are mine):

    def match(pattern, string, flags=0):
        return _compile(pattern, flags).match(string)
    
    def _compile(*key):
    
        # Does cache check at top of function
        cachekey = (type(key[0]),) + key
        p = _cache.get(cachekey)
        if p is not None: return p
    
        # ...
        # Does actual compilation on cache miss
        # ...
    
        # Caches compiled regex
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        _cache[cachekey] = p
        return p
    

    I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.
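
    One way to observe that caching on CPython (note that re.purge is public, but the identity of the returned object is an implementation detail that could change):

    ```python
    import re

    re.purge()  # clear re's internal pattern cache

    # Both calls go through _compile; the second is served from the
    # cache, so CPython hands back the very same pattern object.
    p1 = re.compile('hello')
    p2 = re.compile('hello')
    print(p1 is p2)  # True
    ```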

  • 2020-11-22 13:06

    I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

    As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

    However, examination of the code shows the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular-expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression such as r'\w+\s+([0-9_]+)\s+\w*'.

    Here's my test:

    #!/usr/bin/env python
    import re
    import time
    
    def timed(func):
        def wrapper(*args):
            t = time.time()
            result = func(*args)
            t = time.time() - t
            print '%s took %.3f seconds.' % (func.func_name, t)
            return result
        return wrapper
    
    regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
    testString = "average    2 never"
    
    @timed
    def noncompiled():
        a = 0
        for x in xrange(1000000):
            m = re.match(regularExpression, testString)
            a += int(m.group(1))
        return a
    
    @timed
    def compiled():
        a = 0
        rgx = re.compile(regularExpression)
        for x in xrange(1000000):
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    @timed
    def reallyCompiled():
        a = 0
        rgx = re.sre_compile.compile(regularExpression)
        for x in xrange(1000000):
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    
    @timed
    def compiledInLoop():
        a = 0
        for x in xrange(1000000):
            rgx = re.compile(regularExpression)
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    @timed
    def reallyCompiledInLoop():
        a = 0
        for x in xrange(10000):
            rgx = re.sre_compile.compile(regularExpression)
            m = rgx.match(testString)
            a += int(m.group(1))
        return a
    
    r1 = noncompiled()
    r2 = compiled()
    r3 = reallyCompiled()
    r4 = compiledInLoop()
    r5 = reallyCompiledInLoop()
    print "r1 = ", r1
    print "r2 = ", r2
    print "r3 = ", r3
    print "r4 = ", r4
    print "r5 = ", r5

    And here is the output on my machine:

    $ regexTest.py 
    noncompiled took 4.555 seconds.
    compiled took 2.323 seconds.
    reallyCompiled took 2.325 seconds.
    compiledInLoop took 4.620 seconds.
    reallyCompiledInLoop took 4.074 seconds.
    r1 =  2000000
    r2 =  2000000
    r3 =  2000000
    r4 =  2000000
    r5 =  20000
    

    The 'reallyCompiled' methods use the internal interface, which bypasses the cache. Note that the one which compiles on each loop iteration is only run 10,000 times, not one million (hence r5 = 20000).
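
    For reference, a rough Python 3 equivalent of the core comparison using timeit (the script above is Python 2; absolute numbers will of course differ by machine and version):

    ```python
    import re
    import timeit

    pattern = r'\w+\s+([0-9_]+)\s+\w*'
    text = "average    2 never"
    rgx = re.compile(pattern)

    n = 100_000
    # Module-level call: pays a cache lookup on every iteration.
    t_module = timeit.timeit(lambda: re.match(pattern, text), number=n)
    # Precompiled object: skips the lookup entirely.
    t_object = timeit.timeit(lambda: rgx.match(text), number=n)
    print(f"re.match : {t_module:.3f}s")
    print(f"rgx.match: {t_object:.3f}s")
    ```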

  • 2020-11-22 13:08

    This is a good question. You often see people use re.compile without reason, which lessens readability. But there are certainly plenty of cases where pre-compiling the expression is called for, such as when you use it repeatedly in a loop.

    It's like everything about programming (everything in life actually). Apply common sense.
