Is it worth using Python's re.compile?

旧时难觅i 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')
26 Answers
  • 2020-11-22 13:24

    Besides the performance difference:

    When I started learning regular expressions, using compile helped me distinguish three concepts:
    1. the module (re),
    2. the regex (pattern) object, and
    3. the match object.

    # regex object
    regex_object = re.compile(r'[a-zA-Z]+')
    # match object
    match_object = regex_object.search('1.Hello')
    # matching content
    match_object.group()
    # Out[60]: 'Hello'

    vs.

    re.search(r'[a-zA-Z]+', '1.Hello').group()
    # Out[61]: 'Hello'
    

    As a complement, I made an exhaustive cheat sheet of the re module for your reference:

    regex = {
        'brackets': {'single_character': ['[]', '.', {'negate': '^'}],
                     'capturing_group' : ['()', '(?:)', '(?!)', '|', '\\', 'backreferences and named groups'],
                     'repetition'      : ['{}', '*?', '+?', '??', 'greedy vs. lazy ?']},
        'lookaround': {'lookahead'  : ['(?=...)', '(?!...)'],
                       'lookbehind' : ['(?<=...)', '(?<!...)'],
                       'capturing'  : ['(?P<name>...)', '(?P=name)', '(?:)']},
        'escapes': {'anchor'        : ['^', r'\b', '$'],
                    'non_printable' : [r'\n', r'\t', r'\r', r'\f', r'\v'],
                    'shorthand'     : [r'\d', r'\w', r'\s']},
        'methods': [['search', 'match', 'findall', 'finditer'],
                    ['split', 'sub']],
        'match_object': ['group', 'groups', 'groupdict', 'start', 'end', 'span']
    }
    
  • 2020-11-22 13:24

    Legibility/cognitive load preference

    To me, the main gain is that I only need to remember, and read, one form of the complicated regex API syntax: the <compiled_pattern>.method(xxx) form, rather than both that and the re.func(<pattern>, xxx) form.

    The re.compile(<pattern>) call is a bit of extra boilerplate, true.

    But where regexes are concerned, that extra compile step is unlikely to be a big cause of cognitive load. And in fact, on complicated patterns, you might even gain clarity from separating the declaration from whatever regex method you then invoke on it.

    I tend to first tune complicated patterns on a website like Regex101, or even in a separate minimal test script, and then bring them into my code, so separating the declaration from its use fits my workflow as well.
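
    For instance, a minimal sketch of that workflow (the pattern and names here are just illustrative):

    import re

    # Declared once, near the top of the module, after tuning it elsewhere.
    LOG_LINE = re.compile(r'(?P<level>INFO|WARN|ERROR):\s+(?P<msg>.*)')

    def parse(line):
        m = LOG_LINE.match(line)
        return m.groupdict() if m else None

    parse('ERROR:  disk full')   # -> {'level': 'ERROR', 'msg': 'disk full'}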

  • 2020-11-22 13:26

    This answer might be arriving late, but it is an interesting find. Using compile can really save you time if you are planning on using the regex multiple times (this is also mentioned in the docs). Below you can see that calling the match method directly on a compiled regex is the fastest, passing a compiled regex to re.match is actually the slowest, and passing re.match the pattern string falls somewhere in the middle.

    >>> import re, timeit
    >>> average = lambda *times: sum(times) / len(times)   # assumed helper: mean of the timeit.repeat results
    >>> ipr = r'\D+((([0-2][0-5]?[0-5]?)\.){3}([0-2][0-5]?[0-5]?))\D+'
    >>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
    1.5077415757028423
    >>> ipr = re.compile(ipr)
    >>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
    1.8324008992184038
    >>> average(*timeit.repeat("ipr.match('abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
    0.9187896518778871
    
  • 2020-11-22 13:26

    My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

    Here's a reference for you: http://diveintopython3.ep.io/refactoring.html

    Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)
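
    A rough sketch of that equivalence (simplified; my_search here is just an illustrative stand-in, and the real re module also caches compiled patterns):

    import re

    # Roughly what re.search(pattern, string) does under the hood:
    def my_search(pattern, string, flags=0):
        return re.compile(pattern, flags).search(string)

    pat = re.compile(r'\d+')
    pat.search('room 101')          # <re.Match object; span=(5, 8), match='101'>
    my_search(r'\d+', 'room 101')   # same result, but compiles (or hits re's cache) on every call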

  • 2020-11-22 13:29

    Performance difference aside, using re.compile and then using the compiled regular expression object to do the matching (or whatever regular-expression-related operation) makes the semantics clearer to the Python run-time.

    I had a painful experience debugging some simple code:

    compare = lambda s, p: re.match(p, s)
    

    and later I'd use compare in

    [x for x in data if compare(patternPhrases, x[columnIndex])]
    

    where patternPhrases is supposed to be a variable containing a regular expression string, and x[columnIndex] is a variable containing a string.

    The trouble was that patternPhrases did not match some of the expected strings!

    But if I used the re.compile form:

    compare = lambda s, p: p.match(s)
    

    then in

    [x for x in data if compare(patternPhrases, x[columnIndex])]
    

    Python would have complained that "the string object does not have a match attribute", because, by the positional argument mapping in compare, x[columnIndex] is used as the regular expression! What I actually meant was

    compare = lambda p, s: p.match(s)
    

    In my case, using re.compile makes the purpose of the regular expression more explicit when its value is hidden from the naked eye, so I could get more help from Python's run-time checking.

    So the moral of my lesson is: when the regular expression is not just a literal string, I should use re.compile to let Python help me assert my assumption.
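
    A small sketch illustrating the failure mode (pattern and text are made-up names for the example):

    import re

    pattern = re.compile(r'\d{4}')
    text = 'year 2020'

    compare = lambda s, p: p.match(s)   # parameters in the confusing order from above
    try:
        compare(pattern, text)          # text ends up being used as the pattern
    except AttributeError as e:
        print(e)                        # 'str' object has no attribute 'match'

    # With the re.match form, the same swap fails silently instead of loudly:
    re.match(text, pattern.pattern)     # just returns None, no error raised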

  • 2020-11-22 13:30

    I agree with Honest Abe that the match(...) calls in the given examples are different. They are not a one-to-one comparison, and thus the outcomes vary. To simplify my reply, I use A, B, C, D for the functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

    Running this piece of code:

    h = re.compile('hello')                   # (A)
    h.match('hello world')                    # (B)
    

    is the same as running this code:

    re.match('hello', 'hello world')          # (C)
    

    Because, looking into the source of re.py, (A + B) means:

    h = re._compile('hello')                  # (D)
    h.match('hello world')
    

    and (C) is actually:

    re._compile('hello').match('hello world')
    

    So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D), which is also what (A) calls. In other words, (C) = (A) + (B). Therefore, running (A + B) inside a loop has the same result as running (C) inside a loop.

    George's regexTest.py proved this for us.

    noncompiled took 4.555 seconds.           # (C) in a loop
    compiledInLoop took 4.620 seconds.        # (A + B) in a loop
    compiled took 2.323 seconds.              # (A) once + (B) in a loop
    

    What everyone is interested in is how to get the 2.323-second result. To make sure compile(...) only gets called once, we need to store the compiled regex object in memory. If we are using a class, we can store the object and reuse it every time our function gets called.

    import re

    class Foo:
        regex = re.compile('hello')           # compiled once, when the class body runs

        def my_function(self, text):
            return self.regex.match(text)     # reuses the stored pattern object
    

    If we are not using a class (which is my situation today), then I have no comment. I'm still learning to use global variables in Python, and I know global variables are a bad thing.
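
    For what it's worth, a minimal sketch of the module-level alternative, compiled once at import time (names are illustrative):

    import re

    HELLO_RE = re.compile('hello')    # module-level constant, compiled once at import

    def my_function(text):
        return HELLO_RE.match(text)   # every call reuses the same pattern object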

    One more point: I believe the (A) + (B) approach has the upper hand. Here are some facts as I observed them (please correct me if I'm wrong):

    1. Calling (A) once does one lookup in the _cache followed by one sre_compile.compile() to create the regex object. Calling (A) twice does two lookups and one compile (because the regex object is cached).

    2. If the _cache gets flushed in between, the regex object is released from memory and Python needs to compile it again. (Someone suggested that Python won't recompile.)

    3. If we keep the regex object obtained from (A), it still gets put into the _cache and may get flushed from it somehow. But our code keeps a reference to it, so the regex object is not released from memory and Python need not compile it again. (See the sketch after this list.)

    4. The 2-second difference between George's compiledInLoop and compiled tests is mainly the time required to build the key and search the _cache, not the compile time of the regex.

    5. George's reallycompile test shows what happens if the compile really is redone every time: it is 100x slower (he reduced the loop from 1,000,000 to 10,000).
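
    A small illustration of point 3, assuming CPython, where re exposes re.purge() and keeps an internal _cache dict (an implementation detail, not a public guarantee):

    import re

    p = re.compile(r'hello')            # (A): our own reference to the pattern object
    re.purge()                          # flush the module's internal pattern cache
    print(p.match('hello world'))       # still works; p is alive, nothing recompiled

    re.match(r'hello', 'hello world')   # (C): goes through the cache lookup again
    print(len(re._cache))               # >= 1 on CPython (internal detail, may change)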

    Here are the only cases where (A + B) is better than (C):

    1. We can cache a reference to the regex object inside a class.
    2. We need to call (B) repeatedly (inside a loop or multiple times); then we must cache the reference to the regex object outside the loop (see the sketch after these lists).

    Cases where (C) is good enough:

    1. We cannot cache a reference.
    2. We only use it once in a while.
    3. Overall, we don't have too many regexes (assuming the compiled ones never get flushed from the cache).
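
    For example, a minimal sketch of hoisting the compile out of the loop (the data here is made up):

    import re

    lines = ['hello world', 'goodbye world', 'hello again']

    hello_re = re.compile('hello')                    # (A): compiled once, outside the loop
    hits = [s for s in lines if hello_re.match(s)]    # (B): the same object reused per line
    # hits == ['hello world', 'hello again']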

    Just to recap, here are (A), (B), and (C):

    h = re.compile('hello')                   # (A)
    h.match('hello world')                    # (B)
    re.match('hello', 'hello world')          # (C)
    

    Thanks for reading.
