Is it worth using Python's re.compile?

旧时难觅i 2020-11-22 12:51

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

26 Answers
  • 2020-11-22 13:16

    Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

    import re
    import time
    
    rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
    str = "average    2 never"
    a = 0
    
    t = time.time()
    
    for i in xrange(1000000):
        if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
        #~ if rgx.match(str):
            a += 1
    
    print time.time() - t
    

    Running the above code once as is, and once with the two if lines commented the other way around, shows the compiled regex to be twice as fast.
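The same comparison can be repeated on Python 3 with timeit, which handles the loop and the clock. This is only a sketch: the function names and the default call count are assumptions made for illustration, and the measured ratio will vary by interpreter version and machine.

```python
import re
import timeit

PATTERN = r'(\w+)\s+[0-9_]?\s+\w*'
TEXT = "average    2 never"
RGX = re.compile(PATTERN)  # compiled once, outside the timed loop


def time_module_level(number=100_000):
    """Seconds for `number` calls that pass the pattern string to re.match."""
    return timeit.timeit(lambda: re.match(PATTERN, TEXT), number=number)


def time_precompiled(number=100_000):
    """Seconds for `number` calls on the precompiled pattern object."""
    return timeit.timeit(lambda: RGX.match(TEXT), number=number)


if __name__ == "__main__":
    print("re.match(pattern, text):", time_module_level())
    print("compiled.match(text):   ", time_precompiled())
```

Both variants do the same matching work, so any difference comes from the per-call cache lookup inside the re module.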

  • 2020-11-22 13:17

    (months later) it's easy to add your own cache around re.match, or anything else for that matter --

    """ Re.py: Re.match = re.match + cache  
        efficiency: re.py does this already (but what's _MAXCACHE ?)
        readability, inline / separate: matter of taste
    """
    
    import re
    
    cache = {}
    _re_type = type( re.compile( "" ))
    
    def match( pattern, string, *opt ):
        """ Re.match = re.match + cache re.compile( pattern ) 
        """
        if isinstance( pattern, _re_type ):
            cpat = pattern
        else:
            # key on the flags too: the same text with different flags is a different regex
            key = (pattern, opt)
            if key not in cache:
                cache[key] = re.compile( pattern, *opt )
            cpat = cache[key]
        return cpat.match( string )
    
    # def search ...
    

    A wibni, wouldn't it be nice if: cachehint( size= ), cacheinfo() -> size, hits, nclear ...
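Part of that wish list is covered by the standard library: functools.lru_cache takes a maxsize and exposes cache_info() with hits, misses and the current size. A minimal sketch, assuming a wrapper named cached_compile (a hypothetical name, not part of re):

```python
import functools
import re


@functools.lru_cache(maxsize=256)
def cached_compile(pattern, flags=0):
    """Compile once per (pattern, flags) pair; later calls reuse the result."""
    return re.compile(pattern, flags)


def match(pattern, string, flags=0):
    """Like re.match, but backed by the lru_cache above."""
    return cached_compile(pattern, flags).match(string)


match(r"\d+", "123 abc")   # first call: cache miss, pattern gets compiled
match(r"\d+", "456 def")   # second call: cache hit
info = cached_compile.cache_info()
print(info.hits, info.misses, info.currsize)  # 1 1 1
```

cached_compile.cache_clear() empties the cache. The maxsize of 256 is an arbitrary choice for the example.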

  • 2020-11-22 13:18

    I really respect all the above answers. In my opinion, yes! It is certainly worth using re.compile instead of compiling the regex again and again, every time.

    Using re.compile makes your code more dynamic, as you can call the already compiled regex instead of compiling it again and again. This benefits you in several ways:

    1. Less processor effort.
    2. Less time spent.
    3. Makes the regex universal (it can be used in findall, search, match).
    4. And makes your program look cool.

    Example:

    import re

    example_string = "The room number of her room is 26A7B."
    find_alpha_numeric_string = re.compile(r"\b\w+\b")
    

    Using it with findall:

    find_alpha_numeric_string.findall(example_string)
    

    Using it with search:

    find_alpha_numeric_string.search(example_string)
    

    Similarly, you can use it for match and sub (substitute).
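For completeness, here is the same pattern object reused across four methods, with the results each call gives. The "#" replacement text in the sub call is an arbitrary choice for the example.

```python
import re

example_string = "The room number of her room is 26A7B."
find_alpha_numeric_string = re.compile(r"\b\w+\b")

# One compiled pattern, four different operations.
words = find_alpha_numeric_string.findall(example_string)
first = find_alpha_numeric_string.search(example_string)
at_start = find_alpha_numeric_string.match(example_string)
masked = find_alpha_numeric_string.sub("#", example_string)

print(words)             # ['The', 'room', 'number', 'of', 'her', 'room', 'is', '26A7B']
print(first.group())     # The
print(at_start.group())  # The
print(masked)            # # # # # # # # #.
```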

  • 2020-11-22 13:19

    i'd like to motivate that pre-compiling is both conceptually and 'literately' (as in 'literate programming') advantageous. have a look at this code snippet:

    from re import compile as _Re
    
    class TYPO:
    
      def text_has_foobar( self, text ):
        return self._text_has_foobar_re_search( text ) is not None
      _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search
    
    TYPO = TYPO()
    

    in your application, you'd write:

    from TYPO import TYPO
    print( TYPO.text_has_foobar( 'FOObar' ) )
    

    this is about as simple in terms of functionality as it can get. because this example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

    compare this with the more usual style, below:

    import re
    
    class Typo:
    
      def text_has_foobar( self, text ):
        return re.compile( r"""(?i)foobar""" ).search( text ) is not None
    

    In the application:

    typo = Typo()
    print( typo.text_has_foobar( 'FOObar' ) )
    

    I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.

    let it be said that every subset of these measures (targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups) can help reduce computational and conceptual complexity.
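A middle ground between the two styles above, sketched on the assumption that a plain module-level function is enough and no class is needed. The name _FOOBAR_SEARCH is hypothetical. The pattern is still compiled once at import time and stays visible outside the function body.

```python
import re

# Compiled once when the module is imported; the bound .search method is
# stored under a module-level name so each call skips re's cache lookup.
_FOOBAR_SEARCH = re.compile(r"(?i)foobar").search


def text_has_foobar(text):
    """Return True if text contains 'foobar' in any letter case."""
    return _FOOBAR_SEARCH(text) is not None


print(text_has_foobar("FOObar"))   # True
print(text_has_foobar("foo bar"))  # False
```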

  • 2020-11-22 13:22

    I ran this test before stumbling upon the discussion here. However, having run it I thought I'd at least post my results.

    I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a MacBook running OS X 10.6 (2 GHz Intel Core 2 Duo, 4 GB RAM). Python version is 2.6.1.

    Run 1 - using re.compile

    import re 
    import time 
    import fpformat
    Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
    Regex2 = re.compile('^[a-g]+$')
    TimesToDo = 1000
    TestString = "" 
    for i in range(1000):
        TestString += "abababdedfg"
    StartTime = time.time() 
    for i in range(TimesToDo):
        Regex1.search(TestString) 
    Seconds = time.time() - StartTime 
    print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"
    
    StartTime = time.time() 
    for i in range(TimesToDo):
        Regex2.search(TestString) 
    Seconds = time.time() - StartTime 
    print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"
    
    Alternation takes 2.299 seconds
    Character Class takes 0.107 seconds
    

    Run 2 - Not using re.compile

    import re 
    import time 
    import fpformat
    
    TimesToDo = 1000
    TestString = "" 
    for i in range(1000):
        TestString += "abababdedfg"
    StartTime = time.time() 
    for i in range(TimesToDo):
        re.search('^(a|b|c|d|e|f|g)+$',TestString) 
    Seconds = time.time() - StartTime 
    print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"
    
    StartTime = time.time() 
    for i in range(TimesToDo):
        re.search('^[a-g]+$',TestString) 
    Seconds = time.time() - StartTime 
    print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"
    
    Alternation takes 2.508 seconds
    Character Class takes 0.109 seconds
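The two runs above are Python 2 scripts, and fpformat no longer exists in Python 3. A rough Python 3 equivalent using timeit might look like the following. The helper name seconds_for and the reduced repeat count are assumptions, and no timings are claimed for it.

```python
import re
import timeit

ALTERNATION = re.compile(r'^(a|b|c|d|e|f|g)+$')
CHAR_CLASS = re.compile(r'^[a-g]+$')
TEST_STRING = "abababdedfg" * 1000  # same 11,000-character input as above


def seconds_for(regex, repeats=20):
    """Total seconds for `repeats` searches of TEST_STRING with `regex`."""
    return timeit.timeit(lambda: regex.search(TEST_STRING), number=repeats)


if __name__ == "__main__":
    print("Alternation takes     %.3f seconds" % seconds_for(ALTERNATION))
    print("Character class takes %.3f seconds" % seconds_for(CHAR_CLASS))
```

To time the uncompiled variant, pass the pattern strings to re.search inside the lambda instead of using the compiled objects.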
    
  • 2020-11-22 13:23

    Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:

    def match(pattern, string, flags=0):
        return _compile(pattern, flags).match(string)
    
    def fullmatch(pattern, string, flags=0):
        return _compile(pattern, flags).fullmatch(string)
    
    def search(pattern, string, flags=0):
        return _compile(pattern, flags).search(string)
    
    def sub(pattern, repl, string, count=0, flags=0):
        return _compile(pattern, flags).sub(repl, string, count)
    
    def subn(pattern, repl, string, count=0, flags=0):
        return _compile(pattern, flags).subn(repl, string, count)
    
    def split(pattern, string, maxsplit=0, flags=0):
        return _compile(pattern, flags).split(string, maxsplit)
    
    def findall(pattern, string, flags=0):
        return _compile(pattern, flags).findall(string)
    
    def finditer(pattern, string, flags=0):
        return _compile(pattern, flags).finditer(string)
    

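    In addition, calling a method on a pattern object returned by re.compile() bypasses the extra indirection and the cache lookup that the module-level functions go through on every call: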

    _cache = {}
    
    _pattern_type = type(sre_compile.compile("", 0))
    
    _MAXCACHE = 512
    def _compile(pattern, flags):
        # internal: compile pattern
        try:
            p, loc = _cache[type(pattern), pattern, flags]
            if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
                return p
        except KeyError:
            pass
        if isinstance(pattern, _pattern_type):
            if flags:
                raise ValueError(
                    "cannot process flags argument with a compiled pattern")
            return pattern
        if not sre_compile.isstring(pattern):
            raise TypeError("first argument must be string or compiled pattern")
        p = sre_compile.compile(pattern, flags)
        if not (flags & DEBUG):
            if len(_cache) >= _MAXCACHE:
                _cache.clear()
            if p.flags & LOCALE:
                if not _locale:
                    return p
                loc = _locale.setlocale(_locale.LC_CTYPE)
            else:
                loc = None
            _cache[type(pattern), pattern, flags] = p, loc
        return p
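The cache in the listing above can be observed from outside without touching private names. The sketch below relies on a CPython implementation detail (re.compile itself is routed through the same cache), so treat the identity checks as an illustration rather than a documented guarantee. re.purge() is the public function that clears the cache.

```python
import re

first = re.compile(r'[aeiou]{2,5}')
second = re.compile(r'[aeiou]{2,5}')

# Same pattern text and flags: CPython hands back the cached object.
same_object = first is second

re.purge()  # public API: clear the regular expression cache
third = re.compile(r'[aeiou]{2,5}')

# After the purge the pattern is compiled again, giving a new object.
rebuilt = third is not first

print(same_object, rebuilt)
```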
    

    In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where they are applied:

    #### Patterns ############################################################
    number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
    assign_pattern = re.compile(r':=')             # Assignment operator
    identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
    whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs
    
    #### Applications ########################################################
    
    if whitespace_pattern.match(s): business_logic_rule_1()
    if assign_pattern.match(s): business_logic_rule_2()
    

    Note that one other respondent incorrectly believed that pyc files store compiled patterns directly; in reality, the patterns are rebuilt each time the PYC is loaded:

    >>> import marshal
    >>> from dis import dis
    >>> with open('tmp.pyc', 'rb') as f:
    ...     f.read(8)    # skip the 8-byte header (Python 2 pyc layout)
    ...     dis(marshal.load(f))
    
      1           0 LOAD_CONST               0 (-1)
                  3 LOAD_CONST               1 (None)
                  6 IMPORT_NAME              0 (re)
                  9 STORE_NAME               0 (re)
    
      3          12 LOAD_NAME                0 (re)
                 15 LOAD_ATTR                1 (compile)
                 18 LOAD_CONST               2 ('[aeiou]{2,5}')
                 21 CALL_FUNCTION            1
                 24 STORE_NAME               2 (lc_vowels)
                 27 LOAD_CONST               1 (None)
                 30 RETURN_VALUE
    

    The above disassembly comes from the PYC file for a tmp.py containing:

    import re
    lc_vowels = re.compile(r'[aeiou]{2,5}')
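The same point can be checked on Python 3 without reading a pyc file, since the code object built by compile() is what gets marshalled into the pyc. A minimal sketch, assuming an in-memory copy of tmp.py. The variable names are made up for the example.

```python
import re

# In-memory stand-in for the tmp.py shown above.
source = "import re\nlc_vowels = re.compile(r'[aeiou]{2,5}')\n"
code = compile(source, "tmp.py", "exec")

pattern_type = type(re.compile(""))

# The constants carry the pattern *text*, not a compiled pattern object.
stored_text = '[aeiou]{2,5}' in code.co_consts
stored_compiled = any(isinstance(c, pattern_type) for c in code.co_consts)

# Only when the code runs does re.compile get called and the pattern built.
namespace = {}
exec(code, namespace)

print(stored_text, stored_compiled)  # True False
```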
    