Is there any benefit in using compile for regular expressions in Python?
h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')
Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):
import re
import time

rgx = re.compile(r'(\w+)\s+[0-9_]?\s+\w*')
str = "average 2 never"
a = 0

t = time.time()
for i in xrange(1000000):
    if re.match(r'(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t
Running the above code once as is, and once with the two if
lines commented the other way around, the compiled regex is twice as fast
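For anyone repeating this on Python 3, a minimal sketch of the same comparison using the standard timeit module (same pattern and test string as above; exact numbers will of course differ by machine and interpreter):

import re
import timeit

s = "average 2 never"
compiled = re.compile(r'(\w+)\s+[0-9_]?\s+\w*')

# re.match resolves the pattern through the module-level cache on every call
t_module = timeit.timeit(lambda: re.match(r'(\w+)\s+[0-9_]?\s+\w*', s),
                         number=1000000)

# pattern.match skips that per-call lookup entirely
t_compiled = timeit.timeit(lambda: compiled.match(s), number=1000000)

print("re.match:       %.3f s" % t_module)
print("compiled.match: %.3f s" % t_compiled)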
(months later) it's easy to add your own cache around re.match, or anything else for that matter --
""" Re.py: Re.match = re.match + cache
efficiency: re.py does this already (but what's _MAXCACHE ?)
readability, inline / separate: matter of taste
"""
import re
cache = {}
_re_type = type( re.compile( "" ))
def match( pattern, str, *opt ):
""" Re.match = re.match + cache re.compile( pattern )
"""
if type(pattern) == _re_type:
cpat = pattern
elif pattern in cache:
cpat = cache[pattern]
else:
cpat = cache[pattern] = re.compile( pattern, *opt )
return cpat.match( str )
# def search ...
A wibni (wouldn't it be nice if): cachehint( size= ), cacheinfo() -> size, hits, nclear ...
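(For what it's worth, functools.lru_cache on Python 3.2+ already covers most of that wish list: maxsize is the size hint, and cache_info() reports hits, misses and current size. A minimal sketch along the lines of the match wrapper above:)

import re
from functools import lru_cache

@lru_cache(maxsize=256)            # the "cachehint( size= )" knob
def _compile_cached(pattern, flags=0):
    return re.compile(pattern, flags)

def match(pattern, string, flags=0):
    return _compile_cached(pattern, flags).match(string)

match(r'\w+', "hello")
match(r'\w+', "world")
print(_compile_cached.cache_info())   # the "cacheinfo()" wish: hits, misses, maxsize, currsize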
I really respect all the above answers. In my opinion: yes! It is definitely worth using re.compile instead of compiling the regex again and again, every time.
Using re.compile makes your code more dynamic, as you can call the already compiled regex instead of compiling it again and again. This benefits you whenever the same pattern is reused many times.
Example :
import re

example_string = "The room number of her room is 26A7B."
find_alpha_numeric_string = re.compile(r"\b\w+\b")

# Find all the word tokens in the string
find_alpha_numeric_string.findall(example_string)

# Find the first word token in the string
find_alpha_numeric_string.search(example_string)
Similarly, you can reuse the same compiled pattern for Match and Substitute, as shown below:
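(Continuing the example above, a small illustrative sketch of match and sub on the same compiled object:)

import re

example_string = "The room number of her room is 26A7B."
find_alpha_numeric_string = re.compile(r"\b\w+\b")

# Match: succeeds only if the pattern matches at the start of the string
print(find_alpha_numeric_string.match(example_string))

# Substitute: replace every word token with a placeholder
print(find_alpha_numeric_string.sub("<word>", example_string))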
i'd like to motivate that pre-compiling is both conceptually and 'literately' (as in 'literate programming') advantageous. have a look at this code snippet:
from re import compile as _Re

class TYPO:

    def text_has_foobar( self, text ):
        return self._text_has_foobar_re_search( text ) is not None

    _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()
in your application, you'd write:
from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar' ) )
this is about as simple in terms of functionality as it can get. because this example is so short, i conflated the way to get _text_has_foobar_re_search
all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO
library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re
and the overhead of that cache are irrelevant here.
compare this with the more usual style, below:
import re

class Typo:

    def text_has_foobar( self, text ):
        return re.compile( r"""(?i)foobar""" ).search( text ) is not None
In the application:
typo = Typo()
print( typo.text_has_foobar( 'FOObar' ) )
I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re
caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.
be it said that every subset of measures---targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups---can help reduce computational and conceptual complexity.
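(As a middle ground between the two styles, here is a hedged sketch that compiles once at module level and aliases only the bound method, without the singleton class; the names are illustrative only:)

import re

# one compile at import time; aliasing .search saves an attribute lookup per call
_foobar_search = re.compile(r"(?i)foobar").search

class Typo:
    def text_has_foobar(self, text):
        # one global lookup and one function call per invocation
        return _foobar_search(text) is not None

typo = Typo()
print(typo.text_has_foobar("FOObar"))   # True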
I ran this test before stumbling upon the discussion here. However, having run it I thought I'd at least post my results.
I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a MacBook running OS X 10.6 (2 GHz Intel Core 2 Duo, 4 GB RAM). Python version is 2.6.1.
Run 1 - using re.compile
import re
import time
import fpformat

Regex1 = re.compile('^(a|b|c|d|e|f|g)+$')
Regex2 = re.compile('^[a-g]+$')

TimesToDo = 1000
TestString = ""
for i in range(1000):
    TestString += "abababdedfg"

StartTime = time.time()
for i in range(TimesToDo):
    Regex1.search(TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds, 3) + " seconds"

StartTime = time.time()
for i in range(TimesToDo):
    Regex2.search(TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds, 3) + " seconds"
Alternation takes 2.299 seconds
Character Class takes 0.107 seconds
Run 2 - Not using re.compile
import re
import time
import fpformat

TimesToDo = 1000
TestString = ""
for i in range(1000):
    TestString += "abababdedfg"

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$', TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds, 3) + " seconds"

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^[a-g]+$', TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds, 3) + " seconds"
Alternation takes 2.508 seconds
Character Class takes 0.109 seconds
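(The fpformat module was removed in Python 3, so here is a rough Python 3 port of Run 2 for anyone who wants to repeat the measurement; it is a sketch only and does not reproduce the timings above.)

import re
import time

TimesToDo = 1000
TestString = "abababdedfg" * 1000

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$', TestString)
print("Alternation takes %.3f seconds" % (time.time() - StartTime))

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^[a-g]+$', TestString)
print("Character Class takes %.3f seconds" % (time.time() - StartTime))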
Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:
def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)
In addition, re.compile() bypasses the extra indirection and caching logic:
_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p
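(Both _cache and _MAXCACHE are CPython implementation details and can change between releases, but you can poke at them interactively to watch the cache fill up; re.purge() is the public way to clear it:)

import re

print(re._MAXCACHE)      # implementation detail; 512 on recent CPython releases

re.purge()               # public API: clear the regular expression cache
re.match(r'\d+', '123')
re.match(r'[a-z]+', 'abc')
print(len(re._cache))    # 2: both patterns were compiled and cached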
In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where they are applied:
#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?') # Integer or decimal number
assign_pattern = re.compile(r':=') # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+') # Identifiers
whitespace_pattern = re.compile(r'[\t ]+') # Spaces and tabs
#### Applications ########################################################
if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()
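(To make the readability point concrete, here is a purely illustrative sketch of how such named patterns might drive a tiny scanner loop; the scan function and token names below are not part of the original answer:)

import re

number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

def scan(source):
    """Yield (kind, text) pairs; characters no rule matches are skipped."""
    rules = [('NUMBER', number_pattern), ('ASSIGN', assign_pattern),
             ('NAME', identifier_pattern), ('WS', whitespace_pattern)]
    pos = 0
    while pos < len(source):
        for kind, pattern in rules:
            m = pattern.match(source, pos)
            if m:
                if kind != 'WS':               # drop whitespace tokens
                    yield kind, m.group()
                pos = m.end()
                break
        else:
            pos += 1                           # skip unrecognized characters

print(list(scan("total := 41.5")))
# [('NAME', 'total'), ('ASSIGN', ':='), ('NUMBER', '41.5')]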
Note: one other respondent incorrectly believed that .pyc files store compiled patterns directly; in reality, the patterns are rebuilt each time the .pyc is loaded:
>>> import marshal
>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
...     f.read(8)            # skip the pyc header (magic number + timestamp)
...     dis(marshal.load(f))
  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE
The above disassembly comes from the PYC file for a tmp.py
containing:
import re
lc_vowels = re.compile(r'[aeiou]{2,5}')
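(If you want to repeat this check on a current interpreter, note that the .pyc header grew to 16 bytes in CPython 3.7+ and the file is written under __pycache__; a hedged sketch, assuming tmp.py exists as shown above:)

import dis
import marshal
import py_compile

pyc_path = py_compile.compile('tmp.py')   # returns the path under __pycache__
with open(pyc_path, 'rb') as f:
    f.read(16)                            # magic + flags + source mtime + size
    dis.dis(marshal.load(f))              # the call to re.compile is still in the bytecode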