Is there any benefit in using compile for regular expressions in Python?
h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')
Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):
import re
import time

rgx = re.compile(r'(\w+)\s+[0-9_]?\s+\w*')
str = "average 2 never"
a = 0

t = time.time()
for i in xrange(1000000):
    if re.match(r'(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t
Running the above code once as is, and once with the two if
lines commented the other way around, the compiled regex is twice as fast
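For anyone repeating this on Python 3, a minimal sketch of the same comparison using the standard timeit module (same pattern and test string as above; exact numbers will of course differ by machine and interpreter):

import re
import timeit

s = "average 2 never"
compiled = re.compile(r'(\w+)\s+[0-9_]?\s+\w*')

# re.match resolves the pattern through the module-level cache on every call
t_module = timeit.timeit(lambda: re.match(r'(\w+)\s+[0-9_]?\s+\w*', s),
                         number=1000000)

# pattern.match skips that per-call lookup entirely
t_compiled = timeit.timeit(lambda: compiled.match(s), number=1000000)

print("re.match:       %.3f s" % t_module)
print("compiled.match: %.3f s" % t_compiled)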
(months later) it's easy to add your own cache around re.match, or anything else for that matter --
""" Re.py: Re.match = re.match + cache
efficiency: re.py does this already (but what's _MAXCACHE ?)
readability, inline / separate: matter of taste
"""
import re
cache = {}
_re_type = type( re.compile( "" ))
def match( pattern, str, *opt ):
""" Re.match = re.match + cache re.compile( pattern )
"""
if type(pattern) == _re_type:
cpat = pattern
elif pattern in cache:
cpat = cache[pattern]
else:
cpat = cache[pattern] = re.compile( pattern, *opt )
return cpat.match( str )
# def search ...
A wibni (wouldn't it be nice if): cachehint( size= ), cacheinfo() -> size, hits, nclear ...
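(For what it's worth, functools.lru_cache on Python 3.2+ already covers most of that wish list: maxsize is the size hint, and cache_info() reports hits, misses and current size. A minimal sketch along the lines of the match wrapper above:)

import re
from functools import lru_cache

@lru_cache(maxsize=256)            # the "cachehint( size= )" knob
def _compile_cached(pattern, flags=0):
    return re.compile(pattern, flags)

def match(pattern, string, flags=0):
    return _compile_cached(pattern, flags).match(string)

match(r'\w+', "hello")
match(r'\w+', "world")
print(_compile_cached.cache_info())   # the "cacheinfo()" wish: hits, misses, maxsize, currsize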
I really respect all the above answers. In my opinion: yes! It is definitely worth using re.compile instead of compiling the regex again and again, every time.
Using re.compile makes your code more dynamic, as you can call the already compiled regex instead of compiling it again and again. This benefits you whenever the same pattern is reused many times.
Example :
import re

example_string = "The room number of her room is 26A7B."
find_alpha_numeric_string = re.compile(r"\b\w+\b")

# Find all the word tokens in the string
find_alpha_numeric_string.findall(example_string)

# Find the first word token in the string
find_alpha_numeric_string.search(example_string)
Similarly, you can reuse the same compiled pattern for Match and Substitute, as shown below:
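(Continuing the example above, a small illustrative sketch of match and sub on the same compiled object:)

import re

example_string = "The room number of her room is 26A7B."
find_alpha_numeric_string = re.compile(r"\b\w+\b")

# Match: succeeds only if the pattern matches at the start of the string
print(find_alpha_numeric_string.match(example_string))

# Substitute: replace every word token with a placeholder
print(find_alpha_numeric_string.sub("<word>", example_string))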
i'd like to motivate that pre-compiling is both conceptually and 'literately' (as in 'literate programming') advantageous. have a look at this code snippet:
from re import compile as _Re

class TYPO:

    def text_has_foobar( self, text ):
        return self._text_has_foobar_re_search( text ) is not None

    _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()
in your application, you'd write:
from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar' ) )
this is about as simple in terms of functionality as it can get. because this example is so short, i conflated the way to get _text_has_foobar_re_search
all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO
library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re
and the overhead of that cache are irrelevant here.
compare this with the more usual style, below:
import re

class Typo:

    def text_has_foobar( self, text ):
        return re.compile( r"""(?i)foobar""" ).search( text ) is not None
In the application:
typo = Typo()
print( typo.text_has_foobar( 'FOObar' ) )
I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re
caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.
be it said that every subset of measures---targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups---can help reduce computational and conceptual complexity.
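(As a middle ground between the two styles, here is a hedged sketch that compiles once at module level and aliases only the bound method, without the singleton class; the names are illustrative only:)

import re

# one compile at import time; aliasing .search saves an attribute lookup per call
_foobar_search = re.compile(r"(?i)foobar").search

class Typo:
    def text_has_foobar(self, text):
        # one global lookup and one function call per invocation
        return _foobar_search(text) is not None

typo = Typo()
print(typo.text_has_foobar("FOObar"))   # True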
I ran this test before stumbling upon the discussion here. However, having run it I thought I'd at least post my results.
I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a MacBook running OS X 10.6 (2 GHz Intel Core 2 Duo, 4 GB RAM). Python version is 2.6.1.
Run 1 - using re.compile
import re
import time
import fpformat

Regex1 = re.compile('^(a|b|c|d|e|f|g)+$')
Regex2 = re.compile('^[a-g]+$')

TimesToDo = 1000
TestString = ""
for i in range(1000):
    TestString += "abababdedfg"

StartTime = time.time()
for i in range(TimesToDo):
    Regex1.search(TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds, 3) + " seconds"

StartTime = time.time()
for i in range(TimesToDo):
    Regex2.search(TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds, 3) + " seconds"
Alternation takes 2.299 seconds
Character Class takes 0.107 seconds
Run 2 - Not using re.compile
import re
import time
import fpformat

TimesToDo = 1000
TestString = ""
for i in range(1000):
    TestString += "abababdedfg"

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$', TestString)
Seconds = time.time() - StartTime
print "Alternation takes " + fpformat.fix(Seconds, 3) + " seconds"

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^[a-g]+$', TestString)
Seconds = time.time() - StartTime
print "Character Class takes " + fpformat.fix(Seconds, 3) + " seconds"
Alternation takes 2.508 seconds
Character Class takes 0.109 seconds
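(The fpformat module was removed in Python 3, so here is a rough Python 3 port of Run 2 for anyone who wants to repeat the measurement; it is a sketch only and does not reproduce the timings above.)

import re
import time

TimesToDo = 1000
TestString = "abababdedfg" * 1000

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$', TestString)
print("Alternation takes %.3f seconds" % (time.time() - StartTime))

StartTime = time.time()
for i in range(TimesToDo):
    re.search('^[a-g]+$', TestString)
print("Character Class takes %.3f seconds" % (time.time() - StartTime))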
Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:
def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)
In addition, re.compile() bypasses the extra indirection and caching logic:
_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p
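(Both _cache and _MAXCACHE are CPython implementation details and can change between releases, but you can poke at them interactively to watch the cache fill up; re.purge() is the public way to clear it:)

import re

print(re._MAXCACHE)      # implementation detail; 512 on recent CPython releases

re.purge()               # public API: clear the regular expression cache
re.match(r'\d+', '123')
re.match(r'[a-z]+', 'abc')
print(len(re._cache))    # 2: both patterns were compiled and cached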
In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where they are applied:
#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?') # Integer or decimal number
assign_pattern = re.compile(r':=') # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+') # Identifiers
whitespace_pattern = re.compile(r'[\t ]+') # Spaces and tabs
#### Applications ########################################################
if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()
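(To make the readability point concrete, here is a purely illustrative sketch of how such named patterns might drive a tiny scanner loop; the scan function and token names below are not part of the original answer:)

import re

number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

def scan(source):
    """Yield (kind, text) pairs; characters no rule matches are skipped."""
    rules = [('NUMBER', number_pattern), ('ASSIGN', assign_pattern),
             ('NAME', identifier_pattern), ('WS', whitespace_pattern)]
    pos = 0
    while pos < len(source):
        for kind, pattern in rules:
            m = pattern.match(source, pos)
            if m:
                if kind != 'WS':               # drop whitespace tokens
                    yield kind, m.group()
                pos = m.end()
                break
        else:
            pos += 1                           # skip unrecognized characters

print(list(scan("total := 41.5")))
# [('NAME', 'total'), ('ASSIGN', ':='), ('NUMBER', '41.5')]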
Note: one other respondent incorrectly believed that .pyc files store compiled patterns directly; in reality, the patterns are rebuilt each time the .pyc is loaded:
>>> import marshal
>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
...     f.read(8)            # skip the pyc header (magic number + timestamp)
...     dis(marshal.load(f))
  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE
The above disassembly comes from the PYC file for a tmp.py
containing:
import re
lc_vowels = re.compile(r'[aeiou]{2,5}')
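(If you want to repeat this check on a current interpreter, note that the .pyc header grew to 16 bytes in CPython 3.7+ and the file is written under __pycache__; a hedged sketch, assuming tmp.py exists as shown above:)

import dis
import marshal
import py_compile

pyc_path = py_compile.compile('tmp.py')   # returns the path under __pycache__
with open(pyc_path, 'rb') as f:
    f.read(16)                            # magic + flags + source mtime + size
    dis.dis(marshal.load(f))              # the call to re.compile is still in the bytecode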