Is there any benefit in using compile for regular expressions in Python?
h = re.compile('hello')
h.match('hello world')
vs
re.match('hello', 'hello world')
For me, the biggest benefit of re.compile is being able to separate the definition of the regex from its use.
Even a simple expression such as 0|[1-9][0-9]* (a base-10 integer without leading zeros) can be complex enough that you'd rather not have to retype it, check whether you made any typos, and then recheck for typos again when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.
It's certainly possible to store strings and pass them to re.match; however, that's less readable:
num = "..."
# then, much later:
m = re.match(num, input)
Versus compiling:
num = re.compile("...")
# then, much later:
m = num.match(input)
Though the two are fairly close, the last line of the second version feels more natural and simpler when used repeatedly.
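To make that concrete, here's a minimal sketch of the base-10 integer example (num_b10 is just the illustrative name from above; fullmatch is used so the whole string has to be the integer):
import re

# Define once, with a descriptive name; the pattern itself is from above.
num_b10 = re.compile('0|[1-9][0-9]*')

# Much later, the call sites stay readable:
num_b10.fullmatch('2048')  # matches: a base-10 integer, no leading zeros
num_b10.fullmatch('007')   # returns None: leading zeros are rejected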
Here is an example where using re.compile is over 50 times faster, as requested.
The point is just the same as the one I made in the comment above, namely that using re.compile can be a significant advantage when your usage is such that it doesn't benefit much from the compilation cache. This happens at least in one particular case (that I ran into in practice), namely when all of the following are true:
- you have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
- you use these regexes many times, and
- consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes, so that each one gets flushed from the cache between consecutive usages.
import re
import time
def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                for i in range(ord('a'), ord('z') + 1)
                for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    assert re._MAXCACHE < len(patterns)
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)
def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count
def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count
def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count
start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')
start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')
start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')
print(f'Ratio: {d3/d1:.2f}')
Example output I get on my laptop (Python 3.7.7):
With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.
Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.
Without re.compile:
searching
676000
-- That took 23.54 seconds.
Ratio: 70.89
I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times before moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes for each string), it is significantly worse, as expected. Increasing the cache size also works: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production, as names with underscores are conventionally "private") drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.
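For completeness, the cache-size tweak mentioned above is just the following (a sketch, not a recommendation: re._MAXCACHE is a private CPython implementation detail and could change or disappear):
import re

# Not for production: _MAXCACHE is private. Raising it above the number
# of distinct patterns (26 * 26 = 676 here) stops the cache thrashing
# seen in the slow run above.
re._MAXCACHE = 1024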
According to the Python documentation:
The sequence
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
So my conclusion is: if you are going to match the same pattern against many different texts, it's better to precompile it.
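As a minimal sketch of that conclusion (the pattern and function names here are hypothetical):
import re

# Compiled once at import time, then reused on every call.
DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_dates(texts):
    # One pattern, many texts: the case where precompiling pays off.
    return [m.group() for t in texts for m in DATE.finditer(t)]

extract_dates(['due 2024-01-31', 'no date here'])  # ['2024-01-31']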
As an alternative answer, as I see that it hasn't been mentioned before, I'll go ahead and quote the Python 3 docs:
Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.
I've had a lot of experience running a compiled regex thousands of times versus compiling on the fly, and have not noticed any perceivable difference.
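Here's a rough timeit sketch to check this for yourself (absolute numbers will of course vary by machine; on CPython the module-level call typically pays a small extra cost for the wrapper function and the cache lookup):
import timeit

# Module-level re.match: pays for a cache lookup on each call.
t1 = timeit.timeit("re.match('hello', 'hello world')",
                   setup='import re', number=1_000_000)

# Precompiled: skips the module-level wrapper and the cache lookup.
t2 = timeit.timeit("h.match('hello world')",
                   setup="import re; h = re.compile('hello')",
                   number=1_000_000)

print(t1, t2)  # t1 is usually somewhat larger, but both are fast per call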
The votes on the accepted answer lead to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference is when you have to decide whether to accept a regex string or a compiled regex object as a parameter to a function:
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: x.match(y) # accepts compiled regex as parameter
... h=re.compile('hello')
... """, stmt="f(h, 'hello world')")
0.32881879806518555
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: re.compile(x).match(y) # compiles when called
... """, stmt="f('hello', 'hello world')")
0.809190034866333
It is always better to compile your regexes if you are going to reuse them.
Note the example in the timeit above simulates creation of a compiled regex object once at import time versus "on-the-fly" when required for a match.
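One hedged sketch of how a function can accept either form (make_matcher is a hypothetical helper, not a standard API) is to normalize strings to compiled patterns once, up front:
import re

def make_matcher(pattern):
    # Accept either a pattern string or an already-compiled regex object,
    # compiling strings exactly once rather than on every call.
    if isinstance(pattern, str):
        pattern = re.compile(pattern)
    return pattern.match

match_hello = make_matcher('hello')  # compiled here, once
match_hello('hello world')           # no per-call compile or cache lookup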
Here's a simple test case:
~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
1 loops, best of 3: 3.1 usec per loop
10 loops, best of 3: 2.41 usec per loop
100 loops, best of 3: 2.24 usec per loop
1000 loops, best of 3: 2.21 usec per loop
10000 loops, best of 3: 2.23 usec per loop
100000 loops, best of 3: 2.24 usec per loop
1000000 loops, best of 3: 2.31 usec per loop
with re.compile:
~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
1 loops, best of 3: 1.91 usec per loop
10 loops, best of 3: 0.691 usec per loop
100 loops, best of 3: 0.701 usec per loop
1000 loops, best of 3: 0.684 usec per loop
10000 loops, best of 3: 0.682 usec per loop
100000 loops, best of 3: 0.694 usec per loop
1000000 loops, best of 3: 0.702 usec per loop
So, it would seem that compiling is faster in this simple case, even if you only match once.