I tried to compare re.match
and re.search
using timeit
module and I found that match was better than search when the string I want to foun
On my machine (Python 2.7.3 on Mac OS 10.7.3, 1.7 GHz Intel Core i5), when done putting the string construction, importing re, and the regex compiling in setup, and performing 10000000 iterations instead of 10, I find the opposite:
import timeit
print timeit.timeit(stmt="r.match(s)",
setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
number = 10000000)
# 6.43165612221
print timeit.timeit(stmt="r.search(s)",
setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
number = 10000000)
# 3.85176897049
"So, the updated question is now why search is out-performing match?"
In this particular instance where a literal string is used rather than a regex pattern, indeed re.search
is slightly faster than re.match
for the default CPython implementation (I have not tested this in other incarnations of Python).
>>> print timeit.timeit(stmt="r.match(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
... number = 10000000)
3.29107403755
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
... number = 10000000)
2.39184308052
Looking into the C code behind those modules, it appears the search code has a built in optimisation to quickly match patterns prefixed with a string lateral. In the example above, the whole pattern is a literal string with no regex patterns and so this optimised routined is used to match the whole pattern.
Notice how the performance degrades once we introduce regex symbols and, as the literal string prefix gets shorter:
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hell.')",
... number = 10000000)
3.20765399933
>>>
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')",
... number = 10000000)
3.31512498856
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')",
... number = 10000000)
3.31983995438
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')",
... number = 10000000)
3.39261603355
For portion of the pattern that contain regex patterns, SRE_MATCH is used to determine matches. That is essentially the same code behind re.match
.
Note how the results are close (with re.match
marginally faster) if the pattern starts with a regex pattern instead of a literal string.
>>> print timeit.timeit(stmt="r.match(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
... number = 10000000)
3.22782492638
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
... number = 10000000)
3.31773591042
In other words, ignoring the fact that search
and match
have different purposes, re.search
is faster than re.match
only when the pattern is a literal string.
Of course, if you're working with literal strings, you're likely to be better off using string operations instead.
>>> # Detecting exact matches
>>> print timeit.timeit(stmt="s == r",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
0.339027881622
>>> # Determine if string contains another string
>>> print timeit.timeit(stmt="s in r",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
0.479326963425
>>> # detecting prefix
>>> print timeit.timeit(stmt="s.startswith(r)",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
1.49393510818
>>> print timeit.timeit(stmt="s[:len(r)] == r",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
1.21005606651