Bear with me, I can't include my 1,000+ line program, and there are a couple of questions in the description.
So I have a couple types of patterns I am searching for:
This is a tricky subject: many answers, even some legitimate sources such as David Beazley's Python Cookbook, will tell you something like:
[Use compile()] when you’re going to perform a lot of matches using the same pattern. This lets you compile the regex only once versus at each match. [see p. 45 of that book]
However, that really hasn't been true since sometime around Python 2.5. Here's a note straight out of the re docs:
Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
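To make the caching concrete, here is a minimal sketch. It relies on CPython's current behaviour, where the cache hands back the same compiled object for an identical pattern and flags; treat that identity as an implementation detail rather than a documented guarantee.

```python
import re

re.purge()  # clear the cache so the demonstration starts from a known state

pattern = r"(?u)\b\w\w+\b"

first = re.compile(pattern)
second = re.compile(pattern)  # same pattern and flags: served from the cache

# In CPython both names refer to the very same compiled pattern object.
same_object = first is second

# The module-level function consults the same cache, so this call
# does not compile the pattern again.
words = re.findall(pattern, "compile once, reuse often")
```

Calling re.purge() empties the cache, which is mainly useful for experiments like this one.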
There are two small arguments against this, but (anecdotally speaking) these won't result in noticeable timing differences the majority of the time:
Here's a rudimentary test of the above using the 20 newsgroups text dataset. On a relative basis, the improvement in speed is about 1.6% with compiling, presumably due mostly to cache lookup.
import re
from sklearn.datasets import fetch_20newsgroups
# A list of length ~20,000, paragraphs of text
news = fetch_20newsgroups(subset='all', random_state=444).data
# The tokenizer used by most text-processing vectorizers such as TF-IDF
regex = r'(?u)\b\w\w+\b'
regex_comp = re.compile(regex)
def no_compile():
    for text in news:
        re.findall(regex, text)

def with_compile():
    for text in news:
        regex_comp.findall(text)

%timeit -r 3 -n 5 no_compile()
1.78 s ± 16.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
%timeit -r 3 -n 5 with_compile()
1.75 s ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
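If you don't have scikit-learn or IPython handy, a similar comparison can be sketched with only the standard library. The corpus here is a synthetic stand-in (an assumption, not the 20 newsgroups data), so the absolute numbers will differ from those above.

```python
import re
import timeit

# Synthetic stand-in for the corpus (an assumption, not the real dataset)
texts = ["the quick brown fox jumps over the lazy dog " * 20] * 200

pattern = r"(?u)\b\w\w+\b"
compiled = re.compile(pattern)

def no_compile():
    # Module-level function: looks the pattern up in the cache on every call
    for text in texts:
        re.findall(pattern, text)

def with_compile():
    # Precompiled pattern object: no cache lookup needed
    for text in texts:
        compiled.findall(text)

# Take the best of 3 runs, 5 loops each, mirroring the %timeit settings
t_module = min(timeit.repeat(no_compile, number=5, repeat=3))
t_compiled = min(timeit.repeat(with_compile, number=5, repeat=3))
print(f"module-level: {t_module:.3f} s, precompiled: {t_compiled:.3f} s")
```

Expect the two figures to be close; which one comes out ahead on a given run may vary with machine noise.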
That really only leaves one very defensible reason to use re.compile():
By precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action. [source; p. 15]
It's not uncommon to see constants declared at the top of a module with compile. For example, in smtplib you'll find OLDSTYLE_AUTH = re.compile(r"auth=(.*)", re.I).
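As a rough sketch of that module-level-constant style: the pattern below is the one from smtplib, while the helper function `parse_auth_methods` is hypothetical and exists only to show the constant being reused.

```python
import re

# Compiled once, when the module is imported (pattern borrowed from smtplib)
OLDSTYLE_AUTH = re.compile(r"auth=(.*)", re.I)

def parse_auth_methods(line):
    """Hypothetical helper: return the advertised auth methods, or []."""
    match = OLDSTYLE_AUTH.match(line)
    if match is None:
        return []
    return match.group(1).split()
```

Every call to the helper reuses the already-compiled pattern, so none of the compilation cost lands inside the function.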
Note that compiling happens (eventually) whether or not you use re.compile(). When you do use compile(), you're compiling the passed regex at that moment. If you use the module-level functions like re.search(), you're compiling and searching in this one call. The two processes below are equivalent in this regard:
# With re.compile(): get a compiled pattern object (re.Pattern),
# then call its .search() method.
a = re.compile('regex(es|p)')  # compiling happens now
a.search('regexp')             # searching happens now

# With the module-level function
re.search('regex(es|p)', 'regexp')  # compiling and searching both happen here
Lastly you asked,
Is there a better way to match regular words without regex?
Yes; this is mentioned as a "common problem" in the HOWTO:
Sometimes using the re module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine. [emphasis added]
...
In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
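To illustrate, here are a few fixed-string tasks done with plain string methods, with the rough re equivalent noted in each comment. The sample text is made up for the example.

```python
text = "error: disk full; error: retry later"

has_error = "error" in text                    # instead of re.search("error", text)
starts_with_error = text.startswith("error:")  # instead of re.match("error:", text)
count = text.count("error")                    # instead of len(re.findall("error", text))
cleaned = text.replace("error: ", "")          # instead of re.sub("error: ", "", text)
parts = text.split("; ")                       # instead of re.split("; ", text)
```

Once you need word boundaries, alternation, or case-insensitive matching, the re module becomes the better tool again.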