Bear with me, I can't include my 1,000+ line program, and there are a couple of questions in the description.
So I have a couple of types of patterns I am searching for.
Let's say that word1, word2, ... are regexes; let's rewrite those parts:
allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]
I would create one single regex for all patterns:
allWords = re.compile("|".join(["word1", "word2", "word3"]))
To support regexes with | in them, you would have to parenthesize the expressions:
allWords = re.compile("|".join("({})".format(x) for x in ["word1", "word2", "word3"]))
(that also works with standard words of course, and it's still worth using regexes because of the | part)
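For instance, with hypothetical patterns where one sub-expression contains its own alternation, the parentheses keep each one self-contained:

```python
import re

# hypothetical example patterns; "cat|dog" has an alternation of its own
patterns = ["cat|dog", "bird"]
allWords = re.compile("|".join("({})".format(x) for x in patterns))

print(allWords.pattern)                      # (cat|dog)|(bird)
print(bool(allWords.search("a dog ran")))    # True
print(bool(allWords.search("a fish swam")))  # False
```

Without the parentheses the pattern would read `cat|dog|bird`, silently changing the grouping of the first expression.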
Now, this disguised loop with each term hardcoded:
def bar(data, allWords):
    if allWords[0].search(data):
        temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
        return temp
    elif allWords[1].search(data):
        temp = data.split("word2", 1)[1]
        return temp
can be rewritten simply as
def bar(data, allWords):
    return allWords.split(data, maxsplit=1)[1]
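A quick sketch of the rewritten function in action, using the ungrouped combined pattern from above (the sample string is made up; note that capturing groups in the pattern would also appear in the split result):

```python
import re

# combined pattern with no capturing groups, so split() drops the separator
allWords = re.compile("|".join(["word1", "word2", "word3"]))

def bar(data, allWords):
    # split on the first matching pattern and return what follows it
    return allWords.split(data, maxsplit=1)[1]

print(bar("prefix word2 suffix", allWords))  # -> " suffix"
```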
In terms of performance: the last hiccup is that internally the regex engine searches for the alternatives in a loop, which makes this an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first (my hypothesis is that the regexes are "disjoint", meaning that a text cannot be matched by several of them; otherwise the longest one would have to come before the shorter one).
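The ordering caveat can be shown with two made-up overlapping patterns: Python's alternation tries the alternatives left to right and takes the first that matches, so a shorter prefix placed first shadows the longer one:

```python
import re

# when patterns overlap, order matters: alternatives are tried left to right
short_first = re.compile("|".join(["abc", "abcdef"]))
long_first = re.compile("|".join(["abcdef", "abc"]))

print(short_first.search("xxabcdefyy").group())  # 'abc'    (shorter one shadows the longer)
print(long_first.search("xxabcdefyy").group())   # 'abcdef' (longest placed first wins)
```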