When to use re.compile

遇见更好的自我 2021-02-15 17:09

Bear with me: I can't include my 1,000+ line program, and there are a couple of questions in the description.

So I have a couple of types of patterns I am searching for:

2 answers

臣服心动 (original poster)

2021-02-15 18:01
Let's say that word1, word2, ... are regexes. Let's rewrite those parts:
```
import re
allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]
```
I would create one single regex for all patterns:
```
allWords = re.compile("|".join(["word1", "word2", "word3"]))
```
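If the entries are literal words rather than regexes, it may be safer to escape them before joining. A minimal sketch, assuming made-up words that happen to contain regex metacharacters (`literal_words` and `sample` are not from the question):

```python
import re

# Hypothetical literal words; "." and "+" are regex metacharacters.
literal_words = ["a.b", "1+1"]

# re.escape() makes each word match itself literally before joining with "|".
escaped = re.compile("|".join(re.escape(w) for w in literal_words))
unescaped = re.compile("|".join(literal_words))

sample = "axb a.b 11 1+1"
escaped_hits = escaped.findall(sample)
unescaped_hits = unescaped.findall(sample)

print(escaped_hits)    # ['a.b', '1+1']
print(unescaped_hits)  # ['axb', 'a.b', '11']
```

Without escaping, the words are interpreted as patterns, so the unescaped version matches text that only looks similar.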
To support regexes with | in them, you would have to parenthesize the expressions:
```
# (?:...) is non-capturing, so re.split() does not return the matched text as extra items
allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))
```
(That also works with plain words, of course, and it is still worth using a regex because of the | alternation.)
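One caveat when parenthesizing: plain `(...)` groups are capturing, and `re.split` puts the captured text (or `None` for groups that did not take part) into its result. A small sketch with made-up patterns, comparing capturing and non-capturing groups:

```python
import re

# Hypothetical patterns; the second one contains its own "|".
patterns = ["foo", "bar|baz"]

capturing = re.compile("|".join("({})".format(p) for p in patterns))
non_capturing = re.compile("|".join("(?:{})".format(p) for p in patterns))

text = "xx baz yy"

# Each capturing group adds one slot to the split result.
with_groups = capturing.split(text, maxsplit=1)
without_groups = non_capturing.split(text, maxsplit=1)

print(with_groups)     # ['xx ', None, 'baz', ' yy']
print(without_groups)  # ['xx ', ' yy']
```

With capturing groups, index `[1]` of the split result is no longer the text after the match, so non-capturing groups (or taking the last element) are probably the safer choice here.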

Now, this is a disguised loop with each term hard-coded:
```
def bar(data, allWords):
   if allWords[0].search(data):
      temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
      return(temp)

   elif allWords[1].search(data):
      temp = data.split("word2", 1)[1]
      return(temp)
```
It can be rewritten simply as:
```
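def bar(data, allWords):
   # allWords is the single compiled alternation; maxsplit=1 splits at the first match only
   parts = allWords.split(data, maxsplit=1)
   return parts[-1] if len(parts) > 1 else None  # None when nothing matched, as in the original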
```
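If you also need to know which word matched, a `search`-based variant is one option. This is only a sketch: the function name `split_after_first` and the sample strings are hypothetical, and the words are treated as literals.

```python
import re

WORDS = ["word1", "word2", "word3"]  # placeholder literals from the answer
all_words = re.compile("|".join(re.escape(w) for w in WORDS))

def split_after_first(data, pattern):
    """Return (matched_text, text_after_match), or None if nothing matches."""
    m = pattern.search(data)
    if m is None:
        return None
    return m.group(0), data[m.end():]

found = split_after_first("header word2 payload", all_words)
missing = split_after_first("nothing to see", all_words)
# The leftmost match in the data wins, regardless of the order in WORDS.
leftmost = split_after_first("a word2 b word1 c", all_words)

print(found)     # ('word2', ' payload')
print(missing)   # None
print(leftmost)  # ('word2', ' b word1 c')
```

Note the last case: the original if/elif chain would have split on word1 because it is tested first, while a single alternation splits at whichever word appears first in the data. The two only agree when at most one of the words occurs.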
In terms of performance:
- The regular expression is compiled once at the start, so matching does not pay the compilation cost again.
- There is no loop and no copy-pasted branch; the "or" part is done by the regex engine, which is usually compiled code (C in CPython). That is hard to beat in pure Python.
- The match and the split are done in one operation.

The last hiccup is that internally the regex engine tries the alternatives one after another, which makes matching O(n) in the number of patterns. To make it faster, you would have to predict which pattern is the most frequent and put it first. (My hypothesis is that the regexes are "disjoint", meaning a text cannot be matched by several of them; otherwise the longer pattern would have to come before the shorter one, because the first alternative that matches wins.)
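
To illustrate the ordering point: Python's `re` takes the first alternative that matches at a position, not the longest one. A minimal sketch with two deliberately overlapping, made-up words:

```python
import re

words = ["word", "wordy"]  # overlapping on purpose (hypothetical)

# Alternatives in the given order: "word" is tried first and wins.
naive = re.compile("|".join(re.escape(w) for w in words))

# Longest first, so the longer word gets a chance before its prefix.
by_length = sorted(words, key=len, reverse=True)
longest_first = re.compile("|".join(re.escape(w) for w in by_length))

naive_hit = naive.search("wordy!").group(0)
longest_hit = longest_first.search("wordy!").group(0)

print(naive_hit)    # word
print(longest_hit)  # wordy
```

If the patterns really are disjoint, this ordering does not change the result, and sorting by expected frequency instead is a reasonable choice.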