When to use re.compile

遇见更好的自我 2021-02-15 17:09

Bear with me: I can't include my 1,000+ line program, and there are a couple of questions in the description.

So I have a couple of types of patterns I am searching for:

2 Answers
  •  臣服心动
    2021-02-15 18:01

    Let's say that word1, word2, ... are regexes, and rewrite those parts:

    allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]
    

    I would create one single regex for all patterns:

    allWords = re.compile("|".join(["word1", "word2", "word3"]))
    

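    To support regexes with | in them, you would have to parenthesize the expressions. Use non-capturing groups, (?:...), because capturing groups would add their contents to the list returned by split:

    allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))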
    

    (That also works with plain words, of course, and it's still worth using a regex because of the | part.)
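
    As a rough illustration of the combined pattern, here is a minimal sketch. The sub-patterns (including the "word2|w2" one) and the sample string are made up for the example and are not taken from the question's program.

```python
import re

# Hypothetical sub-patterns; the second one contains its own "|".
patterns = ["word1", "word2|w2", "word3"]

# Wrap each sub-pattern in a non-capturing group so that an inner "|"
# stays local to it and no extra groups are created.
all_words = re.compile("|".join("(?:{})".format(p) for p in patterns))

match = all_words.search("prefix w2 suffix")
found = match.group(0) if match else None
print(found)  # w2
```

    Since no capturing groups are created, the compiled pattern behaves the same under search and split as a plain "a|b|c" alternation would.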

    Now, this is a disguised loop with each term hardcoded:

    def bar(data, allWords):
       if allWords[0].search(data):
          temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
          return(temp)
    
       elif allWords[1].search(data):
          temp = data.split("word2", 1)[1]
          return(temp)
    

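    It can be rewritten simply as follows (here allWords is the single compiled regex, not the list):

    def bar(data, allWords):
       # split at the first match only; the last element is the text after it
       parts = allWords.split(data, maxsplit=1)
       return parts[-1] if len(parts) > 1 else None  # None when nothing matches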
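
    To see how split behaves with and without capturing groups, here is a small sketch. The helper name text_after_first_match and the sample strings are assumptions made for illustration only.

```python
import re

all_words = re.compile("|".join(["word1", "word2", "word3"]))


def text_after_first_match(data, pattern):
    """Return the text following the first match, or None if nothing matches."""
    parts = pattern.split(data, maxsplit=1)
    # With no capturing groups, a successful split yields [before, after];
    # without a match, split returns the original string in a 1-element list.
    return parts[1] if len(parts) == 2 else None


print(repr(text_after_first_match("aaa word2 bbb", all_words)))  # ' bbb'
print(text_after_first_match("nothing here", all_words))         # None

# With capturing groups, split also returns every group (None for the
# alternatives that did not take part in the match), so index 1 is no
# longer the text after the match.
capturing = re.compile("(word1)|(word2)")
print(capturing.split("aaa word2 bbb", maxsplit=1))
# ['aaa ', None, 'word2', ' bbb']
```

    This is why the alternatives are better joined either bare or inside non-capturing groups when the result of split is indexed directly.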
    

    In terms of performance:

    • The regular expression is compiled once at the start, so no time is spent compiling it again on each call.
    • There is no Python-level loop and no copy-pasted branches: the "or" part is done by the regex engine, which is usually native compiled code (C in CPython). You can't beat that in pure Python.
    • The match and the split are done in one operation.

    The last hiccup is that, internally, the regex engine tries the alternatives one after another, which makes it an O(n) algorithm in the number of patterns. To make it faster, you would have to predict which pattern is the most frequent and put it first. (My hypothesis is that the regexes are "disjoint", meaning that a given text cannot be matched by several of them; otherwise the longer pattern would have to come before the shorter one.)
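
    The ordering caveat can be checked with a short sketch. The words "foo" and "foobar" are made-up examples, and re.escape is used here on the assumption that the terms are literal words rather than regexes.

```python
import re

# Hypothetical overlapping terms: "foo" is a prefix of "foobar".
words = ["foo", "foobar"]

# Alternatives are tried left to right, so the shorter term wins here.
naive = re.compile("|".join(words))
print(naive.search("foobar").group(0))  # foo

# Putting longer terms first lets the longer one match when both could.
longest_first = sorted(words, key=len, reverse=True)
ordered = re.compile("|".join(re.escape(w) for w in longest_first))
print(ordered.search("foobar").group(0))  # foobar
```

    Sorting by length conflicts with sorting by frequency, so putting the most frequent pattern first is only safe when the patterns really are disjoint.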
