Regular Expressions in Python unexpectedly slow

迷失自我 2021-02-01 16:16

Consider this Python code:

import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello

4 Answers
  • 2021-02-01 16:23

    Your regexp is made up of three alternatives. How exactly do you think that works if the regexp doesn't check all three? :-) There's no magic in computing; you still have to do three checks.

    But the regexp will do each of the three tests character by character, while the one() method will check the whole string for one match before going on to the next one.

    The regexp is much faster in the first case because the string that matches is the one you check for last. That means one() first has to look through the whole string for "foo", then for "bar", and only then for "hello", where it matches. Move "hello" first, and one() and two() run at almost the same speed, because the first check succeeds in both cases.

    Regexps are much more complex tests than "in", so I'd expect them to be slower. I suspect that this complexity increases a lot when you use "|", but I haven't read the source for the regexp library, so what do I know. :-)
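
To check the ordering argument above on your own machine, here is a minimal timing sketch. The names contains_any, regex_any and compare, as well as both test strings, are assumptions made for illustration; the strings used in the original question are not shown in full. No particular timings are claimed, since they depend on the Python version and the input.

```python
import re
import timeit

NEEDLES = ("foo", "bar", "hello")
# Build the alternation from the same needles so both functions test the same thing.
PATTERN = re.compile("|".join(map(re.escape, NEEDLES)))


def contains_any(text, needles=NEEDLES):
    """Substring scan: searches all of `text` for one needle before trying the next."""
    return any(needle in text for needle in needles)


def regex_any(text, pattern=PATTERN):
    """Regex scan: one left-to-right pass, trying every alternative at each position."""
    return pattern.search(text) is not None


def compare(text, number=1000):
    """Return (seconds for contains_any, seconds for regex_any) over `number` runs."""
    t_in = timeit.timeit(lambda: contains_any(text), number=number)
    t_re = timeit.timeit(lambda: regex_any(text), number=number)
    return t_in, t_re


if __name__ == "__main__":
    early = "hello" + "x" * 10000  # hypothetical input: the last needle matches at the start
    absent = "goodbye" * 1000      # hypothetical input: nothing matches
    for label, text in (("early match", early), ("no match", absent)):
        t_in, t_re = compare(text)
        print(f"{label}: in={t_in:.4f}s  re={t_re:.4f}s")
```

Changing NEEDLES to put "hello" first is a quick way to see how much of the gap comes from needle order rather than from the regex engine itself.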

  • 2021-02-01 16:25

    Note to future readers

    I think the correct answer is actually that Python's string handling algorithms are really optimized for this case, and the re module is actually a bit slower. What I've written below is true, but is probably not relevant to the simple regexps I have in the question.

    Original Answer

    Apparently this is not a random fluke - Python's re module really is slower. It looks like it uses a recursive backtracking approach when it fails to find a match, as opposed to building a DFA and simulating it.

    It uses the backtracking approach even when there are no back references in the regular expression!

    What this means is that in the worst case, Python regexes take exponential rather than linear time!

    This is a very detailed paper describing the issue: http://swtch.com/~rsc/regexp/regexp1.html

    I think this graph near the end summarizes it succinctly: [graph: performance of various regular expression implementations, time vs. string length]
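
The worst case described above can be seen with the test pattern from the linked article: n optional a's followed by n required a's, matched against a string of n a's. This is a minimal sketch; the function name pathological_match_time is made up for illustration, n is kept small deliberately, and the measured time includes compiling the pattern. On a backtracking engine the time is expected to grow very quickly with n, but the exact numbers depend on the Python version and machine.

```python
import re
import time


def pathological_match_time(n):
    """Match 'a?' * n + 'a' * n against 'a' * n and return (matched, seconds)."""
    pattern = "a?" * n + "a" * n
    text = "a" * n
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the amount of backtracking can roughly double with each extra 'a?'.
    for n in range(10, 21, 2):
        matched, seconds = pathological_match_time(n)
        print(f"n={n:2d} matched={matched} time={seconds:.4f}s")
```

The simple alternation in the question does not trigger this behavior, which fits the note at the top of this answer.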

  • 2021-02-01 16:27

    My coworker found the re2 library (https://code.google.com/p/re2/). There is a Python wrapper. It can take a bit of work to install on some systems.

    I was having the same issue with some complex regexes and long strings -- re2 sped the processing time up significantly -- from seconds to milliseconds.
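
A rough sketch of how such a wrapper is typically dropped in. It assumes the installed wrapper exposes a module named re2 with an re-like compile() and search(); which wrapper the answer refers to is not stated, so treat the import as an assumption. The fallback to the standard re module only keeps the sketch runnable when no wrapper is installed, and has_keyword is a hypothetical helper name.

```python
try:
    # Assumption: a Python wrapper for RE2 that mirrors the standard re interface.
    import re2 as engine
except ImportError:
    # Fallback so the sketch still runs; this is the ordinary backtracking engine.
    import re as engine

PATTERN = engine.compile("foo|bar|hello")


def has_keyword(text):
    """Return True if any of the alternatives occurs anywhere in `text`."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    print(has_keyword("say hello"))
    print(has_keyword("goodbye" * 3))
```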

  • 2021-02-01 16:32

    The regex is so slow because it not only has to go through the whole string, but also has to do several calculations at every character.

    The first one simply does this:

    Does f match h? No.
    Does b match h? No.
    Does h match h? Yes.
    Does e match e? Yes.
    Does l match l? Yes.
    Does l match l? Yes.
    Does o match o? Yes.
    Done. Match found.
    

    The second one does this:

    Does f match g? No.
    Does b match g? No.
    Does h match g? No.
    Does f match o? No.
    Does b match o? No.
    Does h match o? No.
    Does f match o? No.
    Does b match o? No.
    Does h match o? No.
    Does f match d? No.
    Does b match d? No.
    Does h match d? No.
    Does f match b? No.
    Does b match b? Yes.
    Does a match y? No.
    Does h match b? No.
    Does f match y? No.
    Does b match y? No.
    Does h match y? No.
    Does f match e? No.
    Does b match e? No.
    Does h match e? No.
    ... 999 more times ...
    Done. No match found.
    

    I can only speculate about the difference between any and the regex, but I'm guessing the regex is slower mostly because it runs in a highly complex engine, and with state machine stuff and everything, it just isn't as efficient as a specific implementation (in).

    In the first string, the regex will find a match almost instantaneously, while any has to loop through the string twice before finding anything.

    In the second string, however, any performs essentially the same steps as the regex, but in a different order. This suggests that the any solution is faster, probably because it is simpler.

    Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.

    Disclaimer: I don't know Python. I know algorithms.
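
The two comparison traces above can be reproduced with a deliberately naive matcher. This is only a model of the reasoning in this answer, not how Python's re module is actually implemented; the function name trace_alternation and the sample strings "hello" and "goodbye" are assumptions chosen to line up with the traces.

```python
def trace_alternation(text, needles=("foo", "bar", "hello")):
    """Try each needle at each position, logging every character comparison.

    Returns (found, steps), where steps is the list of logged comparisons.
    """
    steps = []
    for start in range(len(text)):
        for needle in needles:
            matched = True
            for offset, expected in enumerate(needle):
                pos = start + offset
                if pos >= len(text):
                    # Ran off the end of the text: this needle cannot match here.
                    matched = False
                    break
                actual = text[pos]
                ok = actual == expected
                steps.append(f"Does {expected} match {actual}? {'Yes' if ok else 'No'}.")
                if not ok:
                    matched = False
                    break
            if matched:
                return True, steps
    return False, steps


if __name__ == "__main__":
    for sample in ("hello", "goodbye"):
        found, steps = trace_alternation(sample)
        print("\n".join(steps))
        print("Done. Match found." if found else "Done. No match found.")
```

Counting the logged steps makes the cost difference concrete: a match at the very first position ends after a handful of comparisons, while a string with no match pays for every needle at every position.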
