Regular Expressions in Python unexpectedly slow

迷失自我 2021-02-01 16:16

Consider this Python code:

import timeit
import re

def one():
    any(s in mystring for s in ('foo', 'bar', 'hello'))

r = re.compile('(foo|bar|hello


        
4 Answers
  •  醉酒成梦
    2021-02-01 16:32

    The regex is slow because it not only has to go through the whole string, it also has to do several comparisons at every character.

    The first one simply does this:

    Does f match h? No.
    Does b match h? No.
    Does h match h? Yes.
    Does e match e? Yes.
    Does l match l? Yes.
    Does l match l? Yes.
    Does o match o? Yes.
    Done. Match found.
    

    The second one does this:

    Does f match g? No.
    Does b match g? No.
    Does h match g? No.
    Does f match o? No.
    Does b match o? No.
    Does h match o? No.
    Does f match o? No.
    Does b match o? No.
    Does h match o? No.
    Does f match d? No.
    Does b match d? No.
    Does h match d? No.
    Does f match b? No.
    Does b match b? Yes.
    Does a match y? No.
    Does h match b? No.
    Does f match y? No.
    Does b match y? No.
    Does h match y? No.
    Does f match e? No.
    Does b match e? No.
    Does h match e? No.
    ... 999 more times ...
    Done. No match found.
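
The two traces above can be reproduced with a small sketch. This is a deliberately naive alternation search, not how Python's re engine is actually implemented; the function name trace_alternation and the short input strings are assumptions made for illustration.

```python
def trace_alternation(text, words):
    """Naive alternation search: at each start position, try each word in turn.

    Returns (found, steps), where steps lists every character comparison made.
    Illustrative only; a real regex engine does not work exactly like this.
    """
    steps = []
    for start in range(len(text)):
        for word in words:
            matched = True
            for offset, ch in enumerate(word):
                pos = start + offset
                if pos >= len(text):
                    # Ran off the end of the text: this word cannot match here.
                    matched = False
                    break
                ok = ch == text[pos]
                steps.append(f"Does {ch} match {text[pos]}? {'Yes' if ok else 'No'}.")
                if not ok:
                    matched = False
                    break
            if matched:
                return True, steps
    return False, steps


words = ("foo", "bar", "hello")

# Match at the very start: only a handful of comparisons are needed.
found, steps = trace_alternation("hello", words)
print("\n".join(steps))
print("Match found." if found else "No match found.")

# No match anywhere: every position costs several comparisons.
found, steps = trace_alternation("goodbye", words)
print(len(steps), "comparisons for one 'goodbye';", "match" if found else "no match")
```

The point of the sketch is only the shape of the work: a match at the start ends the search early, while a miss pays for several comparisons at every position.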
    

    I can only speculate about the difference between any and the regex, but I'm guessing the regex is slower mostly because it runs in a highly complex engine, and with state machine stuff and everything, it just isn't as efficient as a specialized implementation such as in.

    In the first string, the regex finds a match almost instantly, while any has to scan the whole string twice (once for foo, once for bar) before hello matches.

    In the second string, however, any performs essentially the same comparisons as the regex, just in a different order. That suggests the any solution is faster there, probably because it is simpler.

    Specific code is more efficient than generic code. Any knowledge about the problem can be put to use in optimizing the solution. Simple code is preferred over complex code. Essentially, the regex is faster when the pattern will be near the start of the string, but in is faster when the pattern is near the end of the string, or not found at all.
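
One way to check that claim on your own machine is a timing sketch like the one below. The question's code is cut off, so everything here is an assumption: the pattern is rebuilt from the three words, and the test strings ("hello" and "goodbye" repeated 1000 times) are inferred from the traces rather than taken from the original. The helper names with_in, with_regex and compare are hypothetical.

```python
import re
import timeit

WORDS = ("foo", "bar", "hello")
# Assumed pattern: a plain alternation of the three words.
PATTERN = re.compile("|".join(re.escape(w) for w in WORDS))


def with_in(text):
    """Substring test using the in operator for each word."""
    return any(w in text for w in WORDS)


def with_regex(text):
    """The same test using one compiled alternation."""
    return PATTERN.search(text) is not None


def compare(text, number=200):
    """Time both approaches on the same text; returns seconds for each."""
    return {
        "in": timeit.timeit(lambda: with_in(text), number=number),
        "regex": timeit.timeit(lambda: with_regex(text), number=number),
    }


# Assumed inputs: one where the match is at the start, one with no match.
samples = {
    "match at start": "hello" * 1000,
    "no match": "goodbye" * 1000,
}

for label, text in samples.items():
    timings = compare(text)
    print(f"{label}: in={timings['in']:.4f}s regex={timings['regex']:.4f}s")
```

Absolute numbers depend on the Python version and the machine, so treat the output as a comparison between the two rows rather than a benchmark result.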

    Disclaimer: I don't know Python. I know algorithms.
