re.match vs re.search performance difference

北荒 2021-02-19 01:53

I tried to compare re.match and re.search using the timeit module, and I found that match was better than search when the string I want to foun

2 answers
  •  离开以前
    2021-02-19 02:47

    "So, the updated question is now why search is out-performing match?"

    In this particular case, where the pattern is a literal string with no regex metacharacters, re.search is indeed slightly faster than re.match on the default CPython implementation (I have not tested other Python implementations).

    >>> import timeit
    >>> print(timeit.timeit(stmt="r.match(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
    ...     number=10000000))
    3.29107403755
    >>> print(timeit.timeit(stmt="r.search(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
    ...     number=10000000))
    2.39184308052
    

    Looking into the C code behind those functions, it appears the search code has a built-in optimisation to quickly match patterns prefixed with a string literal. In the example above, the whole pattern is a literal string with no regex metacharacters, so this optimised routine is used to match the whole pattern.
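
    To make the idea concrete, here is a rough pure-Python sketch of a "literal prefix first" search. It is only an illustration of the strategy described above, not CPython's actual implementation: the helper name `search_with_literal_prefix` and the way the pattern is split by hand into a literal prefix and a remainder are assumptions made for this example.

```python
import re

def search_with_literal_prefix(prefix, rest_pattern, s):
    """Locate `prefix` with a plain substring scan, then check the rest
    of the pattern with an anchored regex match at that position.

    Hypothetical helper for illustration; returns a (start, end) tuple
    or None, like the span of re.search(prefix + rest_pattern, s).
    """
    rest = re.compile(rest_pattern)
    start = s.find(prefix)
    while start != -1:
        # Pattern.match(string, pos) anchors the match attempt at pos.
        m = rest.match(s, start + len(prefix))
        if m is not None:
            return start, m.end()
        # The remainder did not match here; look for the next prefix hit.
        start = s.find(prefix, start + 1)
    return None

span_a = search_with_literal_prefix('hel', '.o', 'xxhelloab')
span_b = search_with_literal_prefix('hel', '.o', 'helix hello')
```

    The longer the literal prefix, the more of the work is done by the cheap substring scan and the less is left for the general matcher, which is consistent with the timings in this answer.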

    Notice how the performance degrades once we introduce regex metacharacters, and as the literal string prefix gets shorter:

    >>> print(timeit.timeit(stmt="r.search(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('hell.')",
    ...     number=10000000))
    3.20765399933
    >>> print(timeit.timeit(stmt="r.search(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')",
    ...     number=10000000))
    3.31512498856
    >>> print(timeit.timeit(stmt="r.search(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')",
    ...     number=10000000))
    3.31983995438
    >>> print(timeit.timeit(stmt="r.search(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')",
    ...     number=10000000))
    3.39261603355
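
    If you want to repeat this series on your own machine, a small loop saves retyping each call. This is a minimal Python 3 sketch: the helper name `time_search` and the reduced `number=1000` are choices made here to keep the run short, not part of the original measurements, and passing a lambda to timeit adds a little call overhead to every figure.

```python
import re
import timeit

def time_search(patterns, text, number=1000):
    """Return {pattern: seconds} for `number` calls of compiled.search(text).

    Hypothetical helper for illustration only.
    """
    results = {}
    for pattern in patterns:
        compiled = re.compile(pattern)
        # timeit accepts a callable; it is invoked `number` times.
        results[pattern] = timeit.timeit(lambda: compiled.search(text),
                                         number=number)
    return results

text = 'helloab' * 100000
patterns = ['hello', 'hell.', 'hel.o', 'he.lo', 'h.llo']
timings = time_search(patterns, text)
for pattern, seconds in timings.items():
    print(pattern, round(seconds, 4))
```

    Absolute numbers will differ between Python versions and machines, so compare the patterns against each other rather than against the figures quoted above.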
    

    For the portions of the pattern that contain regex metacharacters, SRE_MATCH is used to determine matches. That is essentially the same code that backs re.match.

    Note how the results are close (with re.match marginally faster) if the pattern starts with a regex metacharacter instead of a literal string.

    >>> print(timeit.timeit(stmt="r.match(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
    ...     number=10000000))
    3.22782492638
    >>> print(timeit.timeit(stmt="r.search(s)",
    ...     setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
    ...     number=10000000))
    3.31773591042
    

    In other words, ignoring the fact that search and match have different purposes, re.search is faster than re.match only when the pattern is a literal string.
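
    As a reminder of what "different purposes" means here, the following short sketch shows the behavioural difference between the two calls. The sample strings are made up for the example.

```python
import re

pattern = re.compile('hello')

# match only succeeds if the pattern matches at the very start of the string.
at_start = pattern.match('helloab')
not_at_start = pattern.match('abhello')

# search scans forward and reports the first position where the pattern matches.
found = pattern.search('abhello')

print(at_start is not None, not_at_start is None, found.span())
```

    In the benchmarks in this answer the string begins with "hello", so both calls succeed at position 0 and the timings compare like with like.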

    Of course, if you're working with literal strings, you're likely to be better off using string operations instead.

    >>> # Detecting exact matches
    >>> print(timeit.timeit(stmt="s == r",
    ...     setup="s = 'helloab'*100000; r = 'hello'",
    ...     number=10000000))
    0.339027881622

    >>> # Determining whether one string contains another
    >>> # (as timed here the operands are reversed; "r in s" asks whether s contains r)
    >>> print(timeit.timeit(stmt="s in r",
    ...     setup="s = 'helloab'*100000; r = 'hello'",
    ...     number=10000000))
    0.479326963425

    >>> # Detecting a prefix
    >>> print(timeit.timeit(stmt="s.startswith(r)",
    ...     setup="s = 'helloab'*100000; r = 'hello'",
    ...     number=10000000))
    1.49393510818
    >>> print(timeit.timeit(stmt="s[:len(r)] == r",
    ...     setup="s = 'helloab'*100000; r = 'hello'",
    ...     number=10000000))
    1.21005606651
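
    For reference, this sketch pairs each string operation with the regex call it can stand in for when the pattern is a plain literal. It assumes Python 3.4 or later, because `re.fullmatch` is not available in the Python 2 used for the timings above, and the variable names are chosen for this example only.

```python
import re

text = 'helloab' * 3
literal = 'hello'
escaped = re.escape(literal)  # guard against metacharacters in the literal

# Prefix test: the counterpart of re.match with a literal pattern.
prefix_agrees = text.startswith(literal) == (re.match(escaped, text) is not None)

# Containment test: the counterpart of re.search with a literal pattern.
contains_agrees = (literal in text) == (re.search(escaped, text) is not None)

# Whole-string equality: the counterpart of re.fullmatch with a literal pattern.
equal_agrees = (text == literal) == (re.fullmatch(escaped, text) is not None)

print(prefix_agrees, contains_agrees, equal_agrees)
```

    Each pair gives the same yes/no answer, so the choice between them comes down to speed and readability.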
    
