re.match vs re.search performance difference

后端 未结 2 1802
北荒
北荒 2021-02-19 01:53

I tried to compare re.match and re.search using timeit module and I found that match was better than search when the string I want to foun

相关标签:
2条回答
  • 2021-02-19 02:35

    On my machine (Python 2.7.3 on Mac OS 10.7.3, 1.7 GHz Intel Core i5), when done putting the string construction, importing re, and the regex compiling in setup, and performing 10000000 iterations instead of 10, I find the opposite:

    import timeit
    
    print timeit.timeit(stmt="r.match(s)",
                 setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
                 number = 10000000)
    # 6.43165612221
    print timeit.timeit(stmt="r.search(s)",
                 setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
                number = 10000000)
    # 3.85176897049
    
    0 讨论(0)
  • 2021-02-19 02:47

    "So, the updated question is now why search is out-performing match?"

    In this particular instance where a literal string is used rather than a regex pattern, indeed re.search is slightly faster than re.match for the default CPython implementation (I have not tested this in other incarnations of Python).

    >>> print timeit.timeit(stmt="r.match(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
    ...              number = 10000000)
    3.29107403755
    >>> print timeit.timeit(stmt="r.search(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
    ...             number = 10000000)
    2.39184308052
    

    Looking into the C code behind those modules, it appears the search code has a built in optimisation to quickly match patterns prefixed with a string lateral. In the example above, the whole pattern is a literal string with no regex patterns and so this optimised routined is used to match the whole pattern.

    Notice how the performance degrades once we introduce regex symbols and, as the literal string prefix gets shorter:

    >>> print timeit.timeit(stmt="r.search(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('hell.')",
    ...             number = 10000000)
    
    3.20765399933
    >>> 
    >>> print timeit.timeit(stmt="r.search(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')",
    ...             number = 10000000)
    3.31512498856
    >>> print timeit.timeit(stmt="r.search(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')",
    ...             number = 10000000)
    3.31983995438
    >>> print timeit.timeit(stmt="r.search(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')",
    ...             number = 10000000)
    3.39261603355
    

    For portion of the pattern that contain regex patterns, SRE_MATCH is used to determine matches. That is essentially the same code behind re.match.

    Note how the results are close (with re.match marginally faster) if the pattern starts with a regex pattern instead of a literal string.

    >>> print timeit.timeit(stmt="r.match(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
    ...              number = 10000000)
    3.22782492638
    >>> print timeit.timeit(stmt="r.search(s)",
    ...              setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
    ...             number = 10000000)
    3.31773591042
    

    In other words, ignoring the fact that search and match have different purposes, re.search is faster than re.match only when the pattern is a literal string.

    Of course, if you're working with literal strings, you're likely to be better off using string operations instead.

    >>> # Detecting exact matches
    >>> print timeit.timeit(stmt="s == r", 
    ...              setup="s = 'helloab'*100000; r = 'hello'", 
    ...              number = 10000000)
    0.339027881622
    
    >>> # Determine if string contains another string
    >>> print timeit.timeit(stmt="s in r", 
    ...              setup="s = 'helloab'*100000; r = 'hello'", 
    ...              number = 10000000)
    0.479326963425
    
    
    >>> # detecting prefix
    >>> print timeit.timeit(stmt="s.startswith(r)",
    ...              setup="s = 'helloab'*100000; r = 'hello'",
    ...             number = 10000000)
    1.49393510818
    >>> print timeit.timeit(stmt="s[:len(r)] == r",
    ...              setup="s = 'helloab'*100000; r = 'hello'",
    ...             number = 10000000)
    1.21005606651
    
    0 讨论(0)
提交回复
热议问题