String matching performance: gcc versus CPython

后端 未结 1 547
臣服心动
臣服心动 2021-01-04 11:31

Whilst researching performance trade-offs between Python and C++, I\'ve devised a small example, which mostly focusses on a dumb substring matching.

Here is the rele

相关标签:
1条回答
  • 2021-01-04 12:11

    The python 3.4 code b'abc' in b'abcabc' (or b'abcabc'.__contains__(b'abc') as in your example) executes the bytes_contains method, which in turn calls the inlined function stringlib_find; which delegates the search to FASTSEARCH.

    The FASTSEARCH function then uses a simplified Boyer-Moore search algorithm (Boyer-Moore-Horspool):

    fast search/count implementation, based on a mix between boyer- moore and horspool, with a few more bells and whistles on the top. for some more background, see: http://effbot.org/zone/stringlib.htm

    There are some modifications too, as noted by the comments:

    note: fastsearch may access s[n], which isn't a problem when using Python's ordinary string types, but may cause problems if you're using this code in other contexts. also, the count mode returns -1 if there cannot possible be a match in the target string, and 0 if it has actually checked for matches, but didn't find any. callers beware!


    The GNU C++ Standard Library basic_string<T>::find() implementation is as generic (and dumb) as possible; it just tries dumbly matching the pattern at each and every consecutive character position until it finds the match.


    TL;DR: The reason why the C++ standard library is so slow compared to Python is because it tries to do a generic algorithm on top of std::basic_string<char>, but fails to do it efficiently for the more interesting cases; whereas in Python the programmer gets the most efficient algorithms on case-by-case basis for free.

    0 讨论(0)
提交回复
热议问题