String matching performance: gcc versus CPython

臣服心动 2021-01-04 11:31

Whilst researching performance trade-offs between Python and C++, I've devised a small example which mostly focusses on dumb substring matching.

Here is the relevant comparison: the Python side checks b'abc' in b'abcabc' (via __contains__), while the C++ side performs the equivalent search with std::string::find().
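
A hypothetical sketch of what the Python side of such a timing might look like (the byte strings match the ones discussed in the answer below; this is an illustration, not the original benchmark, and the C++ side would time the equivalent std::string::find() call):

    import timeit

    # Hypothetical illustration of the Python side of the comparison; the
    # C++ side would time the equivalent std::string::find() call.
    haystack = b'abcabc'
    needle = b'abc'

    elapsed = timeit.timeit(lambda: needle in haystack, number=1000000)
    print('%.3f s for 1,000,000 substring checks' % elapsed)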

1 Answer
  •  一整个雨季
    2021-01-04 12:11

    The Python 3.4 code b'abc' in b'abcabc' (or b'abcabc'.__contains__(b'abc'), as in your example) executes the bytes_contains method, which in turn calls the inlined function stringlib_find, which delegates the search to FASTSEARCH.
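
    The following spellings all end up in that same fastsearch code path; a quick way to check that they agree (an illustration, not the CPython source itself):

        haystack = b'abcabc'
        needle = b'abc'

        # All three spell the same bytes substring search; in CPython they all
        # reach the stringlib fastsearch machinery described above.
        print(needle in haystack)               # True
        print(haystack.__contains__(needle))    # True, what the `in` operator calls
        print(haystack.find(needle) != -1)      # True, find() uses the same search code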

    The FASTSEARCH function then uses a simplified Boyer-Moore search algorithm (Boyer-Moore-Horspool):

    fast search/count implementation, based on a mix between boyer- moore and horspool, with a few more bells and whistles on the top. for some more background, see: http://effbot.org/zone/stringlib.htm

    There are some modifications too, as noted by the comments:

    note: fastsearch may access s[n], which isn't a problem when using Python's ordinary string types, but may cause problems if you're using this code in other contexts. also, the count mode returns -1 if there cannot possible be a match in the target string, and 0 if it has actually checked for matches, but didn't find any. callers beware!
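
    To make the skip idea concrete, here is a minimal Boyer-Moore-Horspool sketch in Python. It only illustrates the general technique; the real FASTSEARCH is C code with the extra bells and whistles mentioned in the comments above.

        def horspool_find(haystack: bytes, needle: bytes) -> int:
            """Return the first index of needle in haystack, or -1 (Horspool sketch)."""
            n, m = len(haystack), len(needle)
            if m == 0:
                return 0
            if m > n:
                return -1
            # Bad-character table: how far the window may shift when the byte
            # aligned with the end of the needle mismatches.  Last occurrence wins.
            skip = {needle[i]: m - 1 - i for i in range(m - 1)}
            i = m - 1                      # haystack index aligned with needle's last byte
            while i < n:
                j, k = m - 1, i
                while j >= 0 and haystack[k] == needle[j]:
                    j -= 1
                    k -= 1
                if j < 0:
                    return k + 1           # full match found
                i += skip.get(haystack[i], m)
            return -1

        print(horspool_find(b'abcabc', b'abc'))   # 0, same result as b'abcabc'.find(b'abc')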


    The GNU C++ standard library's basic_string::find() implementation is as generic (and as dumb) as possible: it simply tries to match the pattern at each and every consecutive character position until it finds a match.
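
    In other words, the generic approach amounts to roughly the following quadratic scan (sketched in Python for symmetry with the example above; the actual libstdc++ code is C++ and works through char_traits):

        def naive_find(haystack: bytes, needle: bytes) -> int:
            """Return the first index of needle in haystack, or -1, by brute force."""
            n, m = len(haystack), len(needle)
            for i in range(n - m + 1):
                j = 0
                while j < m and haystack[i + j] == needle[j]:
                    j += 1
                if j == m:          # every byte of the pattern matched at offset i
                    return i
            return -1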


    TL;DR: The reason the C++ standard library is so slow here compared to Python is that it applies a fully generic algorithm on top of std::basic_string and fails to handle the more interesting cases efficiently, whereas in Python the programmer gets efficient, specialized algorithms on a case-by-case basis for free.
