What's the computational cost of the count operation on strings in Python?


Question


For example:

'hello'.count('e')

Is this O(n)? I'm guessing the way it works is it scans 'hello' and increments a counter each time the letter 'e' is seen. How can I know this without guessing? I tried reading the source code here, but got stuck upon finding this:

def count(s, *args):
    """count(s, sub[, start[,end]]) -> int

    Return the number of occurrences of substring sub in string
    s[start:end].  Optional arguments start and end are
    interpreted as in slice notation.

    """
    return s.count(*args)

Where can I read about what's executed in s.count(*args)?

edit: I understand what *args does in the context of Python functions.


Answer 1:


str.count is implemented in native code, in the stringobject.c file, which delegates to either stringlib_count or PyUnicode_Count, which itself delegates to stringlib_count again. stringlib_count ultimately uses fastsearch to search for occurrences of the substring in the string and count them.

For one-character strings (e.g. your 'e'), it is short-circuited to the following code path:

for (i = 0; i < n; i++)
    if (s[i] == p[0]) {
        count++;
        if (count == maxcount)
            return maxcount;
    }
return count;

So yes, this is exactly as you assumed: a simple iteration over the string, counting the occurrences of the substring.
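In pure Python, that single-character fast path behaves roughly like the following sketch (the function name and the simplified argument handling are mine, not CPython's):

def count_single_char(s, ch, maxcount=None):
    # Illustrative model of the C loop quoted above: scan the string once,
    # bump a counter on each match, and stop early once maxcount is reached.
    count = 0
    for c in s:
        if c == ch:
            count += 1
            if count == maxcount:
                return maxcount
    return count

count_single_char('hello', 'e')  # -> 1, same as 'hello'.count('e')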

For search strings longer than a single character, it gets a bit more complicated due to the handling of overlaps etc., and the logic is buried deeper in the fastsearch implementation. But it's essentially the same: a linear search through the string.
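One detail worth illustrating (my example, not part of the original answer): the scan counts non-overlapping occurrences, so after a match the search resumes just past it:

'aaaa'.count('aa')    # -> 2: matches at index 0 and 2; the overlapping match at index 1 is skipped
'ababa'.count('aba')  # -> 1: the match at index 0 consumes characters up to index 2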

So yes, str.count runs in linear time, O(n). And if you think about it, it makes a lot of sense: in order to know how often a substring appears in a string, you need to look at every possible substring of the same length. So for a substring of length 1, you have to look at every character in the string, giving you linear complexity.
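If you would rather confirm this empirically than by reading C, a rough timeit sketch can help (the absolute timings depend entirely on your machine; what matters is that the total roughly doubles when the string length doubles):

import timeit

for n in (1_000_000, 2_000_000, 4_000_000):
    s = 'hello' * (n // 5)                        # build a string of length n
    t = timeit.timeit(lambda: s.count('e'), number=100)
    print(f"len={len(s):>9,}  total={t:.3f}s")    # expect roughly 2x growth per row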

By the way, for more information about the underlying fastsearch algorithm, see the stringlib article on effbot.org (http://effbot.org/zone/stringlib.htm, also linked in the answer below).


For Python 3, which has only a single Unicode string type, the relevant implementations (in Objects/unicodeobject.c) are: unicode_count, which uses stringlib_count, which in turn uses fastsearch.




Answer 2:


Much of Python's library code is written in C. The code you are looking for is here:

http://svn.python.org/view/python/trunk/Objects/stringobject.c?view=markup

static PyMethodDef
string_methods[] = {
    // ...
    {"count", (PyCFunction)string_count, METH_VARARGS, count__doc__},
    // ...
    {NULL,     NULL}                         /* sentinel */
};

static PyObject *
string_count(PyStringObject *self, PyObject *args) {
    ...
}



Answer 3:


If you pursue @AJNeufeld's answer a little ways, you will eventually come upon this link, which explains how the (then-)new find logic works. It's a combination of several string searching approaches, with the intent of benefiting from some of the logic, but avoiding the up-front table setup costs for searches: http://effbot.org/zone/stringlib.htm

Boyer-Moore is a famous string searching algorithm. BM-Horspool and BM-Sunday are variants that improve on the original in certain ways. Google will find you more than you ever wanted to know about these.
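For a flavour of how the Horspool variant works, here is a minimal, unoptimised Python sketch of its bad-character shift idea (illustrative only; CPython's fastsearch is written in C and mixes this with other tricks such as a Bloom-filter-like skip table):

def horspool_find(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: if the last character of the current window is c,
    # how far can the pattern safely slide to the right?
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)  # characters not in the pattern allow a full-length jump
    return -1

horspool_find('hello world', 'wor')  # -> 6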



Source: https://stackoverflow.com/questions/35855748/whats-the-computational-cost-of-count-operation-on-strings-python
