Why is string's startswith slower than in?

日久生厌 2020-12-24 10:35

Surprisingly, I find startswith is slower than in:

In [10]: s="ABCD"*10

In [11]: %timeit s.startswith("XYZ")
1000000 loops, be


        
2 Answers
  • 2020-12-24 11:03

    This is likely because str.startswith() does more than str.__contains__(). I also believe str.__contains__ operates entirely in C, whereas str.startswith() has to interact with Python types: its signature is str.startswith(prefix[, start[, end]]), where prefix can also be a tuple of strings to try.
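
To make the signature concrete, here is a small sketch of the call forms that str.startswith() accepts. The variable names are only illustrative; the string is the one from the question.

```python
s = "ABCD" * 10

# A single string prefix.
single = s.startswith("ABC")

# A tuple of prefixes: True if any one of them matches.
any_of = s.startswith(("XYZ", "AB"))

# Optional start: matching begins at index 2, where the text is "CDAB...".
from_start = s.startswith("CD", 2)

# Optional start and end: only s[2:3] == "C" is considered, so "CD" cannot fit.
bounded = s.startswith("CD", 2, 3)

print(single, any_of, from_start, bounded)  # True True True False
```

The in operator has none of these options to handle, which fits the point that startswith() does more work per call.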

  • 2020-12-24 11:13

    As already mentioned in the comments, if you use s.__contains__("XYZ") you get a result that is closer to s.startswith("XYZ"), because both take the same route: an attribute lookup on the string object, followed by a function call. This is usually somewhat expensive (though not enough that you should worry about it, of course). On the other hand, when you write "XYZ" in s, the compiler recognizes the operator and can short-cut the attribute access to __contains__ (or rather to the implementation behind it, because __contains__ itself is just one way to reach that implementation).

    You can get an idea about this by looking at the bytecode:

    >>> import dis
    >>> dis.dis('"XYZ" in s')
      1           0 LOAD_CONST               0 ('XYZ')
                  3 LOAD_NAME                0 (s)
                  6 COMPARE_OP               6 (in)
                  9 RETURN_VALUE
    >>> dis.dis('s.__contains__("XYZ")')
      1           0 LOAD_NAME                0 (s)
                  3 LOAD_ATTR                1 (__contains__)
                  6 LOAD_CONST               0 ('XYZ')
                  9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
                 12 RETURN_VALUE
    
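
The listing above comes from an older CPython; opcode names and offsets differ between versions (for example, since Python 3.9 the in operator compiles to CONTAINS_OP rather than COMPARE_OP). A minimal sketch that checks the same point without depending on exact opcode names is to look for an attribute-loading instruction. The helper opnames and the set ATTRIBUTE_OPS are names made up for this illustration.

```python
import dis

# Opcodes that perform an attribute or method lookup, depending on the
# CPython version (assumption: these two names cover current releases).
ATTRIBUTE_OPS = {"LOAD_ATTR", "LOAD_METHOD"}

def opnames(source):
    """Return the opcode names the compiler emits for a source string."""
    return [ins.opname for ins in dis.get_instructions(source)]

in_ops = opnames('"XYZ" in s')
call_ops = opnames('s.__contains__("XYZ")')

print("in operator :", in_ops)
print("method call :", call_ops)
print("attribute lookup for 'in'?   ", bool(ATTRIBUTE_OPS & set(in_ops)))
print("attribute lookup for a call? ", bool(ATTRIBUTE_OPS & set(call_ops)))
```

Only the explicit method call should report an attribute lookup, which is the extra step described above.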

    So comparing s.__contains__("XYZ") with s.startswith("XYZ") will produce more similar results. For your example string s, however, startswith will still be slower.
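
To reproduce the comparison outside IPython, a rough sketch using the standard timeit module could look like this. The repetition count NUMBER and the helper best_time are arbitrary choices for the illustration, and the absolute numbers will vary with the machine and the Python version.

```python
import timeit

s = "ABCD" * 10
NUMBER = 100_000  # executions per measurement; an arbitrary choice

statements = ['"XYZ" in s', 's.__contains__("XYZ")', 's.startswith("XYZ")']

def best_time(stmt):
    """Return the fastest of three runs, in seconds, for NUMBER executions."""
    return min(timeit.repeat(stmt, globals={"s": s}, number=NUMBER, repeat=3))

timings = {stmt: best_time(stmt) for stmt in statements}

for stmt, seconds in timings.items():
    print(f"{stmt:<24} {seconds / NUMBER * 1e9:8.1f} ns per call")
```

Taking the minimum of several runs is the usual way to reduce noise from other processes.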

    To understand why, you can look at the implementation of both. The interesting thing about the contains implementation is that it is statically typed: it simply assumes that the argument is itself a unicode object. So this is quite efficient.

    The startswith implementation, however, is a “dynamic” Python method, which requires the implementation to actually parse its arguments. startswith also supports a tuple as an argument, which makes the whole start-up of the method a bit slower (shortened by me, with my comments):

    static PyObject * unicode_startswith(PyObject *self, PyObject *args)
    {
        // argument parsing
        PyObject *subobj;
        PyObject *substring;
        Py_ssize_t start = 0;
        Py_ssize_t end = PY_SSIZE_T_MAX;
        int result;
        if (!stringlib_parse_args_finds("startswith", args, &subobj, &start, &end))
            return NULL;
    
        // tuple handling
        if (PyTuple_Check(subobj)) {}
    
        // unicode conversion
        substring = PyUnicode_FromObject(subobj);
        if (substring == NULL) {}
    
        // actual implementation
        result = tailmatch(self, substring, start, end, -1);
        Py_DECREF(substring);
        if (result == -1)
            return NULL;
        return PyBool_FromLong(result);
    }
    

    This extra work is likely a big reason why startswith is slower on strings for which the contains check is fast thanks to its simplicity.
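
The steps commented in the C function can be mirrored in a few lines of Python. This is only an illustrative model of the order of operations, not how CPython implements the method: the name startswith_model is hypothetical, and edge cases (such as a start index beyond the end of the string) are not reproduced exactly.

```python
def startswith_model(s, prefix, start=0, end=None):
    """Rough Python model of the steps in the C function (illustration only)."""
    # Argument parsing: apply the optional start/end window.
    window = s[start:end]

    # Tuple handling: treat a single prefix as a one-element tuple.
    candidates = prefix if isinstance(prefix, tuple) else (prefix,)

    for candidate in candidates:
        # Type check, standing in for the unicode conversion step.
        if not isinstance(candidate, str):
            raise TypeError("prefix must be str or a tuple of str")
        # Actual match: compare the head of the window with the prefix.
        if window[:len(candidate)] == candidate:
            return True
    return False


s = "ABCD" * 10
print(startswith_model(s, "XYZ"))          # False
print(startswith_model(s, ("XYZ", "AB")))  # True
print(startswith_model(s, "CD", 2))        # True
```

Everything before the final comparison is overhead that the in operator never has to pay.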
