Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?

梦毁少年i 2020-11-22 03:46

It is my understanding that the range() function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.

11 answers
  • The other answers explained it well already, but I'd like to offer another experiment illustrating the nature of range objects:

    >>> r = range(5)
    >>> for i in r:
    ...     print(i, 2 in r, list(r))
    ...
    0 True [0, 1, 2, 3, 4]
    1 True [0, 1, 2, 3, 4]
    2 True [0, 1, 2, 3, 4]
    3 True [0, 1, 2, 3, 4]
    4 True [0, 1, 2, 3, 4]
    

    As you can see, a range object is an object that remembers its range and can be used many times (even while iterating over it), not just a one-time generator.
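    For contrast, here is a small sketch of the same experiment run against a real generator. The variable names are mine, chosen only for illustration.

```python
# A range can be traversed repeatedly; a generator is consumed by its first pass.
r = range(5)
g = (i for i in range(5))

range_first = list(r)
range_second = list(r)   # same values again

gen_first = list(g)
gen_second = list(g)     # empty: the generator is already exhausted

# Membership tests do not consume a range.
range_membership = (2 in r, 2 in r)

# On a generator, `in` advances it until a match is found (or it runs out).
g2 = (i for i in range(5))
gen_membership = (2 in g2, 2 in g2)

print(range_first, range_second)
print(gen_first, gen_second)
print(range_membership, gen_membership)
```

    The second `2 in g2` comes back False because the first test already consumed 0, 1 and 2.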

  • 2020-11-22 04:25

    This comes down to lazy evaluation plus an extra optimization in range. The values in a range are not computed until they are actually needed, and for a membership test the optimization means they may not need to be computed at all.

    By the way, your integer is not that big; consider sys.maxsize:

    sys.maxsize in range(sys.maxsize) is pretty fast

    thanks to the optimization: it is easy to compare the given integer with just the start and end of the range.

    But:

    Decimal(sys.maxsize) in range(sys.maxsize) is pretty slow.

    (In this case the optimization does not apply: when Python receives an unexpected type such as Decimal, it compares the value against every number in the range.)

    You should be aware that this is an implementation detail; it should not be relied upon, because it may change in the future.
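    To make the "compares every number" part visible without waiting on a huge range, here is a rough sketch that counts the comparisons. CountingEq and comparisons_needed are hypothetical helpers I made up for this purpose, and the counts describe CPython's current behaviour, which (as said above) is an implementation detail.

```python
class CountingEq:
    """Hypothetical non-int value that counts how often it is compared."""

    def __init__(self, value):
        self.value = value
        self.calls = 0

    def __eq__(self, other):
        self.calls += 1
        return self.value == other


def comparisons_needed(value, r):
    """Return (found, number of == calls) for a non-int probe tested against r."""
    probe = CountingEq(value)
    found = probe in r
    return found, probe.calls


last_item = comparisons_needed(999, range(1000))     # match on the final item
missing = comparisons_needed(5000, range(1000))      # no match at all
first_item = comparisons_needed(0, range(10**6))     # match on the first item

print(last_item, missing, first_item)
```

    Because the probe is not an int, the range has no shortcut for it: the number of comparisons grows with the position of the match, and a miss costs one comparison per element.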

  • 2020-11-22 04:26

    Use the source, Luke!

    In CPython, range(...).__contains__ (a method wrapper) will eventually delegate to a simple calculation which checks if the value can possibly be in the range. The reason for the speed here is we're using mathematical reasoning about the bounds, rather than a direct iteration of the range object. To explain the logic used:

    1. Check that the number is between start and stop, and
    2. Check that the stride value doesn't "step over" our number.

    For example, 994 is in range(4, 1000, 2) because:

    1. 4 <= 994 < 1000, and
    2. (994 - 4) % 2 == 0.
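    The two checks can be sketched in pure Python as follows. in_range is a hypothetical helper written for illustration only; it assumes the value is a plain int and it is not how CPython exposes this logic.

```python
def in_range(value, start, stop, step=1):
    """Hypothetical pure-Python mirror of the two checks, for a plain int value."""
    if step == 0:
        raise ValueError("step must not be zero")
    # Check 1: is the value between the bounds? The direction depends on the step's sign.
    if step > 0:
        within_bounds = start <= value < stop
    else:
        within_bounds = stop < value <= start
    if not within_bounds:
        return False
    # Check 2: does the stride land exactly on the value?
    return (value - start) % step == 0


print(in_range(994, 4, 1000, 2))       # the example above
print(in_range(995, 4, 1000, 2))       # stepped over
print(in_range(10**15, 0, 10**15 + 1)) # the value from the question
```

    Neither check depends on how many elements the range holds, which is why the size of the range in the question does not matter.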

    The full C code is included below, which is a bit more verbose because of memory management and reference counting details, but the basic idea is there:

    static int
    range_contains_long(rangeobject *r, PyObject *ob)
    {
        int cmp1, cmp2, cmp3;
        PyObject *tmp1 = NULL;
        PyObject *tmp2 = NULL;
        PyObject *zero = NULL;
        int result = -1;
    
        zero = PyLong_FromLong(0);
        if (zero == NULL) /* MemoryError in int(0) */
            goto end;
    
        /* Check if the value can possibly be in the range. */
    
        cmp1 = PyObject_RichCompareBool(r->step, zero, Py_GT);
        if (cmp1 == -1)
            goto end;
        if (cmp1 == 1) { /* positive steps: start <= ob < stop */
            cmp2 = PyObject_RichCompareBool(r->start, ob, Py_LE);
            cmp3 = PyObject_RichCompareBool(ob, r->stop, Py_LT);
        }
        else { /* negative steps: stop < ob <= start */
            cmp2 = PyObject_RichCompareBool(ob, r->start, Py_LE);
            cmp3 = PyObject_RichCompareBool(r->stop, ob, Py_LT);
        }
    
        if (cmp2 == -1 || cmp3 == -1) /* TypeError */
            goto end;
        if (cmp2 == 0 || cmp3 == 0) { /* ob outside of range */
            result = 0;
            goto end;
        }
    
        /* Check that the stride does not invalidate ob's membership. */
        tmp1 = PyNumber_Subtract(ob, r->start);
        if (tmp1 == NULL)
            goto end;
        tmp2 = PyNumber_Remainder(tmp1, r->step);
        if (tmp2 == NULL)
            goto end;
        /* result = ((int(ob) - start) % step) == 0 */
        result = PyObject_RichCompareBool(tmp2, zero, Py_EQ);
      end:
        Py_XDECREF(tmp1);
        Py_XDECREF(tmp2);
        Py_XDECREF(zero);
        return result;
    }
    
    static int
    range_contains(rangeobject *r, PyObject *ob)
    {
        if (PyLong_CheckExact(ob) || PyBool_Check(ob))
            return range_contains_long(r, ob);
    
        return (int)_PySequence_IterSearch((PyObject*)r, ob,
                                           PY_ITERSEARCH_CONTAINS);
    }
    

    The "meat" of the idea is mentioned in the line:

    /* result = ((int(ob) - start) % step) == 0 */ 
    

    As a final note - look at the range_contains function at the bottom of the code snippet. If the exact type check fails then we don't use the clever algorithm described, instead falling back to a dumb iteration search of the range using _PySequence_IterSearch! You can check this behaviour in the interpreter (I'm using v3.5.0 here):

    >>> x, r = 1000000000000000, range(1000000000000001)
    >>> class MyInt(int):
    ...     pass
    ... 
    >>> x_ = MyInt(x)
    >>> x in r  # calculates immediately :) 
    True
    >>> x_ in r  # iterates for ages.. :( 
    ^\Quit (core dumped)
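    One plausible reason for the exact type check (this is my reading, the snippet itself doesn't say so) is that an int subclass may override ==, in which case arithmetic on the underlying value and an item-by-item search can disagree. AlwaysEqual below is a deliberately silly, hypothetical subclass, and the range is kept small so the search finishes immediately.

```python
class AlwaysEqual(int):
    """Hypothetical int subclass whose == ignores the numeric value."""

    def __eq__(self, other):
        return True

    __hash__ = int.__hash__  # defining __eq__ would otherwise make it unhashable


probe = AlwaysEqual(100)

# Arithmetic on the plain integer value says 100 is outside range(5) ...
by_arithmetic = 0 <= int(probe) < 5

# ... but the item-by-item search compares each item with the probe using ==,
# and the subclass answers True for the very first item.
by_search = probe in range(5)

# A list with the same items gives the same answer as the range.
by_list = probe in [0, 1, 2, 3, 4]

print(by_arithmetic, by_search, by_list)
```

    With the fallback in place, range agrees with what a list of the same values would say, at the cost of the slow path for anything that is not exactly an int or a bool.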
    
  • 2020-11-22 04:29

    If you're wondering why this optimization was added to range.__contains__, and why it wasn't added to xrange.__contains__ in 2.7:

    First, as Ashwini Chaudhary discovered, issue 1766304 was opened explicitly to optimize [x]range.__contains__. A patch for this was accepted and checked in for 3.2, but not backported to 2.7 because "xrange has behaved like this for such a long time that I don't see what it buys us to commit the patch this late." (2.7 was nearly out at that point.)

    Meanwhile:

    Originally, xrange was a not-quite-sequence object. As the 3.1 docs say:

    Range objects have very little behavior: they only support indexing, iteration, and the len function.

    This wasn't quite true; an xrange object actually supported a few other things that come automatically with indexing and len,* including __contains__ (via linear search). But nobody thought it was worth making them full sequences at the time.

    Then, as part of implementing the Abstract Base Classes PEP, it was important to figure out which builtin types should be marked as implementing which ABCs, and xrange/range claimed to implement collections.Sequence, even though it still only handled the same "very little behavior". Nobody noticed that problem until issue 9213. The patch for that issue not only added index and count to 3.2's range, it also re-worked the optimized __contains__ (which shares the same math with index, and is directly used by count).** This change went in for 3.2 as well, and was not backported to 2.x, because "it's a bugfix that adds new methods". (At this point, 2.7 was already past rc status.)

    So, there were two chances to get this optimization backported to 2.7, but they were both rejected.


    * In fact, you even get iteration for free with indexing alone, but in 2.3 xrange objects got a custom iterator.

    ** The first version actually reimplemented it, and got the details wrong—e.g., it would give you MyIntSubclass(2) in range(5) == False. But Daniel Stutzbach's updated version of the patch restored most of the previous code, including the fallback to the generic, slow _PySequence_IterSearch that pre-3.2 range.__contains__ was implicitly using when the optimization doesn't apply.

  • 2020-11-22 04:34

    TL;DR

    The object returned by range() is actually a range object. This object implements the iterable protocol (__iter__), so you can iterate over its values sequentially, just as you can with a generator, list, or tuple.

    But it also implements the __contains__ interface which is actually what gets called when an object appears on the right hand side of the in operator. The __contains__() method returns a bool of whether or not the item on the left-hand-side of the in is in the object. Since range objects know their bounds and stride, this is very easy to implement in O(1).
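    Here is a small sketch of how `in` picks between the two interfaces. IterOnly and WithContains are hypothetical classes made up for this illustration; they count how many values had to be produced to answer the membership test.

```python
class IterOnly:
    """Hypothetical container with __iter__ only, so `in` has to walk the values."""

    def __init__(self, n):
        self.n = n
        self.produced = 0  # how many values iteration has handed out

    def __iter__(self):
        for i in range(self.n):
            self.produced += 1
            yield i


class WithContains(IterOnly):
    """Same container, but it answers membership from its bounds."""

    def __contains__(self, item):
        return isinstance(item, int) and 0 <= item < self.n


slow = IterOnly(1000)
fast = WithContains(1000)

slow_found = 999 in slow   # falls back to iterating
fast_found = 999 in fast   # calls __contains__ directly

print(slow_found, slow.produced)
print(fast_found, fast.produced)
```

    Both tests return True, but only the first one had to produce any values to get there.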
