It is my understanding that the range()
function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.
The Python 3 range()
object doesn't produce numbers immediately; it is a smart sequence object that produces numbers on demand. All it contains is your start, stop and step values, then as you iterate over the object the next integer is calculated each iteration.
The object also implements the object.__contains__ hook, and calculates if your number is part of its range. Calculating is a (near) constant time operation *. There is never a need to scan through all possible integers in the range.
From the range() object documentation:
The advantage of the
range
type over a regularlist
ortuple
is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores thestart
,stop
andstep
values, calculating individual items and subranges as needed).
So at a minimum, your range()
object would do:
class my_range(object):
def __init__(self, start, stop=None, step=1):
if stop is None:
start, stop = 0, start
self.start, self.stop, self.step = start, stop, step
if step < 0:
lo, hi, step = stop, start, -step
else:
lo, hi = start, stop
self.length = 0 if lo > hi else ((hi - lo - 1) // step) + 1
def __iter__(self):
current = self.start
if self.step < 0:
while current > self.stop:
yield current
current += self.step
else:
while current < self.stop:
yield current
current += self.step
def __len__(self):
return self.length
def __getitem__(self, i):
if i < 0:
i += self.length
if 0 <= i < self.length:
return self.start + i * self.step
raise IndexError('Index out of range: {}'.format(i))
def __contains__(self, num):
if self.step < 0:
if not (self.stop < num <= self.start):
return False
else:
if not (self.start <= num < self.stop):
return False
return (num - self.start) % self.step == 0
This is still missing several things that a real range()
supports (such as the .index()
or .count()
methods, hashing, equality testing, or slicing), but should give you an idea.
I also simplified the __contains__
implementation to only focus on integer tests; if you give a real range()
object a non-integer value (including subclasses of int
), a slow scan is initiated to see if there is a match, just as if you use a containment test against a list of all the contained values. This was done to continue to support other numeric types that just happen to support equality testing with integers but are not expected to support integer arithmetic as well. See the original Python issue that implemented the containment test.
* Near constant time because Python integers are unbounded and so math operations also grow in time as N grows, making this a O(log N) operation. Since it’s all executed in optimised C code and Python stores integer values in 30-bit chunks, you’d run out of memory before you saw any performance impact due to the size of the integers involved here.
TLDR; range is an arithmetic series so it can very easily calculate whether the object is there.It could even get the index of it if it were list like really quickly.
Try x-1 in (i for i in range(x))
for large x
values, which uses a generator comprehension to avoid invoking the range.__contains__
optimisation.
The fundamental misunderstanding here is in thinking that range
is a generator. It's not. In fact, it's not any kind of iterator.
You can tell this pretty easily:
>>> a = range(5)
>>> print(list(a))
[0, 1, 2, 3, 4]
>>> print(list(a))
[0, 1, 2, 3, 4]
If it were a generator, iterating it once would exhaust it:
>>> b = my_crappy_range(5)
>>> print(list(b))
[0, 1, 2, 3, 4]
>>> print(list(b))
[]
What range
actually is, is a sequence, just like a list. You can even test this:
>>> import collections.abc
>>> isinstance(a, collections.abc.Sequence)
True
This means it has to follow all the rules of being a sequence:
>>> a[3] # indexable
3
>>> len(a) # sized
5
>>> 3 in a # membership
True
>>> reversed(a) # reversible
<range_iterator at 0x101cd2360>
>>> a.index(3) # implements 'index'
3
>>> a.count(3) # implements 'count'
1
The difference between a range
and a list
is that a range
is a lazy or dynamic sequence; it doesn't remember all of its values, it just remembers its start
, stop
, and step
, and creates the values on demand on __getitem__
.
(As a side note, if you print(iter(a))
, you'll notice that range
uses the same listiterator
type as list
. How does that work? A listiterator
doesn't use anything special about list
except for the fact that it provides a C implementation of __getitem__
, so it works fine for range
too.)
Now, there's nothing that says that Sequence.__contains__
has to be constant time—in fact, for obvious examples of sequences like list
, it isn't. But there's nothing that says it can't be. And it's easier to implement range.__contains__
to just check it mathematically ((val - start) % step
, but with some extra complexity to deal with negative steps) than to actually generate and test all the values, so why shouldn't it do it the better way?
But there doesn't seem to be anything in the language that guarantees this will happen. As Ashwini Chaudhari points out, if you give it a non-integral value, instead of converting to integer and doing the mathematical test, it will fall back to iterating all the values and comparing them one by one. And just because CPython 3.2+ and PyPy 3.x versions happen to contain this optimization, and it's an obvious good idea and easy to do, there's no reason that IronPython or NewKickAssPython 3.x couldn't leave it out. (And in fact CPython 3.0-3.1 didn't include it.)
If range
actually were a generator, like my_crappy_range
, then it wouldn't make sense to test __contains__
this way, or at least the way it makes sense wouldn't be obvious. If you'd already iterated the first 3 values, is 1
still in
the generator? Should testing for 1
cause it to iterate and consume all the values up to 1
(or up to the first value >= 1
)?
Take an example, 997 is in range(4, 1000, 3) because:
4 <= 997 < 1000, and (997 - 4) % 3 == 0.
To add to Martijn’s answer, this is the relevant part of the source (in C, as the range object is written in native code):
static int
range_contains(rangeobject *r, PyObject *ob)
{
if (PyLong_CheckExact(ob) || PyBool_Check(ob))
return range_contains_long(r, ob);
return (int)_PySequence_IterSearch((PyObject*)r, ob,
PY_ITERSEARCH_CONTAINS);
}
So for PyLong
objects (which is int
in Python 3), it will use the range_contains_long
function to determine the result. And that function essentially checks if ob
is in the specified range (although it looks a bit more complex in C).
If it’s not an int
object, it falls back to iterating until it finds the value (or not).
The whole logic could be translated to pseudo-Python like this:
def range_contains (rangeObj, obj):
if isinstance(obj, int):
return range_contains_long(rangeObj, obj)
# default logic by iterating
return any(obj == x for x in rangeObj)
def range_contains_long (r, num):
if r.step > 0:
# positive step: r.start <= num < r.stop
cmp2 = r.start <= num
cmp3 = num < r.stop
else:
# negative step: r.start >= num > r.stop
cmp2 = num <= r.start
cmp3 = r.stop < num
# outside of the range boundaries
if not cmp2 or not cmp3:
return False
# num must be on a valid step inside the boundaries
return (num - r.start) % r.step == 0