“is” operator behaves unexpectedly with integers

橙三吉。 提交于 2020-03-28 06:53:10

问题


Why does the following behave unexpectedly in Python?

>>> a = 256
>>> b = 256
>>> a is b
True           # This is an expected result
>>> a = 257
>>> b = 257
>>> a is b
False          # What happened here? Why is this False?
>>> 257 is 257
True           # Yet the literal numbers compare properly

I am using Python 2.5.2. Trying some different versions of Python, it appears that Python 2.3.3 shows the above behaviour between 99 and 100.

Based on the above, I can hypothesize that Python is internally implemented such that "small" integers are stored in a different way than larger integers and the is operator can tell the difference. Why the leaky abstraction? What is a better way of comparing two arbitrary objects to see whether they are the same when I don't know in advance whether they are numbers or not?


回答1:


Take a look at this:

>>> a = 256
>>> b = 256
>>> id(a)
9987148
>>> id(b)
9987148
>>> a = 257
>>> b = 257
>>> id(a)
11662816
>>> id(b)
11662828

Here's what I found in the Python 2 documentation, "Plain Integer Objects" (It's the same for Python 3):

The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object. So it should be possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)




回答2:


Python's “is” operator behaves unexpectedly with integers?

In summary - let me emphasize: Do not use is to compare integers.

This isn't behavior you should have any expectations about.

Instead, use == and != to compare for equality and inequality, respectively. For example:

>>> a = 1000
>>> a == 1000       # Test integers like this,
True
>>> a != 5000       # or this!
True
>>> a is 1000       # Don't do this! - Don't use `is` to test integers!!
False

Explanation

To know this, you need to know the following.

First, what does is do? It is a comparison operator. From the documentation:

The operators is and is not test for object identity: x is y is true if and only if x and y are the same object. x is not y yields the inverse truth value.

And so the following are equivalent.

>>> a is b
>>> id(a) == id(b)

From the documentation:

id Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.

Note that the fact that the id of an object in CPython (the reference implementation of Python) is the location in memory is an implementation detail. Other implementations of Python (such as Jython or IronPython) could easily have a different implementation for id.

So what is the use-case for is? PEP8 describes:

Comparisons to singletons like None should always be done with is or is not, never the equality operators.

The Question

You ask, and state, the following question (with code):

Why does the following behave unexpectedly in Python?

>>> a = 256
>>> b = 256
>>> a is b
True           # This is an expected result

It is not an expected result. Why is it expected? It only means that the integers valued at 256 referenced by both a and b are the same instance of integer. Integers are immutable in Python, thus they cannot change. This should have no impact on any code. It should not be expected. It is merely an implementation detail.

But perhaps we should be glad that there is not a new separate instance in memory every time we state a value equals 256.

>>> a = 257
>>> b = 257
>>> a is b
False          # What happened here? Why is this False?

Looks like we now have two separate instances of integers with the value of 257 in memory. Since integers are immutable, this wastes memory. Let's hope we're not wasting a lot of it. We're probably not. But this behavior is not guaranteed.

>>> 257 is 257
True           # Yet the literal numbers compare properly

Well, this looks like your particular implementation of Python is trying to be smart and not creating redundantly valued integers in memory unless it has to. You seem to indicate you are using the referent implementation of Python, which is CPython. Good for CPython.

It might be even better if CPython could do this globally, if it could do so cheaply (as there would a cost in the lookup), perhaps another implementation might.

But as for impact on code, you should not care if an integer is a particular instance of an integer. You should only care what the value of that instance is, and you would use the normal comparison operators for that, i.e. ==.

What is does

is checks that the id of two objects are the same. In CPython, the id is the location in memory, but it could be some other uniquely identifying number in another implementation. To restate this with code:

>>> a is b

is the same as

>>> id(a) == id(b)

Why would we want to use is then?

This can be a very fast check relative to say, checking if two very long strings are equal in value. But since it applies to the uniqueness of the object, we thus have limited use-cases for it. In fact, we mostly want to use it to check for None, which is a singleton (a sole instance existing in one place in memory). We might create other singletons if there is potential to conflate them, which we might check with is, but these are relatively rare. Here's an example (will work in Python 2 and 3) e.g.

SENTINEL_SINGLETON = object() # this will only be created one time.

def foo(keyword_argument=None):
    if keyword_argument is None:
        print('no argument given to foo')
    bar()
    bar(keyword_argument)
    bar('baz')

def bar(keyword_argument=SENTINEL_SINGLETON):
    # SENTINEL_SINGLETON tells us if we were not passed anything
    # as None is a legitimate potential argument we could get.
    if keyword_argument is SENTINEL_SINGLETON:
        print('no argument given to bar')
    else:
        print('argument to bar: {0}'.format(keyword_argument))

foo()

Which prints:

no argument given to foo
no argument given to bar
argument to bar: None
argument to bar: baz

And so we see, with is and a sentinel, we are able to differentiate between when bar is called with no arguments and when it is called with None. These are the primary use-cases for is - do not use it to test for equality of integers, strings, tuples, or other things like these.




回答3:


It depends on whether you're looking to see if 2 things are equal, or the same object.

is checks to see if they are the same object, not just equal. The small ints are probably pointing to the same memory location for space efficiency

In [29]: a = 3
In [30]: b = 3
In [31]: id(a)
Out[31]: 500729144
In [32]: id(b)
Out[32]: 500729144

You should use == to compare equality of arbitrary objects. You can specify the behavior with the __eq__, and __ne__ attributes.




回答4:


I'm late but, you want some source with your answer? I'll try and word this in an introductory manner so more folks can follow along.


A good thing about CPython is that you can actually see the source for this. I'm going to use links for the 3.5 release, but finding the corresponding 2.x ones is trivial.

In CPython, the C-API function that handles creating a new int object is PyLong_FromLong(long v). The description for this function is:

The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object. So it should be possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)

(My italics)

Don't know about you but I see this and think: Let's find that array!

If you haven't fiddled with the C code implementing CPython you should; everything is pretty organized and readable. For our case, we need to look in the Objects subdirectory of the main source code directory tree.

PyLong_FromLong deals with long objects so it shouldn't be hard to deduce that we need to peek inside longobject.c. After looking inside you might think things are chaotic; they are, but fear not, the function we're looking for is chilling at line 230 waiting for us to check it out. It's a smallish function so the main body (excluding declarations) is easily pasted here:

PyObject *
PyLong_FromLong(long ival)
{
    // omitting declarations

    CHECK_SMALL_INT(ival);

    if (ival < 0) {
        /* negate: cant write this as abs_ival = -ival since that
           invokes undefined behaviour when ival is LONG_MIN */
        abs_ival = 0U-(unsigned long)ival;
        sign = -1;
    }
    else {
        abs_ival = (unsigned long)ival;
    }

    /* Fast path for single-digit ints */
    if (!(abs_ival >> PyLong_SHIFT)) {
        v = _PyLong_New(1);
        if (v) {
            Py_SIZE(v) = sign;
            v->ob_digit[0] = Py_SAFE_DOWNCAST(
                abs_ival, unsigned long, digit);
        }
        return (PyObject*)v; 
}

Now, we're no C master-code-haxxorz but we're also not dumb, we can see that CHECK_SMALL_INT(ival); peeking at us all seductively; we can understand it has something to do with this. Let's check it out:

#define CHECK_SMALL_INT(ival) \
    do if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) { \
        return get_small_int((sdigit)ival); \
    } while(0)

So it's a macro that calls function get_small_int if the value ival satisfies the condition:

if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS)

So what are NSMALLNEGINTS and NSMALLPOSINTS? Macros! Here they are:

#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS           257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS           5
#endif

So our condition is if (-5 <= ival && ival < 257) call get_small_int.

Next let's look at get_small_int in all its glory (well, we'll just look at its body because that's where the interesting things are):

PyObject *v;
assert(-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS);
v = (PyObject *)&small_ints[ival + NSMALLNEGINTS];
Py_INCREF(v);

Okay, declare a PyObject, assert that the previous condition holds and execute the assignment:

v = (PyObject *)&small_ints[ival + NSMALLNEGINTS];

small_ints looks a lot like that array we've been searching for, and it is! We could've just read the damn documentation and we would've know all along!:

/* Small integers are preallocated in this array so that they
   can be shared.
   The integers that are preallocated are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyLongObject small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

So yup, this is our guy. When you want to create a new int in the range [NSMALLNEGINTS, NSMALLPOSINTS) you'll just get back a reference to an already existing object that has been preallocated.

Since the reference refers to the same object, issuing id() directly or checking for identity with is on it will return exactly the same thing.

But, when are they allocated??

During initialization in _PyLong_Init Python will gladly enter in a for loop do do this for you:

for (ival = -NSMALLNEGINTS; ival <  NSMALLPOSINTS; ival++, v++) {

Check out the source to read the loop body!

I hope my explanation has made you C things clearly now (pun obviously intented).


But, 257 is 257? What's up?

This is actually easier to explain, and I have attempted to do so already; it's due to the fact that Python will execute this interactive statement as a single block:

>>> 257 is 257

During complilation of this statement, CPython will see that you have two matching literals and will use the same PyLongObject representing 257. You can see this if you do the compilation yourself and examine its contents:

>>> codeObj = compile("257 is 257", "blah!", "exec")
>>> codeObj.co_consts
(257, None)

When CPython does the operation, it's now just going to load the exact same object:

>>> import dis
>>> dis.dis(codeObj)
  1           0 LOAD_CONST               0 (257)   # dis
              3 LOAD_CONST               0 (257)   # dis again
              6 COMPARE_OP               8 (is)

So is will return True.




回答5:


As you can check in source file intobject.c, Python caches small integers for efficiency. Every time you create a reference to a small integer, you are referring the cached small integer, not a new object. 257 is not an small integer, so it is calculated as a different object.

It is better to use == for that purpose.




回答6:


I think your hypotheses is correct. Experiment with id (identity of object):

In [1]: id(255)
Out[1]: 146349024

In [2]: id(255)
Out[2]: 146349024

In [3]: id(257)
Out[3]: 146802752

In [4]: id(257)
Out[4]: 148993740

In [5]: a=255

In [6]: b=255

In [7]: c=257

In [8]: d=257

In [9]: id(a), id(b), id(c), id(d)
Out[9]: (146349024, 146349024, 146783024, 146804020)

It appears that numbers <= 255 are treated as literals and anything above is treated differently!




回答7:


For immutable value objects, like ints, strings or datetimes, object identity is not especially useful. It's better to think about equality. Identity is essentially an implementation detail for value objects - since they're immutable, there's no effective difference between having multiple refs to the same object or multiple objects.




回答8:


There's another issue that isn't pointed out in any of the existing answers. Python is allowed to merge any two immutable values, and pre-created small int values are not the only way this can happen. A Python implementation is never guaranteed to do this, but they all do it for more than just small ints.


For one thing, there are some other pre-created values, such as the empty tuple, str, and bytes, and some short strings (in CPython 3.6, it's the 256 single-character Latin-1 strings). For example:

>>> a = ()
>>> b = ()
>>> a is b
True

But also, even non-pre-created values can be identical. Consider these examples:

>>> c = 257
>>> d = 257
>>> c is d
False
>>> e, f = 258, 258
>>> e is f
True

And this isn't limited to int values:

>>> g, h = 42.23e100, 42.23e100
>>> g is h
True

Obviously, CPython doesn't come with a pre-created float value for 42.23e100. So, what's going on here?

The CPython compiler will merge constant values of some known-immutable types like int, float, str, bytes, in the same compilation unit. For a module, the whole module is a compilation unit, but at the interactive interpreter, each statement is a separate compilation unit. Since c and d are defined in separate statements, their values aren't merged. Since e and f are defined in the same statement, their values are merged.


You can see what's going on by disassembling the bytecode. Try defining a function that does e, f = 128, 128 and then calling dis.dis on it, and you'll see that there's a single constant value (128, 128)

>>> def f(): i, j = 258, 258
>>> dis.dis(f)
  1           0 LOAD_CONST               2 ((128, 128))
              2 UNPACK_SEQUENCE          2
              4 STORE_FAST               0 (i)
              6 STORE_FAST               1 (j)
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE
>>> f.__code__.co_consts
(None, 128, (128, 128))
>>> id(f.__code__.co_consts[1], f.__code__.co_consts[2][0], f.__code__.co_consts[2][1])
4305296480, 4305296480, 4305296480

You may notice that the compiler has stored 128 as a constant even though it's not actually used by the bytecode, which gives you an idea of how little optimization CPython's compiler does. Which means that (non-empty) tuples actually don't end up merged:

>>> k, l = (1, 2), (1, 2)
>>> k is l
False

Put that in a function, dis it, and look at the co_consts—there's a 1 and a 2, two (1, 2) tuples that share the same 1 and 2 but are not identical, and a ((1, 2), (1, 2)) tuple that has the two distinct equal tuples.


There's one more optimization that CPython does: string interning. Unlike compiler constant folding, this isn't restricted to source code literals:

>>> m = 'abc'
>>> n = 'abc'
>>> m is n
True

On the other hand, it is limited to the str type, and to strings of internal storage kind "ascii compact", "compact", or "legacy ready", and in many cases only "ascii compact" will get interned.


At any rate, the rules for what values must be, might be, or cannot be distinct vary from implementation to implementation, and between versions of the same implementation, and maybe even between runs of the same code on the same copy of the same implementation.

It can be worth learning the rules for one specific Python for the fun of it. But it's not worth relying on them in your code. The only safe rule is:

  • Do not write code that assumes two equal but separately-created immutable values are identical (don't use x is y, use x == y)
  • Do not write code that assumes two equal but separately-created immutable values are distinct (don't use x is not y, use x != y)

Or, in other words, only use is to test for the documented singletons (like None) or that are only created in one place in the code (like the _sentinel = object() idiom).




回答9:


is is the identity equality operator (functioning like id(a) == id(b)); it's just that two equal numbers aren't necessarily the same object. For performance reasons some small integers happen to be memoized so they will tend to be the same (this can be done since they are immutable).

PHP's === operator, on the other hand, is described as checking equality and type: x == y and type(x) == type(y) as per Paulo Freitas' comment. This will suffice for common numbers, but differ from is for classes that define __eq__ in an absurd manner:

class Unequal:
    def __eq__(self, other):
        return False

PHP apparently allows the same thing for "built-in" classes (which I take to mean implemented at C level, not in PHP). A slightly less absurd use might be a timer object, which has a different value every time it's used as a number. Quite why you'd want to emulate Visual Basic's Now instead of showing that it is an evaluation with time.time() I don't know.

Greg Hewgill (OP) made one clarifying comment "My goal is to compare object identity, rather than equality of value. Except for numbers, where I want to treat object identity the same as equality of value."

This would have yet another answer, as we have to categorize things as numbers or not, to select whether we compare with == or is. CPython defines the number protocol, including PyNumber_Check, but this is not accessible from Python itself.

We could try to use isinstance with all the number types we know of, but this would inevitably be incomplete. The types module contains a StringTypes list but no NumberTypes. Since Python 2.6, the built in number classes have a base class numbers.Number, but it has the same problem:

import numpy, numbers
assert not issubclass(numpy.int16,numbers.Number)
assert issubclass(int,numbers.Number)

By the way, NumPy will produce separate instances of low numbers.

I don't actually know an answer to this variant of the question. I suppose one could theoretically use ctypes to call PyNumber_Check, but even that function has been debated, and it's certainly not portable. We'll just have to be less particular about what we test for now.

In the end, this issue stems from Python not originally having a type tree with predicates like Scheme's number?, or Haskell's type class Num. is checks object identity, not value equality. PHP has a colorful history as well, where === apparently behaves as is only on objects in PHP5, but not PHP4. Such are the growing pains of moving across languages (including versions of one).




回答10:


It also happens with strings:

>>> s = b = 'somestr'
>>> s == b, s is b, id(s), id(b)
(True, True, 4555519392, 4555519392)

Now everything seems fine.

>>> s = 'somestr'
>>> b = 'somestr'
>>> s == b, s is b, id(s), id(b)
(True, True, 4555519392, 4555519392)

That's expected too.

>>> s1 = b1 = 'somestrdaasd ad ad asd as dasddsg,dlfg ,;dflg, dfg a'
>>> s1 == b1, s1 is b1, id(s1), id(b1)
(True, True, 4555308080, 4555308080)

>>> s1 = 'somestrdaasd ad ad asd as dasddsg,dlfg ,;dflg, dfg a'
>>> b1 = 'somestrdaasd ad ad asd as dasddsg,dlfg ,;dflg, dfg a'
>>> s1 == b1, s1 is b1, id(s1), id(b1)
(True, False, 4555308176, 4555308272)

Now that's unexpected.




回答11:


What’s New In Python 3.8: Changes in Python behavior:

The compiler now produces a SyntaxWarning when identity checks (is and is not) are used with certain types of literals (e.g. strings, ints). These can often work by accident in CPython, but are not guaranteed by the language spec. The warning advises users to use equality tests (== and !=) instead.



来源:https://stackoverflow.com/questions/60744998/why-does-python-interpreter-assign-the-same-identity-to-two-variables

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!