'is' operator behaves differently when comparing strings with spaces

耗尽温柔 提交于 2019-11-26 15:27:35
Elazar

Warning: this answer is about the implementation details of a specific python interpreter. comparing strings with is==bad idea.

Well, at least for cpython3.4/2.7.3, the answer is "no, it is not the whitespace". Not only the whitespace:

  • Two string literals will share memory if they are either alphanumeric or reside on the same block (file, function, class or single interpreter command)

  • An expression that evaluates to a string will result in an object that is identical to the one created using a string literal, if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.

  • Single characters are unique.

Examples

Alphanumeric string literals always share memory:

>>> x='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> y='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> x is y
True

Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:

(interpreter)

>>> x='`!@#$%^&*() \][=-. >:"?<a'; y='`!@#$%^&*() \][=-. >:"?<a';
>>> z='`!@#$%^&*() \][=-. >:"?<a';
>>> x is y
True 
>>> x is z
False 

(file)

x='`!@#$%^&*() \][=-. >:"?<a';
y='`!@#$%^&*() \][=-. >:"?<a';
z=(lambda : '`!@#$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)

Output: True and False

For simple binary operations, the compiler is doing very simple constant propagation (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 charcters. If this is the case, the rules mentioned earlier are in force:

>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaa' + 'aaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False

Single characters always share memory, of course:

>>> chr(0x20) is ' '
True

To expand on Ignacio’s answer a bit: The is operator is the identity operator. It is used to compare object identity. If you construct two objects with the same contents, then it is usually not the case that the object identity yields true. It works for some small strings because CPython, the reference implementation of Python, stores the contents separately, making all those objects reference to the same string content. So the is operator returns true for those.

This however is an implementation detail of CPython and is generally neither guaranteed for CPython nor any other implementation. So using this fact is a bad idea as it can break any other day.

To compare strings, you use the == operator which compares the equality of objects. Two string objects are considered equal when they contain the same characters. So this is the correct operator to use when comparing strings, and is should be generally avoided if you do not explicitely want object identity (example: a is False).


If you are really interested in the details, you can find the implementation of CPython’s strings here. But again: This is implementation detail, so you should never require this to work.

The is operator relies on the id function, which is guaranteed to be unique among simultaneously existing objects. Specifically, id returns the object's memory address. It seems that CPython has consistent memory addresses for strings containing only characters a-z and A-Z.

However, this seems to only be the case when the string has been assigned to a variable:

Here, the id of "foo" and the id of a are the same. a has been set to "foo" prior to checking the id.

>>> a = "foo"
>>> id(a)
4322269384
>>> id("foo")
4322269384

However, the id of "bar" and the id of a are different when checking the id of "bar" prior to setting a equal to "bar".

>>> id("bar")
4322269224
>>> a = "bar"
>>> id(a)
4322268984

Checking the id of "bar" again after setting a equal to "bar" returns the same id.

>>> id("bar")
4322268984

So it seems that cPython keeps consistent memory addresses for strings containing only a-zA-Z when those strings are assigned to a variable. It's also entirely possible that this is version dependent: I'm running python 2.7.3 on a macbook. Others might get entirely different results.

In fact your code amounts to comparing objects id (i.e. their physical address). So instead of your is comparison:

>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False

You can do:

>>> id(a) == id(b)
False

But, note that if a and b were directly in the comparison it would work.

>>> id('is it the space?') == id('is it the space?')
True

In fact, in an expression there's sharing between the same static strings. But, at the program scale there's only sharing for word-like strings (so neither spaces nor punctuations).

You should not rely on this behavior as it's not documented anywhere and is a detail of implementation.

'is' operator compare the actual object.

c is d should also be false. My guess is that python make some optimization and in that case, it is the same object.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!