Can someone explain this to me? I've been playing with the id() function in Python and came across this:
>>> id('cat')
5181152
>>> a = 'c
Python reuses string literals fairly aggressively. The rules by which it does so are implementation-dependent, but CPython uses two that I'm aware of:
"cat", it always refers to the same string object.
def foo(): return "pack my box with five dozen liquor jugs"
def bar(): return "pack my box with five dozen liquor jugs"
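assert foo() is bar() # AssertionError (CPython-specific; newer versions may merge the constants when both functions are compiled together)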
Both optimizations are done at compile time (that is, when the bytecode is generated).
On the other hand, something like chr(99) + chr(97) + chr(116) is a string expression that evaluates to the string "cat". In a dynamic language like Python, its value can't be known at compile time (chr() is a built-in function, but you might have reassigned it), so it normally isn't interned. Thus its id() is different from that of "cat". However, you can force a string to be interned using the intern() function (sys.intern() in Python 3). Thus:
id(intern(chr(99) + chr(97) + chr(116))) == id("cat") # True
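In Python 3 the same check can be written with sys.intern(). The following is a minimal sketch, assuming a Python 3 interpreter; the variable names are illustrative only:

```python
import sys

# Build "cat" at run time so the compiler cannot fold it into a constant.
built = chr(99) + chr(97) + chr(116)
literal = "cat"

# Equal contents are guaranteed; sharing a single object is not.
assert built == literal

# sys.intern() returns the canonical copy from the interned-string table,
# so two equal strings map to the same object once both are interned.
same_object = sys.intern(built) is sys.intern(literal)
print(same_object)  # True
```

Comparing with `is` after interning is only safe because both sides went through sys.intern(); without that call the result is up to the interpreter.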
As others have mentioned, interning is possible because strings are immutable. It isn't possible to change "cat" to "dog", in other words. You have to generate a new string object, which means that there's no danger that other names pointing to the same string will be affected.
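A small sketch of what immutability means in practice (the names a and b and the replace() call are just for illustration):

```python
a = "cat"
b = a  # b refers to the same string object as a

# Strings do not support item assignment, so the object cannot be edited in place.
try:
    a[0] = "b"
    mutated = True
except TypeError:
    mutated = False

# "Changing" a really builds a new string and rebinds the name a.
a = a.replace("c", "b")

print(mutated)  # False
print(a, b)     # bat cat
```

Because the original object never changes, b still sees "cat" after a has been rebound.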
Just as an aside, Python also converts expressions containing only constants (like "c" + "a" + "t") to constants at compile time, as the disassembly below shows. These will be optimized to point to identical string objects per the rules above.
>>> def foo(): "c" + "a" + "t"
...
>>> from dis import dis; dis(foo)
  1           0 LOAD_CONST               5 ('cat')
              3 POP_TOP
              4 LOAD_CONST               0 (None)
              7 RETURN_VALUE
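The listing above comes from an older CPython, and the bytecode format has changed since. As a rough sketch, assuming CPython 3, the folded constant can also be checked through the function's code object. Here foo returns the expression so that the compiler keeps it:

```python
import dis

def foo():
    return "c" + "a" + "t"

# co_consts holds the constants the compiler stored for this function.
# With constant folding, the single string 'cat' is expected to be among them.
folded = "cat" in foo.__code__.co_consts
print(folded)

# Print the bytecode; the exact listing varies between Python versions.
dis.dis(foo)
```

This relies on a compiler optimization, so other implementations are free to behave differently.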
The code you posted creates new strings as intermediate objects. These strings eventually have the same contents as your original, but while they are being built they do not match it, so they must be kept at a distinct address.
>>> id('cat')
5181152
As others have answered, by issuing these instructions, you cause the Python VM to create a string object containing the string "cat". This string object is cached and is at address 5181152.
>>> a = 'cat'
>>> id(a)
5181152
Again, a has been assigned to refer to this cached string object at 5181152, containing "cat".
>>> a = a[0:2]
>>> id(a)
27731511
At this point in my modified version of your program, you have created two small string objects: 'cat' and 'ca'. 'cat' still exists in the cache. The string to which a refers is a different and probably novel string object, containing the characters 'ca'.
>>> a = a + 't'
>>> id(a)
39964224
Now you have created another new string object. This object is the concatenation of the string 'ca' at address 27731511 and the string 't'. This concatenation does match the previously cached string 'cat', but Python does not automatically detect this case. As kindall indicated, you can force the search with the intern() function.
Hopefully this explanation illuminates the steps by which the address of a changed.
Your code did not include the intermediate state with a assigned the string 'ca'. The answer still applies, because the Python interpreter does generate a new string object to hold the intermediate result a[0:2], whether you assign that intermediate result to a variable or not.
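To watch the steps yourself, something like the sketch below records the value and id() after each assignment. The steps list is a hypothetical helper; it keeps every intermediate string alive, since an id() is only meaningful while its object exists and may be reused afterwards.

```python
steps = []

a = "cat"
steps.append(("a = 'cat'", a, id(a)))

a = a[0:2]   # slicing builds a new string, 'ca'
steps.append(("a = a[0:2]", a, id(a)))

a = a + "t"  # concatenation builds another string with the contents 'cat'
steps.append(("a = a + 't'", a, id(a)))

for label, value, address in steps:
    print(f"{label:<12} {value!r:<6} {address}")
```

The first two ids always differ because 'cat' and 'ca' are distinct live objects. Whether the last id matches the first is up to the interpreter.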
All values must reside somewhere in memory. This is why id('cat') produces a value. You call it a "non-existent" string, but it clearly does exist; it just hasn't been assigned to a name yet.
Strings are immutable, so the interpreter can do clever things like make all instances of the literal 'cat' be the same object, so that id(a) and id(b) are the same.
Operating on strings will produce new strings. These may or may not be the same strings as previous strings with the same content.
Note that all of these details are implementation details of CPython, and they can change at any time. You don't need to be concerned with these issues in actual programs.
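In practice this means comparing strings by value with == rather than by identity. A minimal illustration, with made-up variable names:

```python
parts = ["c", "a", "t"]
joined = "".join(parts)  # assembled at run time
literal = "cat"

# == compares the characters, so the result does not depend on how
# or where either string object was created.
contents_match = joined == literal
print(contents_match)  # True

# `joined is literal` would ask whether both names refer to one object.
# That answer depends on the interpreter, so it should not be used to compare text.
```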
'cat' has an address because you create it in order to pass it to id(). You haven't yet bound it to a name, but the object still exists.
Python caches and reuses short strings. But if you assemble strings by concatenation, then the code that searches the cache and attempts re-use is bypassed.
Note that the inner workings of the string cache are pure implementation detail and should not be relied upon.
Python variables are rather unlike variables in other languages (say, C).
In many other languages, a variable is a name for a location in memory. In these languages, different kinds of variables can refer to different kinds of locations, and the same location can be given multiple names. For the most part, the data at a given memory location can change from time to time. There are also ways to refer to memory locations indirectly (int *p would contain an address, and the memory location at that address holds an integer). But the actual location a variable refers to cannot change; the variable is the location. A variable assignment in these languages is effectively "Look up the location for this variable, and copy this data into that location".
Python doesn't work that way. In Python, actual objects go in some memory location, and variables are like tags for locations. Python manages the stored values separately from how it manages the variables. Essentially, an assignment in Python means "Look up the information for this variable, forget the location it already refers to, and replace that with this new location". No data is copied.
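The "tag" behaviour is easiest to see with a mutable object. The sketch below uses a list only because a list can be changed in place; the names first and second are illustrative:

```python
first = [1, 2, 3]
second = first            # no copy: both names are tags on one list object
shared = first is second  # True while both tags point at the same object

second.append(4)          # change the object through one name...
print(first)              # [1, 2, 3, 4]  ...and it is visible through the other

second = [9]              # rebinding moves only the tag "second"
print(first)              # [1, 2, 3, 4]
print(first is second)    # False
```

Assignment never touched the list's contents; only append() did.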
A common feature of languages that work like Python (as opposed to the first kind discussed above) is that some kinds of objects are managed in a special way: identical values are cached so that they don't take up extra memory and so that they can be compared very cheaply (if they have the same address, they are equal). This process is called interning. CPython interns many string literals (along with a few other kinds of objects), although dynamically created strings may not be.
In your exact code, the sequence of events would be:
# before anything, since 'cat' is a literal constant, add it to the intern cache
>>> id('cat')  # grab the constant 'cat' from the intern cache and look up
               # its address
5181152
>>> a = 'cat'  # grab the constant 'cat' from the intern cache and
               # make the variable "a" point to its location
>>> b = 'cat'  # do the same thing with the variable "b"
>>> id(a)      # look up the object "a" currently points to,
               # then look up that object's address
5181152
>>> id(b)      # look up the object "b" currently points to,
               # then look up that object's address
5181152