A question regarding string instance uniqueness in Python

一个人的身影 2020-12-21 08:37

I was trying to figure out which integers Python only instantiates once (-5 to 256 it seems), and in the process stumbled on some string behaviour I can't see the pattern i

4 answers
  • 2020-12-21 09:00

    I believe short strings that can be evaluated at compile time will be interned automatically. In the last examples, the result can't be evaluated at compile time because str or join might be redefined.
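One way to check what the compiler evaluates ahead of time is to compile a statement and look at the constants stored in the resulting code object. The sketch below assumes CPython; the helper name string_constants is made up for illustration, and other implementations or versions may fold more or fewer expressions.

```python
def string_constants(source):
    """Return the string constants stored in the compiled code object."""
    code = compile(source, "<sketch>", "exec")
    return [c for c in code.co_consts if isinstance(c, str)]


# A concatenation of two literals is typically folded into one constant.
print(string_constants('C = "100" + "00"'))

# A method call is left for run time, so only its operands appear.
print(string_constants('H = "0".join(("10", "00"))'))
```

If the folded value shows up in the first list but not in the second, the expression was computed at compile time in the first case only.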

  • 2020-12-21 09:06

    Python is allowed to inline string constants; A, B, C and D are actually the same literal (if Python sees a constant expression, it treats it as a constant).

    str is actually a class, so str(whatever) calls this class's constructor, which should yield a fresh object. This explains E, F and G (note that each of these has a separate identity).

    As for H, I am not sure, but I'd go for explanation that this expression is too complicated for Python to figure out it's actually a constant, so it computes a new string.

  • 2020-12-21 09:16

    In answer to S.Lott's suggestion of examining the byte code:

    import dis
    def moo():
        A = "10000"
        B = "10000"
        C = "100" + "00"
        D = "%i"%10000
        E = str(10000)
        F = str(10000)
        G = "1000"+str(0)
        H = "0".join(("10","00"))
        I = str("10000")
    
        for obj in (A, B, C, D, E, F, G, H, I):
            print obj, id(obj), obj is A  # Python 2 print statement
    moo()
    dis.dis(moo)  # dis.dis prints the disassembly itself and returns None
    

    yields:

    10000 4968128 True
    10000 4968128 True
    10000 4968128 True
    10000 4968128 True
    10000 2840928 False
    10000 2840896 False
    10000 2840864 False
    10000 2840832 False
    10000 4968128 True
      4           0 LOAD_CONST               1 ('10000')
                  3 STORE_FAST               0 (A)
    
      5           6 LOAD_CONST               1 ('10000')
                  9 STORE_FAST               1 (B)
    
      6          12 LOAD_CONST              10 ('10000')
                 15 STORE_FAST               2 (C)
    
      7          18 LOAD_CONST              11 ('10000')
                 21 STORE_FAST               3 (D)
    
      8          24 LOAD_GLOBAL              0 (str)
                 27 LOAD_CONST               5 (10000)
                 30 CALL_FUNCTION            1
                 33 STORE_FAST               4 (E)
    
      9          36 LOAD_GLOBAL              0 (str)
                 39 LOAD_CONST               5 (10000)
                 42 CALL_FUNCTION            1
                 45 STORE_FAST               5 (F)
    
     10          48 LOAD_CONST               6 ('1000')
                 51 LOAD_GLOBAL              0 (str)
                 54 LOAD_CONST               7 (0)
                 57 CALL_FUNCTION            1
                 60 BINARY_ADD          
                 61 STORE_FAST               6 (G)
    
     11          64 LOAD_CONST               8 ('0')
                 67 LOAD_ATTR                1 (join)
                 70 LOAD_CONST              12 (('10', '00'))
                 73 CALL_FUNCTION            1
                 76 STORE_FAST               7 (H)
    
     12          79 LOAD_GLOBAL              0 (str)
                 82 LOAD_CONST               1 ('10000')
                 85 CALL_FUNCTION            1
                 88 STORE_FAST               8 (I)
    
     14          91 SETUP_LOOP              66 (to 160)
                 94 LOAD_FAST                0 (A)
                 97 LOAD_FAST                1 (B)
                100 LOAD_FAST                2 (C)
                103 LOAD_FAST                3 (D)
                106 LOAD_FAST                4 (E)
                109 LOAD_FAST                5 (F)
                112 LOAD_FAST                6 (G)
                115 LOAD_FAST                7 (H)
                118 LOAD_FAST                8 (I)
                121 BUILD_TUPLE              9
                124 GET_ITER            
            >>  125 FOR_ITER                31 (to 159)
                128 STORE_FAST               9 (obj)
    
     15         131 LOAD_FAST                9 (obj)
                134 PRINT_ITEM          
                135 LOAD_GLOBAL              2 (id)
                138 LOAD_FAST                9 (obj)
                141 CALL_FUNCTION            1
                144 PRINT_ITEM          
                145 LOAD_FAST                9 (obj)
                148 LOAD_FAST                0 (A)
                151 COMPARE_OP               8 (is)
                154 PRINT_ITEM          
                155 PRINT_NEWLINE       
                156 JUMP_ABSOLUTE          125
            >>  159 POP_BLOCK           
            >>  160 LOAD_CONST               0 (None)
                163 RETURN_VALUE        
    

    So it would seem that the compiler does indeed understand A-D to mean the same thing, and so it saves memory by only generating the string once (as suggested by Alex, Maciej and Greg). (The added case I seems to just be str() realising it's trying to make a string from a string, and passing it straight through.)

    Thanks everyone, that's a lot clearer now.
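The experiment above uses Python 2 syntax. A rough Python 3 port might look like the sketch below; returning the rows instead of printing inside the function is a choice made here for convenience, not part of the original. Which rows report True is implementation- and version-dependent, so no expected output is shown.

```python
import dis


def moo():
    A = "10000"
    B = "10000"
    C = "100" + "00"
    D = "%i" % 10000
    E = str(10000)
    F = str(10000)
    G = "1000" + str(0)
    H = "0".join(("10", "00"))
    I = str("10000")
    # Pair each name with its object so the report stays readable.
    names = "ABCDEFGHI"
    objs = (A, B, C, D, E, F, G, H, I)
    return [(name, obj, id(obj), obj is A) for name, obj in zip(names, objs)]


if __name__ == "__main__":
    for name, obj, ident, same_as_a in moo():
        print(name, obj, ident, same_as_a)
    # dis.dis writes the disassembly to stdout and returns None.
    dis.dis(moo)
```

All nine values compare equal with ==; only the identity column can differ between interpreters.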

  • 2020-12-21 09:17

    In terms of language specification, any compliant Python compiler and runtime is fully allowed, for any instance of an immutable type, to make a new instance OR find an existing instance of the same type that's equal to the required value and use a new reference to that same instance. This means it's always incorrect to use is or by-id comparison among immutables, and any minor release may tweak or change strategy in this matter to enhance optimization.
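As a small illustration of that point, the sketch below builds two equal strings at run time and compares them both ways. The variable names are arbitrary. Only the == result and the behaviour of sys.intern are things you can rely on; the plain is result is whatever the implementation happens to do.

```python
import sys

# Build two equal strings at run time so the compiler cannot merge them.
a = "".join(["100", "00"])
b = str(10000)

print(a == b)  # value comparison: the portable way to compare strings
print(a is b)  # identity: implementation-dependent, do not rely on it

# sys.intern returns the canonical copy from the interpreter's intern
# table, so equal strings map to the same object after interning.
ia, ib = sys.intern(a), sys.intern(b)
print(ia is ib)
```

Explicit interning is the supported way to get identity-comparable strings when that is really needed, for example for a large number of repeated dictionary keys.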

    In terms of implementations, the tradeoffs are pretty clear: trying to reuse an existing instance may mean time spent (perhaps wasted) trying to find such an instance, but if the attempt succeeds then some memory is saved (as well as the time to allocate and later free the memory bits needed to hold a new instance).

    How to solve those implementation tradeoffs is not entirely obvious -- if you can identify heuristics that indicate that finding a suitable existing instance is likely and the search (even if it fails) will be fast, then you may want to attempt the search-and-reuse when the heuristics suggest it, but skip it otherwise.

    In your observations you seem to have found a particular dot-release implementation that performs a modicum of peephole optimization when that's entirely safe, fast, and simple, so the assignments A to D all boil down to exactly the same as A (but E to H don't, as they involve named functions or methods that the optimizer's authors may reasonably have considered not 100% safe to assume semantics for -- and low-ROI if that was done -- so they're not peephole-optimized).

    Thus, A to D reusing the same instance boils down to A and B doing so (as C and D get peephole-optimized to exactly the same construct).

    That reuse, in turn, clearly suggests compiler tactics/optimizer heuristics whereby identical literal constants of an immutable type in the same function's local namespace are collapsed to references to just one instance in the function's .func_code.co_consts (to use CPython 2's terminology for attributes of functions and code objects; Python 3 spells it .__code__.co_consts) -- reasonable tactics and heuristics, as reuse of the same immutable constant literal within one function is somewhat frequent, AND the price is only paid once (at compile time) while the advantage is accrued many times (every time the function runs, maybe within loops etc etc).
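The collapsing of identical literals can be observed directly on a function's code object. This is a minimal sketch assuming CPython 3, where the attribute is spelled __code__ rather than func_code; the function name sample is made up for illustration, and the exact contents of co_consts vary between versions.

```python
def sample():
    a = "10000"
    b = "10000"
    return a, b


consts = sample.__code__.co_consts
print(consts)                  # the constants table of this function
print(consts.count("10000"))   # how many entries hold the literal

a, b = sample()
print(a is b)  # both names load the same table entry on CPython
```

Because both assignments load the same entry from the table, the two names end up bound to one object without any lookup cost at run time.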

    (It so happens that these specific tactics and heuristics, given their clearly-positive tradeoffs, have been pervasive in all recent versions of CPython, and, I believe, IronPython, Jython, and PyPy as well;-).

    This is a somewhat worthy and interesting area of study if you're planning to write compilers, runtime environments, peephole optimizers, etc etc, for Python itself or similar languages. I guess that deep study of the internals (ideally of many different correct implementations, of course, so as not to fixate on the quirks of a specific one -- good thing Python currently enjoys at least 4 separate production-worthy implementations, not to mention several versions of each!) can also help, indirectly, make one a better Python programmer -- but it's particularly important to focus on what's guaranteed by the language itself, which is somewhat less than what you'll find in common among separate implementations, because the parts that "just happen" to be in common right now (without being required to be so by the language specs) may perfectly well change under you at the next point release of one or another implementation and, if your production code was mistakenly relying on such details, that might cause nasty surprises;-). Plus -- it's hardly ever necessary, or even particularly helpful, to rely on such variable implementation details rather than on language-mandated behavior (unless you're coding something like an optimizer, debugger, profiler, or the like, of course;-).
