Change in max length of interned strings in CPython

醉话见心 2020-12-06 13:05

Python 2 interned "all name-character" strings up to 20 code points long:

Python 2.7.15 (default, Feb  9 2019, 16:01:32) 
[GCC 4.2.1 Compatible Apple LLVM          


        
1 answer
  • 2020-12-06 13:57

    The interning happens when the code, i.e. 'a'*4096, is compiled. Without optimization, the expression would compile to the following bytecode:

     0 LOAD_CONST               1 ('a')
     2 LOAD_CONST               2 (4096)
     4 BINARY_MULTIPLY
    

    However, because both operands are constants, BINARY_MULTIPLY can be constant-folded at compile time. In CPython 3.7 and later this is done by the AST optimizer, in fold_binop:

    static int
    fold_binop(expr_ty node, PyArena *arena, int optimize)
    {
        ...
        PyObject *newval;
    
        switch (node->v.BinOp.op) {
        ...
        case Mult:
            newval = safe_multiply(lv, rv);
            break;
        case Div:
         ...
        }
    
        return make_const(node, newval, arena);
    }
    

    If safe_multiply succeeds, make_const replaces the expression with the resulting constant. If safe_multiply returns NULL, nothing happens: the optimization is not performed and the multiplication is left for run time.
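One way to observe whether folding took place is to compile the expression and look at the constants stored on the resulting code object. The following is only a minimal sketch: the helper name `folded_strings` is made up for illustration, and what it reports depends on the CPython version in use.

```python
def folded_strings(source):
    """Return the sorted lengths of all str constants in the compiled code object."""
    code = compile(source, "<example>", "eval")
    return sorted(len(c) for c in code.co_consts if isinstance(c, str))


# If the product was folded, a constant with the full length shows up;
# otherwise only the one-character operand 'a' is stored.
for n in (3, 4096, 4097):
    print(n, folded_strings(f"'a'*{n}"))
```

A length equal to n in the output means the compiler stored the finished product. A lone 1 means only the operand was stored and the multiplication is deferred to run time.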

    safe_multiply performs the multiplication only if the resulting string would not be longer than 4096 characters:

    #define MAX_STR_SIZE          4096  /* characters */
    
    static PyObject *
    safe_multiply(PyObject *v, PyObject *w)
    {
        ...
        else if (PyLong_Check(v) && (PyUnicode_Check(w) || PyBytes_Check(w))) {
            Py_ssize_t size = PyUnicode_Check(w) ? PyUnicode_GET_LENGTH(w) :
                                                   PyBytes_GET_SIZE(w);
            if (size) {
                long n = PyLong_AsLong(v);
                if (n < 0 || n > MAX_STR_SIZE / size) {  //HERE IS THE CHECK!
                    return NULL;
                }
            }
        }
        ...
    }
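The guard can be mirrored in a few lines of Python to see which repeat counts pass. This is a sketch of the check in the excerpt only, not of the whole function; the name `may_fold_repeat` is hypothetical.

```python
MAX_STR_SIZE = 4096  # same limit as in the C excerpt


def may_fold_repeat(n, size):
    """Mirror the guard shown above: True if int * str passes the size check."""
    if size == 0:
        # The C code skips the check entirely for empty strings.
        return True
    # C integer division of two positive values equals Python's floor division.
    return not (n < 0 or n > MAX_STR_SIZE // size)


print(may_fold_repeat(4096, 1))  # 'a' * 4096
print(may_fold_repeat(4097, 1))  # 'a' * 4097
print(may_fold_repeat(2049, 2))  # 'ab' * 2049
```

Dividing the limit by the operand length, rather than multiplying n by size, avoids any risk of overflow in the C code and never has to build the large string just to measure it.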
    

    Now, string constants such as this one are interned once the code object is created in PyCode_New:

    PyCodeObject *
    PyCode_New(..., PyObject *consts,...)
    {
        ...
        intern_string_constants(consts);
        ...
    

    Thus 'a'*4096 becomes interned while 'a'*4097 does not. In the first case the optimized bytecode is now:

     0 LOAD_CONST               3 ('aaaaaaaaa...aaaaaaaaaaa')
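The effect can be checked from Python by compiling the same expression twice and comparing object identity. Because the two code objects are compiled separately, a shared result can only come from interning. This is a sketch with a made-up helper name, `same_object_across_compiles`. Going by the explanation above, on CPython 3.7 and later it should report one shared object for 4096 and two distinct objects for 4097.

```python
def same_object_across_compiles(expression):
    """Compile and evaluate the same source twice; report whether both results are one object."""
    first = eval(compile(expression, "<first>", "eval"))
    second = eval(compile(expression, "<second>", "eval"))
    return first is second


for n in (4096, 4097):
    print(n, same_object_across_compiles(f"'a'*{n}"))
```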
    

    It looks as if this is the responsible commit and this is the corresponding bug.

    The behavior changed in Python 3.7; in Python 3.6 the limit was still 20.

    In the old version, the product was computed first, and only afterwards was the size of the result checked to be less than 21:

    static int
    fold_binop(expr_ty node, PyArena *arena)
    { 
        ...
        PyObject *newval;
    
        switch (node->v.BinOp.op) {
        ...
        case Mult:
            newval = PyNumber_Multiply(lv, rv);
            break;
        ...
        default: // Unknown operator
            return 1;
        }
    
        /* Avoid creating large constants. */
        Py_ssize_t size = PyObject_Size(newval);
        if (size == -1) {
           ...
        }
        else if (size > 20) {    //HERE IS THE CHECK OF SIZE
            Py_DECREF(newval);
            return 1;
        }
        ...
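For contrast with the newer check-before-multiply guard, the older flow can be modelled in Python as below. This is a rough sketch: `old_style_fold` is a hypothetical name, and since the `size == -1` branch is elided in the excerpt, the sketch simply assumes that results without a length (such as integers) are kept.

```python
def old_style_fold(left, right):
    """Model of the older approach: multiply first, then reject large results."""
    product = left * right      # the full product is always computed
    try:
        size = len(product)
    except TypeError:
        return product          # assumption: results without a length are kept
    if size > 20:
        return None             # too large: the expression is left unfolded
    return product


print(old_style_fold('a', 20))
print(old_style_fold('a', 21))
print(old_style_fold((0,), 21))  # the limit applies to any sized object
```

Note that the threshold applies to any sized result, tuples included, and that the product is built even when it is then thrown away.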
    

    As for why the old value was 20 and the current one is 4096, I can only speculate. However, 4096 does not sound like too much for unicode objects on modern machines, while 20 was a threshold for all classes, not only unicode.
