Python 2 interned "all name-character" strings up to 20 code points long:
Python 2.7.15 (default, Feb 9 2019, 16:01:32)
[GCC 4.2.1 Compatible Apple LLVM
The interning happens when the code, e.g. 'a'*4096, is compiled. Normally, compilation would produce the following bytecode:
    0 LOAD_CONST 1 ('a')
    2 LOAD_CONST 2 (4096)
    4 BINARY_MULTIPLY
However, because both operands are constants, the BINARY_MULTIPLY can be constant-folded at compile time. In CPython 3.7 this is done by the AST optimizer (in earlier versions, by the peephole optimizer), in fold_binop:
    static int
    fold_binop(expr_ty node, PyArena *arena, int optimize)
    {
        ...
        PyObject *newval;
        switch (node->v.BinOp.op) {
        ...
        case Mult:
            newval = safe_multiply(lv, rv);
            break;
        case Div:
        ...
        }
        return make_const(node, newval, arena);
    }
If safe_multiply succeeds, the result is added to the list of constants in make_const; if safe_multiply returns NULL, nothing happens and the optimization is not performed.
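One way to observe whether folding took place is to look at the constants stored in a compiled code object. The following is only a sketch: string_const_lengths is a hypothetical helper introduced for illustration, and what it reports depends on the interpreter version in use.

```python
def string_const_lengths(source):
    """Return the sorted lengths of all str constants stored in the code
    object compiled from `source` (hypothetical helper, illustration only)."""
    code = compile(source, "<demo>", "eval")
    return sorted(len(c) for c in code.co_consts if isinstance(c, str))


if __name__ == "__main__":
    # If the multiplication was folded, a constant of the full length shows
    # up; otherwise only the short operand 'a' (length 1) is stored.
    print(string_const_lengths("'a'*4096"))
    print(string_const_lengths("'a'*4097"))
```

Comparing the two printed lists on a given interpreter shows on which side of the folding limit each expression falls.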
safe_multiply performs the multiplication only if the resulting string is not longer than 4096 characters:
    #define MAX_STR_SIZE 4096 /* characters */
    static PyObject *
    safe_multiply(PyObject *v, PyObject *w)
    {
        ...
        else if (PyLong_Check(v) && (PyUnicode_Check(w) || PyBytes_Check(w))) {
            Py_ssize_t size = PyUnicode_Check(w) ? PyUnicode_GET_LENGTH(w) :
                                                   PyBytes_GET_SIZE(w);
            if (size) {
                long n = PyLong_AsLong(v);
                if (n < 0 || n > MAX_STR_SIZE / size) { //HERE IS THE CHECK!
                    return NULL;
                }
            }
        }
        ...
    }
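The size guard quoted above can be restated in Python to make the boundary easier to see. This is a minimal sketch of just that one branch, not of the whole C function; the name repeat_may_be_folded is made up for this illustration.

```python
MAX_STR_SIZE = 4096  # same value as the C macro quoted above


def repeat_may_be_folded(size, n, max_str_size=MAX_STR_SIZE):
    """Mirror the quoted check: `size` is the length of the str/bytes
    operand, `n` the integer operand. Returns False where the C code
    returns NULL, True otherwise."""
    if size:
        # Integer division, as in the C expression MAX_STR_SIZE / size.
        if n < 0 or n > max_str_size // size:
            return False
    return True
```

For a one-character operand the guard accepts a repeat count of 4096 and rejects 4097; for a two-character operand the boundary moves to 2048.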
Next, string constants (those consisting only of name characters) are interned when the code object is created in PyCode_New:
    PyCodeObject *
    PyCode_New(..., PyObject *consts,...)
    {
        ...
        intern_string_constants(consts);
        ...
Thus 'a'*4096 becomes interned and 'a'*4097 does not, and in the first case the optimized bytecode is now:
    0 LOAD_CONST 3 ('aaaaaaaaa...aaaaaaaaaaa')
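The effect on object identity can be probed by evaluating the same expression from two separately compiled code objects. This is a sketch only: same_object_across_compiles is a hypothetical helper, and since interning is an implementation detail, the outcome may differ between interpreter versions.

```python
def same_object_across_compiles(n):
    """Evaluate 'a'*n from two separately compiled code objects and report
    whether both evaluations return the very same object."""
    source = "'a'*%d" % n
    first = eval(compile(source, "<first>", "eval"))
    second = eval(compile(source, "<second>", "eval"))
    return first is second


if __name__ == "__main__":
    for n in (4096, 4097):
        print(n, same_object_across_compiles(n))
```

If the constant is folded and interned, both code objects refer to one shared string; if the product is computed at run time, each evaluation builds a new string.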
It looks as if this is the responsible commit and this is the corresponding bug.
The behavior changed in Python 3.7; in Python 3.6 the limit was still 20.
In the old version, the product was computed first, and only afterwards was the size of the result checked to be less than 21:
    static int
    fold_binop(expr_ty node, PyArena *arena)
    {
        ...
        PyObject *newval;
        switch (node->v.BinOp.op) {
        ...
        case Mult:
            newval = PyNumber_Multiply(lv, rv);
            break;
        ...
        default: // Unknown operator
            return 1;
        }
        /* Avoid creating large constants. */
        Py_ssize_t size = PyObject_Size(newval);
        if (size == -1) {
            ...
        }
        else if (size > 20) { //HERE IS THE CHECK OF SIZE
            Py_DECREF(newval);
            return 1;
        }
        ...
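The older strategy (multiply first, then check the size) can be sketched in Python as follows. The name old_fold_multiply is hypothetical, and the handling of results that have no length is an assumption, because that branch is elided in the excerpt above.

```python
def old_fold_multiply(lv, rv, limit=20):
    """Sketch of the older strategy: compute the product first, then
    discard it if it is too large. Returns None when folding is abandoned."""
    newval = lv * rv
    try:
        size = len(newval)
    except TypeError:
        # Assumption: results without a length (e.g. ints) are kept as is.
        return newval
    if size > limit:
        return None
    return newval
```

Note that the limit applies to any sized result, so old_fold_multiply((1,), 21) is abandoned just like old_fold_multiply('a', 21), whereas the newer check bounds the size before the product is ever built.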
As for why the old value was 20 and the current one is 4096, I can only speculate. However, 4096 doesn't sound like too much for unicode objects on modern machines, while 20 was a threshold for all classes, not only unicode.