In Python 3.x, a string consists of items of Unicode ordinal. (See the quotation from the language reference below.) What is the internal representation of Unicode string? I
Looking at the source code for CPython 3.1.5, in Include/unicodeobject.h:
/* --- Unicode Type ------------------------------------------------------- */
typedef struct {
PyObject_HEAD
Py_ssize_t length; /* Length of raw Unicode data in buffer */
Py_UNICODE *str; /* Raw Unicode buffer */
long hash; /* Hash value; -1 if not set */
int state; /* != 0 if interned. In this case the two
* references from the dictionary to this object
* are *not* counted in ob_refcnt. */
PyObject *defenc; /* (Default) Encoded version as Python
string, or NULL; this is used for
implementing the buffer protocol */
} PyUnicodeObject;
The characters are stored as an array of Py_UNICODE
. On most platforms, I believe Py_UNICODE
is #define
d as wchar_t
.
The internal representation varies from latin-1, UCS-2 to UCS-4 . UCS means that the representaion is 2 or 4 bytes long and the unicode code-units are numerically equal to the corresponding code-points. We can check this by finding where the sizes of the code units change.
To show that they range from 1 byte of latin-1 to to 4 bytes of UCS-4:
>>> getsizeof('')
49
>>> getsizeof('a') #------------------ + 1 byte as the representaion here is latin-1
50
>>> getsizeof('\U0010ffff')
80
>>> getsizeof('\U0010ffff\U0010ffff') # + 4 bytes as the representation here is UCS-4
84
We can check that in the beginning representation is indeed latin-1 and not UTF-8 as the change to 2-byte code unit happens at the byte boundary and not at ''\U0000007f'
- '\U00000080'
boundary as in UTF-8:
>>> getsizeof('\U0000007f')
50
>>> getsizeof('\U00000080') #----------The size of the string changes at \x74 - \x80 boundary but..
74
>>> getsizeof('\U00000080\U00000080') # ..the size of the code-unit is still one. so not UTF-8
75
>>> getsizeof('\U000000ff')
74
>>> getsizeof('\U000000ff\U000000ff')# (+1 byte)
75
>>> getsizeof('\U00000100')
76
>>> getsizeof('\U00000100\U00000100') # Size change at byte boundary(+2 bytes). Rep is UCS-2.
78
>>> getsizeof('\U0000ffff')
76
>>> getsizeof('\U0000ffff\U0000ffff') # (+ 2 bytes)
78
>>> getsizeof('\U00010000')
80
>>> getsizeof('\U00010000\U00010000') # (+ 4 bytes) Thes size of the code unit changes to 4 at byte boundary again.
84
In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393.
For previous Pythons, the internal representation depends on the build flags of Python. Python can be built with flag values --enable-unicode=ucs2
or --enable-unicode=ucs4
. ucs2
builds do in fact use UTF-16 as their internal representation, and ucs4
builds use UCS-4 / UTF-32.
There has been NO CHANGE in Unicode internal representation between Python 2.X and 3.X.
It's definitely NOT UTF-16. UTF-anything is a byte-oriented EXTERNAL representation.
Each code unit (character, surrogate, etc) has been assigned a number from range(0, 2 ** 21). This is called its "ordinal".
Really, the documentation you quoted says it all. Most Python binaries use 16-bit ordinals which restricts you to the Basic Multilingual Plane ("BMP") unless you want to muck about with surrogates (handy if you can't find your hair shirt and your bed of nails is off being de-rusted). For working with the full Unicode repertoire, you'd prefer a "wide build" (32 bits wide).
Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).
>>> import array; s = 'Привет мир!'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b'\x1f\x04\x00\x00@\x04\x00\x008\x04\x00\x002\x04\x00\x005\x04\x00\x00B\x04\x00\x00 \x00\x00\x00<\x04\x00\x008\x04\x00\x00@\x04\x00\x00!\x00\x00\x00'
True
>>> import array; s = 'test'; b = array.array('u', s).tobytes(); print(b); print(len(s) * 4 == len(b))
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'
True
>>>
The internal representation will change in Python 3.3 which implements PEP 393. The new representation will pick one or several of ascii, latin-1, utf-8, utf-16, utf-32, generally trying to get a compact representation.
Implicit conversions into surrogate pairs will only be done when talking to legacy APIs (those only exist on windows, where wchar_t is two bytes); the Python string will be preserved. Here are the release notes.