I know that Django uses Unicode strings all over the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use Unicode?
Hey! I'd like to add some stuff to the other answers, but unfortunately I don't have enough rep yet to do that properly :-(
FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.
Here are a few comments:
1. In Python 2.6 and later, `from __future__ import unicode_literals` makes unprefixed string literals Unicode strings, which is already the default in Py3k.
2. You can declare the encoding of a source file with a comment such as `# -*- coding: utf-8 -*-` on its first or second line. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.
3. Python has to use some encoding internally for Unicode strings (`str` in py3k, `unicode` in 2.x) because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end-user. Unfortunately nothing's perfect and you can occasionally run into problems with this, specifically if you use funky squiggles outside of the Unicode Basic Multilingual Plane (BMP).

   Since Python 2.2, we've had what are called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore cannot encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs).

   To check, test the value of `sys.maxunicode`. If it's `1114111`, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points `0x0000` to `0xFFFF`) covers most people's needs. For more information, see PEP 0261.
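
The `sys.maxunicode` check described above can be sketched roughly as follows. This is only an illustration: `build_kind` and `utf16_code_units` are helper names made up for this example, not part of Python, and the G clef character is just one convenient code point outside the BMP. Note that on Python 3.3 and later the narrow/wide distinction no longer exists (PEP 393), so `sys.maxunicode` should always be `1114111` there.

```python
import sys

# Highest code point defined by Unicode: U+10FFFF == 1114111.
MAX_CODE_POINT = 0x10FFFF


def build_kind(maxunicode=None):
    """Classify a sys.maxunicode value as a 'wide' or 'narrow' build.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "wide" if maxunicode >= MAX_CODE_POINT else "narrow"


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    print("sys.maxunicode = %d (%s build)" % (sys.maxunicode, build_kind()))

    # U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
    clef = u"\U0001D11E"
    # len() is 1 on wide builds and Python 3.3+, but 2 on narrow builds,
    # where the character is stored as two 16-bit code units.
    print("len(clef) = %d" % len(clef))
    print("UTF-16 code units = %d" % utf16_code_units(clef))
```

If `len(clef)` comes back as 2, slicing or indexing that string can split the character in half, which is the kind of problem a narrow build runs into outside the BMP.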