Consider the following exchange on IPython:
In [1]: s = u'華袞與緼
I think this has been fixed in Python 3.3. See:
http://docs.python.org/py3k/whatsnew/3.3.html
http://www.python.org/dev/peps/pep-0393/ (search for wstr_length)
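To illustrate the difference, here is a minimal sketch, assuming Python 3.3 or later. The sample character U+20000 and the variable names are my own choices, not taken from the question. It compares what len reports with the number of UTF-16 code units, which is what a narrow build would count:

```python
# Assumes Python 3.3+ (PEP 393 string representation).
s = u"\U00020000"  # one CJK ideograph outside the Basic Multilingual Plane

# len() counts code points on Python 3.3+.
code_points = len(s)

# Number of UTF-16 code units: the count a narrow build's len() reports.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)  # prints: 1 2
```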
You can shadow the built-in len function in Python (see: How does len work?) and add an if statement to it that checks for surrogate pairs, i.e. characters outside the Basic Multilingual Plane.
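A rough sketch of that idea, assuming Python 3 (where the module is called builtins and text is str; on Python 2 these would be __builtin__ and unicode). The replacement only shadows len in the module that defines it, and the pair check is my own guess at what the if statement would look like:

```python
import builtins  # named __builtin__ on Python 2


def len(obj):
    """Module-level stand-in for len() that counts a surrogate pair once."""
    n = builtins.len(obj)
    if isinstance(obj, str):
        # A high surrogate directly followed by a low surrogate was counted
        # as two characters, so subtract one for each such pair.
        for i in range(n - 1):
            if u'\ud800' <= obj[i] <= u'\udbff' and u'\udc00' <= obj[i + 1] <= u'\udfff':
                n -= 1
    return n
```

Other objects (lists, dicts, and so on) fall straight through to the built-in, so existing code in the module keeps working.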
I wrote a function to do this on Python 2:
import re

SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)
def unicodeLen(s):
    # Collapse each surrogate pair into one character, then count.
    return len(SURROGATE_PAIR.sub('.', s))
By replacing each surrogate pair with a single character, we 'fix' the len function. On strings without surrogate pairs this should be pretty efficient: since the pattern won't match, the original string will be returned without modification. It should work on wide (32-bit) Python builds, too, as the surrogate pair encoding is not used there.
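A quick way to check both claims, as a sketch that assumes Python 3.3+ (which, like a wide build, stores U+20000 as one character). The variable names and sample strings are mine. The string pair spells out the two surrogates explicitly to mimic what a narrow build stores:

```python
import re

# Same pattern as above: a high surrogate followed by a low surrogate.
SURROGATE_PAIR = re.compile(u'[\ud800-\udbff][\udc00-\udfff]', re.UNICODE)


def unicodeLen(s):
    return len(SURROGATE_PAIR.sub(u'.', s))


pair = u'\ud840\udc00'   # U+20000 written out as its UTF-16 surrogate pair
astral = u'\U00020000'   # the same character as a single code point

# The explicit pair has len() 2 but is counted once by unicodeLen.
pair_lengths = (len(pair), unicodeLen(pair))

# With no surrogates stored, the pattern does not match and the count is unchanged.
astral_lengths = (len(astral), unicodeLen(astral))
```

Here pair_lengths comes out as (2, 1) and astral_lengths as (1, 1).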