How can I check a Python unicode string to see that it *actually* is proper Unicode?

一个人的身影 2021-02-06 08:56

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up, as it gets decoded properly but when I try to save it in po

5 Answers
  •  时光取名叫无心
    2021-02-06 09:16

    There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, the same bug is also present in OS X's iconv (but not in the glibc one).

    Here's what's happening:

    Python 2.x does not recognize UTF-8-encoded surrogates [1] as invalid (which is what your byte sequence contains).

    This should be all that's needed:

    foo.decode('utf8').encode('utf8')
    

    But because of that bug, which is not being fixed in 2.x, the round trip doesn't catch encoded surrogates.
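
    One possible workaround that does not depend on the decoder is to look at the raw bytes. In UTF-8, U+D800..U+DFFF would be encoded as the byte 0xED followed by a byte in 0xA0..0xBF, and [1] excludes exactly that range. The sketch below is only an illustration: the helper name `is_strict_utf8` is made up for this example, and it assumes the input is a byte string.

```python
import re

# 0xED followed by 0xA0..0xBF is the start of an encoded surrogate
# (U+D800..U+DFFF), which RFC 3629 does not allow in UTF-8.
_ENCODED_SURROGATE = re.compile(b'\xed[\xa0-\xbf]')


def is_strict_utf8(data):
    """Hypothetical helper: True if `data` decodes as UTF-8 and holds no encoded surrogates."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        # Malformed for any other reason (bad lead byte, truncation, ...).
        return False
    # A lenient decoder may have accepted surrogates, so check the bytes too.
    return _ENCODED_SURROGATE.search(data) is None


print(is_strict_utf8(b'\xed\xbd\xbf'))            # encoded surrogate
print(is_strict_utf8(u'caf\xe9'.encode('utf-8')))  # ordinary text
```

    The byte-level check avoids inspecting the decoded string, where a narrow 2.x build also represents valid non-BMP characters as surrogate pairs.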

    Try this in Python 2.x and then in 3.x:

    b'\xed\xbd\xbf'.decode('utf8')
    

    It raises an error (correctly) in the latter. The fix is not being backported to the 2.x branch; see [2] and [3] for more information.
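
    For reference, here is roughly what the 3.x side of that experiment looks like when written out as a script. This is a sketch for Python 3 only; the variable names are illustrative. The `surrogatepass` error handler is used at the end to reproduce the lenient 2.x-style result on purpose.

```python
raw = b'\xed\xbd\xbf'  # what U+DF7F (a lone low surrogate) would look like in UTF-8

# Strict decoding: Python 3 refuses the sequence.
try:
    raw.decode('utf8')
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Opting in to the lenient behaviour with the 'surrogatepass' handler.
lenient = raw.decode('utf8', 'surrogatepass')

print(rejected)
print(hex(ord(lenient)))
```

    So on 3.x the plain round trip `foo.decode('utf8').encode('utf8')` is enough to reject this input.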

    [1] http://tools.ietf.org/html/rfc3629#section-4

    [2] http://bugs.python.org/issue9133

    [3] http://bugs.python.org/issue8271#msg102209
