How can I check a Python unicode string to see that it actually is proper Unicode?

一个人的身影 2021-02-06 08:56

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up: it gets decoded properly, but when I try to save it in po

5 answers

时光取名叫无心 (original poster)

2021-02-06 09:16
There is a bug in Python 2.x that is fixed only in Python 3.x. The same bug is present in OS X's iconv (but not in the glibc one).

Here's what's happening:

Python 2.x does not reject UTF-8-encoded surrogates [1] as invalid, and that is what your character sequence contains.

This should be all that's needed:
```
foo.decode('utf8').encode('utf8')
```
But because of that bug, which is not being fixed in 2.x, this round trip doesn't catch surrogates.

Try this in Python 2.x and then in 3.x:
```
b'\xed\xbd\xbf'.decode('utf8')
```
It will (correctly) throw an error in the latter. The fix is not being backported to the 2.x branch. See [2] and [3] for more info.
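For illustration, here is a minimal sketch of both checks, written for Python 3, where the decoder is strict. The helper names `is_strict_utf8` and `contains_surrogates` are made up for this example and are not part of any library. The second helper scans already-decoded text for code points in the surrogate range U+D800..U+DFFF, which is roughly the check the 2.x decoder skips.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python 3's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def contains_surrogates(text: str) -> bool:
    """Return True if text holds any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# The bytes from the example above encode the lone surrogate U+DF7F.
print(is_strict_utf8(b"\xed\xbd\xbf"))               # False
print(is_strict_utf8("caf\u00e9".encode("utf-8")))   # True
print(contains_surrogates("\udf7f"))                 # True
```

A scan like `contains_surrogates` could in principle be ported to 2.x as a workaround, but on a narrow 2.x build it would probably also flag legitimate characters above U+FFFF, since those are stored internally as surrogate pairs, so treat it as a starting point only.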

[1] http://tools.ietf.org/html/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209