So I have this page:
http://hub.iis.sinica.edu.tw/cytoHubba/
Apparently it's all kinds of messed up, as it gets decoded properly but when I try to save it in po
There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, the same bug exists in OS X's iconv (but not in the glibc one).
Here's what's happening:
Python 2.x does not recognize UTF-8-encoded surrogates [1] as invalid (which is what your character sequence is).
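To see why a sequence like the one in the example further down counts as a surrogate, it can help to undo the UTF-8 bit packing by hand. This is only a minimal sketch: the helper name decode_three_byte is made up for illustration, and it assumes the input is exactly one three-byte sequence.

```python
def decode_three_byte(seq):
    """Unpack one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).

    Hypothetical helper for illustration only; it does not validate its input.
    """
    b0, b1, b2 = bytearray(seq)  # bytearray yields ints on both 2.x and 3.x
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(b'\xed\xbd\xbf')
print(hex(code_point))                 # 0xdf7f
print(0xD800 <= code_point <= 0xDFFF)  # True: inside the surrogate range
```

U+D800 through U+DFFF are reserved for UTF-16 surrogates, so a strict UTF-8 decoder is expected to reject any byte sequence that unpacks to one of them.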
This should be all that's needed:
foo.decode('utf8').encode('utf8')
But because of that bug, which is not being fixed in 2.x, the round trip doesn't catch encoded surrogates.
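One possible workaround on 2.x, assuming you simply want to reject such input, is to scan the raw bytes for an encoded surrogate before decoding. This is a sketch rather than a vetted fix, and strict_utf8_decode is a hypothetical name, not something from the standard library.

```python
import re

# In well-formed UTF-8, lead byte 0xED followed by 0xA0..0xBF can only
# encode a code point in U+D800..U+DFFF, i.e. a surrogate.
_ENCODED_SURROGATE = re.compile(b'\xed[\xa0-\xbf][\x80-\xbf]')


def strict_utf8_decode(data):
    """Decode UTF-8 bytes, refusing encoded surrogates (hypothetical helper)."""
    if _ENCODED_SURROGATE.search(data):
        raise ValueError('input contains a UTF-8-encoded surrogate')
    # Any other malformed input is still reported by the codec itself.
    return data.decode('utf-8')
```

ValueError is used here because UnicodeDecodeError is a subclass of it, so a caller can catch both failure modes with a single except ValueError clause. Calling strict_utf8_decode(foo) in place of foo.decode('utf8') should then behave the same way on 2.x and 3.x for this kind of input.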
Try this in Python 2.x and then in 3.x:
b'\xed\xbd\xbf'.decode('utf8')
It will (correctly) raise an error in the latter. The fix is not being backported to the 2.x branch. See [2] and [3] for more info.
[1] http://tools.ietf.org/html/rfc3629#section-4
[2] http://bugs.python.org/issue9133
[3] http://bugs.python.org/issue8271#msg102209