How can I check a Python unicode string to see that it actually is proper Unicode?


一个人的身影

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up, as it gets decoded properly but when I try to save it in po


5 Answers

鱼传尺愫

2021-02-06 08:58

In the end, I opted to just work around this, catch the error and rollback the transaction using Django's transaction management. I'm mystified as to why it would happen, though...

滥情空心

2021-02-06 09:05
A Python unicode object is a sequence of Unicode code points and is, by definition, proper Unicode. A Python str string is a sequence of bytes that might be Unicode characters encoded with a certain encoding (UTF-8, Latin-1, Big5, ...).

The first question is whether source is a unicode object or a str string. That source.encode("utf-8") works just means that you can convert source to a UTF-8 encoded string, but are you doing that before you pass it to the database function? The database seems to expect its inputs to be encoded as UTF-8, and complains that the equivalent of source.decode("utf-8") fails.

If source is a unicode object, it should be encoded to UTF-8 before you pass it to the database:
```
source = u'abc'
call_db(source.encode('utf-8'))
```
If source is a str encoded as something other than UTF-8, you should decode it from that encoding and then encode the resulting unicode object to UTF-8:
```
source = 'abc'
call_db(source.decode('Big5').encode('utf-8'))
```
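The two cases above can be folded into one helper. This is only a minimal sketch, written for Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`. The name `to_utf8_bytes` and its `source_encoding` parameter are made up for illustration, and Big5 is just the example encoding used above.

```python
def to_utf8_bytes(source, source_encoding="big5"):
    """Return `source` as UTF-8 encoded bytes.

    `source` may be text (str) or bytes in `source_encoding`.
    Both the function name and the default encoding are assumptions.
    """
    if isinstance(source, bytes):
        # Bytes in some other encoding: decode to text first.
        source = source.decode(source_encoding)
    # Text: encode to UTF-8 before handing it to the database layer.
    return source.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8_bytes("abc"))
    print(to_utf8_bytes("中".encode("big5")))
```

If the bytes are not valid in the assumed source encoding, the `decode` call raises `UnicodeDecodeError`, which is usually preferable to storing mangled text.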

囚心锁ツ

2021-02-06 09:05

To solve my similar problems with Django/PostgreSQL, I now do something like this:

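import base64

from django.db import models

class SafeTextField(models.TextField):
    # Stores the text base64-encoded, so the database only ever sees ASCII.
    # encodestring/decodestring are the Python 2 names of these functions.
    def get_prep_value(self, value):
        encoded = base64.encodestring(value).strip()
        return super(SafeTextField, self).get_prep_value(encoded)
    def to_python(self, value):
        decoded = base64.decodestring(value)
        return super(SafeTextField, self).to_python(decoded)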


迷失自我

2021-02-06 09:13
What exactly are you doing? The content does indeed decode fine as UTF-8:
```
>>> import urllib
>>> webcontent = urllib.urlopen("http://hub.iis.sinica.edu.tw/cytoHubba/").read()
>>> unicodecontent = webcontent.decode("utf-8")
>>> type(webcontent)
<type 'str'>
>>> type(unicodecontent)
<type 'unicode'>
>>> type(unicodecontent.encode("utf-8"))
<type 'str'>
```
Make sure you understand the difference between Unicode strings and UTF-8 encoded strings, though. What you need to send to the database is unicodecontent.encode("utf-8") (which is the same as webcontent, but you decoded to verify that you don't have invalid byte sequences in your source).

As WoLpH says, I'd also check the settings on the database and the database connection.
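The "decode to verify" step can be wrapped in a small check. This is a sketch for Python 3 (where the downloaded content is a `bytes` object); the name `is_valid_utf8` is hypothetical. Note that Python 3's strict decoder also rejects UTF-8-encoded surrogates, which the Python 2 session above would not.

```python
def is_valid_utf8(data):
    """Return True if `data` (bytes) decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("héllo".encode("utf-8")))
    print(is_valid_utf8(b"\xff\xfe"))
```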
时光取名叫无心

2021-02-06 09:16
There is a bug in Python 2.x that is only fixed in Python 3.x. In fact, this bug is even in OS X's iconv (but not the glibc one).

Here's what's happening:

Python 2.x does not recognize UTF-8-encoded surrogates [1] as being invalid (which is what your character sequence is).

This should be all that's needed:
```
foo.decode('utf8').encode('utf8')
```
But thanks to that bug, which they're not fixing, it doesn't catch surrogate pairs.

Try this in Python 2.x and then in 3.x:
```
b'\xed\xbd\xbf'.decode('utf8')
```
It will throw an error (correctly) in the latter. They aren't fixing it in the 2.x branch either. See [2] and [3] for more info.
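For text that has already been decoded, surrogate code points can be located directly. The following is a hedged sketch for Python 3, not a fix for the Python 2 behaviour described above; the names `find_surrogates` and `is_encodable_utf8` are hypothetical. The `surrogatepass` error handler is used only to reproduce the lenient decoding of the byte sequence from this answer.

```python
def find_surrogates(text):
    """Return the indexes of surrogate code points (U+D800..U+DFFF) in `text`."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_encodable_utf8(text):
    """Return True if `text` can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Decode the byte sequence from the answer leniently, letting the surrogate through.
lenient = b"\xed\xbd\xbf".decode("utf-8", "surrogatepass")

if __name__ == "__main__":
    print(find_surrogates(lenient))
    print(is_encodable_utf8(lenient))
```

A string for which `find_surrogates` returns a non-empty list cannot be encoded as strict UTF-8 in Python 3, so either check is enough to reject it before it reaches the database.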

[1] http://tools.ietf.org/html/rfc3629#section-4

[2] http://bugs.python.org/issue9133

[3] http://bugs.python.org/issue8271#msg102209