How can I check a Python unicode string to see that it *actually* is proper Unicode?

一个人的身影 2021-02-06 08:56

So I have this page:

http://hub.iis.sinica.edu.tw/cytoHubba/

Apparently it's all kinds of messed up, as it gets decoded properly but when I try to save it in po

5 Answers
  •  滥情空心
    2021-02-06 09:05

    A Python unicode object is a sequence of Unicode code points and is by definition proper Unicode. A Python str string is a sequence of bytes that might be Unicode characters encoded with a certain encoding (UTF-8, Latin-1, Big5, ...).

    The first question is whether source is a unicode object or a str string. That source.encode("utf-8") works only means that you can convert source to a UTF-8 encoded string, but are you doing that before you pass it to the database function? The database seems to expect its inputs to be encoded as UTF-8, and complains that the equivalent of source.decode("utf-8") fails.
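
    To see whether a byte string would pass the kind of UTF-8 check the database appears to perform, you can try the decode yourself. This is only a sketch: is_valid_utf8 is a hypothetical helper name, not something from the question or from any database library, and it expects a byte string (a Python 2 str), not a unicode object.

```python
def is_valid_utf8(data):
    """Return True if the byte string `data` decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# The same character (U+4E2D) encoded two different ways.
utf8_bytes = u'\u4e2d'.encode('utf-8')
big5_bytes = u'\u4e2d'.encode('big5')

print(is_valid_utf8(utf8_bytes))  # the UTF-8 bytes decode fine
print(is_valid_utf8(big5_bytes))  # the Big5 bytes are not valid UTF-8
```

    A True result only says the bytes are well-formed UTF-8; it does not prove they were meant to be UTF-8.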

    If source is a unicode object, it should be encoded to UTF-8 before you pass it to the database:

    source = u'abc'
    call_db(source.encode('utf-8'))
    

    If source is a str encoded as something other than UTF-8, you should decode it from that encoding and then encode the resulting unicode object to UTF-8:

    source = 'abc'
    call_db(source.decode('Big5').encode('utf-8'))
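
    The two cases above can be folded into one helper. This is a minimal sketch under assumptions: to_utf8_bytes and its source_encoding parameter are hypothetical names, and Big5 is only a guess at the page's real encoding, which you would need to confirm yourself.

```python
def to_utf8_bytes(source, source_encoding='big5'):
    """Return `source` as UTF-8 encoded bytes.

    `source_encoding` is only used when `source` is a byte string;
    it is an assumption the caller must get right.
    """
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(source, text_type):
        # Already text: just encode it.
        return source.encode('utf-8')
    # Bytes in some other encoding: decode first, then re-encode.
    return source.decode(source_encoding).encode('utf-8')


as_text = u'\u4e2d'
as_big5 = as_text.encode('big5')

# Both inputs end up as the same UTF-8 byte string.
print(to_utf8_bytes(as_text) == to_utf8_bytes(as_big5, 'big5'))
```

    The result is what you would then hand to call_db. If the guessed encoding is wrong, the decode step either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth checking the output by eye.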
    
