Decoding if it's not unicode

遥遥无期 2021-02-13 16:07

I want my function to take an argument that could be either a unicode object or a UTF-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something

2 answers
  •  醉话见心
    2021-02-13 16:57

    You could just try decoding it with the 'utf-8' codec, and if that does not work, then return the object.

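    def myfunction(text):
        try:
            # Python 2: decode a UTF-8 byte string into a unicode object.
            text = unicode(text, 'utf-8')
        except TypeError:
            # text is already unicode ("decoding Unicode is not supported").
            pass
        # Return the result in both cases, not only when decoding failed.
        return text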
    
    print(myfunction(u'cer\xf3n'))
    # cerón
    

    When you take a unicode object and call its decode method with the 'utf-8' codec, Python 2 first tries to convert the unicode object to a byte string (a str object), and then it calls that string object's decode('utf-8') method.

    That first conversion from unicode object to string object uses the ascii codec by default in Python 2, so it fails with a UnicodeEncodeError whenever the text contains a non-ASCII character.
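
The two-step behaviour described above can be imitated in Python 3, where both steps have to be spelled out by hand. This is only a sketch: `py2_style_decode` and `try_decode` are made-up names, and the explicit `encode('ascii')` stands in for the implicit conversion that Python 2 performs.

```python
def py2_style_decode(text):
    """Imitate Python 2's unicode.decode('utf-8') using explicit steps."""
    # Step 1: the implicit conversion to a byte string, using the ASCII codec.
    raw = text.encode('ascii')
    # Step 2: the decode the caller actually asked for.
    return raw.decode('utf-8')


def try_decode(text):
    """Return the decoded text, or the name of the error raised in step 1."""
    try:
        return py2_style_decode(text)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


print(try_decode('ceron'))     # pure ASCII survives both steps
print(try_decode('cer\xf3n'))  # the accented character cannot be encoded as ASCII
```

The failure comes from the hidden encode step, not from the UTF-8 decode itself, which is why the error mentions encoding even though the caller asked for a decode.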

    So, in general, never try to decode unicode objects. If you must try, wrap the call in a try..except block. There may be a few codecs for which decoding unicode objects works in Python 2 (see below), but they have been removed in Python 3.
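
One way to follow that advice is to check the type up front instead of attempting a decode. The sketch below is written for Python 3, where `bytes` and `str` play the roles of the Python 2 byte string and unicode object. `to_text` is a hypothetical name, not something taken from the question.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only when it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: hand it back unchanged and never try to decode it.
    return value


print(to_text('cer\xf3n'))                  # text passes through untouched
print(to_text('cer\xf3n'.encode('utf-8')))  # UTF-8 bytes are decoded to text
```

Compared with the try/except approach, the explicit check makes the intent visible and avoids relying on which exception a particular Python version raises.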

    See this Python bug ticket for an interesting discussion of the issue, and also Guido van Rossum's blog:

    "We are adopting a slightly different approach to codecs: while in Python 2, codecs can accept either Unicode or 8-bits as input and produce either as output, in Py3k, encoding is always a translation from a Unicode (text) string to an array of bytes, and decoding always goes the opposite direction. This means that we had to drop a few codecs that don't fit in this model, for example rot13, base64 and bz2 (those conversions are still supported, just not through the encode/decode API)."
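
The conversions named in the quotation remain reachable in Python 3 through the standard library rather than through `str.encode` and `bytes.decode`. A brief sketch, with arbitrary sample data:

```python
import base64
import bz2
import codecs

# rot13 is a text-to-text transform, so it goes through codecs.encode.
rotated = codecs.encode('hello', 'rot_13')
print(rotated)

# base64 and bz2 are bytes-to-bytes and live in their own modules.
encoded = base64.b64encode(b'hello')
print(encoded)
print(base64.b64decode(encoded))

payload = b'hello' * 10
compressed = bz2.compress(payload)
print(bz2.decompress(compressed) == payload)
```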
