How to fix: “UnicodeDecodeError: 'ascii' codec can't decode byte”

谎友^ 2020-11-22 01:21
as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File "/usr/local/bin/wok", line 4, in
         


        
19 Answers
  • 2020-11-22 01:54

    This is the classic "unicode issue". Explaining completely what is happening is beyond the scope of a StackOverflow answer.

    It is well explained here.

    In very brief summary, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.

    The presentation I pointed you to provides advice for avoiding this. Make your code a "unicode sandwich". In Python 2, the use of from __future__ import unicode_literals helps.

    Update: how can the code be fixed:

    OK - in your variable "source" you have some bytes. It is not clear from your question how they got in there - maybe you read them from a web form? In any case, they are not encoded with ASCII, but Python is trying to convert them to unicode assuming that they are. You need to tell it explicitly what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings - for example UTF-8. You tell unicode() the encoding as a second parameter:

    source = unicode(source, 'utf-8')
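A minimal sketch of that explicit decode, assuming the bytes really are UTF-8. The sample value and the helper name `decode_source` are hypothetical. It uses `bytes.decode`, which in Python 2 does the same job as `unicode(source, 'utf-8')` and also runs on Python 3:

```python
def decode_source(source, encoding='utf-8'):
    """Return text for `source`, decoding only if it is still a byte string."""
    if isinstance(source, bytes):
        return source.decode(encoding)
    return source


raw = b'Z\xc3\xbcrich'   # hypothetical input: u'Z\xfcrich' encoded as UTF-8
text = decode_source(raw)
print(repr(text))

# Guessing the wrong codec fails loudly instead of silently corrupting data.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('ascii failed: {}'.format(exc.reason))
```

If the decode raises with the encoding you guessed, the guess was probably wrong; try the encoding the data source actually documents rather than forcing the error away.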
    
  • 2020-11-22 01:54

    I find the best is to always convert to unicode - but this is difficult to achieve because in practice you'd have to check and convert every argument to every function and method you ever write that includes some form of string processing.

    So I came up with the following approach to either guarantee unicodes or byte strings, from either input. In short, include and use the following lambdas:

    # guarantee unicode string
    _u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
    _uu = lambda *tt: tuple(_u(t) for t in tt) 
    # guarantee byte string in UTF8 encoding
    _u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
    _uu8 = lambda *tt: tuple(_u8(t) for t in tt)
    

    Examples:

    text='Some string with codes > 127, like Zürich'
    utext=u'Some string with codes > 127, like Zürich'
    print "==> with _u, _uu"
    print _u(text), type(_u(text))
    print _u(utext), type(_u(utext))
    print _uu(text, utext), type(_uu(text, utext))
    print "==> with _u8, _uu8"
    print _u8(text), type(_u8(text))
    print _u8(utext), type(_u8(utext))
    print _uu8(text, utext), type(_uu8(text, utext))
    # with % formatting, always use _u() and _uu()
    print "Some unknown input %s" % _u(text)
    print "Multiple inputs %s, %s" % _uu(text, text)
    # but with string.format be sure to always work with unicode strings
    print u"Also works with formats: {}".format(_u(text))
    print u"Also works with formats: {},{}".format(*_uu(text, text))
    # ... or, when the format string itself is a byte string, use _u8 and _uu8 so every argument is a byte string too
    print "Also works with formats: {}".format(_u8(text))
    print "Also works with formats: {},{}".format(*_uu8(text, text))
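The same idea can be sketched with named functions instead of lambdas. The names `to_text` and `to_bytes` are hypothetical and not part of the answer above; the sketch assumes UTF-8 as the default encoding and is written so that it runs on both Python 2 and Python 3:

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3


def to_text(value, encoding='utf-8'):
    """Return a text string; byte strings are decoded, other values pass through."""
    if isinstance(value, bytes):
        return value.decode(encoding, 'replace')
    return value


def to_bytes(value, encoding='utf-8'):
    """Return a byte string; text strings are encoded, other values pass through."""
    if isinstance(value, text_type):
        return value.encode(encoding, 'replace')
    return value


raw = b'Z\xc3\xbcrich'
print(repr(to_text(raw)))
print(repr(to_bytes(to_text(raw))))
```

Note that `'replace'` hides undecodable bytes behind U+FFFD, which is convenient for logging but loses data, so it is a trade-off rather than a general fix.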
    

    Here's some more reasoning about this.

  • 2020-11-22 02:00

    In short, to ensure proper unicode handling in Python 2:

    • use io.open for reading/writing files
    • use from __future__ import unicode_literals
    • configure other data inputs/outputs (e.g., databases, network) to use unicode
    • if you cannot configure outputs to use UTF-8, convert your output for them: print(text.encode('ascii', 'replace').decode())
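A short sketch of how these points fit together, assuming a UTF-8 file. The scratch file path is hypothetical, and explicit `u''` literals are used instead of `unicode_literals` so that the snippet can be pasted anywhere in a module:

```python
import io
import os
import tempfile

# Hypothetical scratch file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# io.open encodes text on the way out...
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'Z\xfcrich')

# ...and decodes it on the way in, so the program only ever sees text.
with io.open(path, 'r', encoding='utf-8') as handle:
    text = handle.read()

# Fallback for an output that can only take ASCII:
# characters that cannot be encoded become '?'.
ascii_safe = text.encode('ascii', 'replace').decode('ascii')
print(ascii_safe)   # Z?rich
```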

    For explanations, see @Alastair McCormack's detailed answer.

  • 2020-11-22 02:01

    Finally I got it:

    as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
    # encoding=utf8  
    import sys  
    
    reload(sys)  
    sys.setdefaultencoding('utf8')
    

    Let me check:

    as3:~/ngokevin-site# python
    Python 2.7.6 (default, Dec  6 2013, 14:49:02)
    [GCC 4.4.5] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.getdefaultencoding()
    'utf8'
    >>>
    

    The above shows that Python's default encoding is now utf8, and the error no longer occurs.

  • 2020-11-22 02:02

    Specify # encoding=utf-8 at the top of your Python file. It should fix the issue.

  • 2020-11-22 02:03

    tl;dr / quick fix

    • Don't decode/encode willy nilly
    • Don't assume your strings are UTF-8 encoded
    • Try to convert strings to Unicode strings as soon as possible in your code
    • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
    • Don't be tempted to use quick reload hacks

    Unicode Zen in Python 2.x - The Long Version

    Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

    UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

    In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.
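As a rough illustration of that distinction, the sketch below encodes one Unicode string with two different codecs. The sample text is an assumption chosen for the example; the point is that the code points stay the same while the bytes differ:

```python
text = u'\u4e2d\u6587'            # two code points, no encoding attached

as_utf8 = text.encode('utf-8')    # three bytes per character here
as_gbk = text.encode('gbk')       # two bytes per character here

print('{} {} {}'.format(len(text), len(as_utf8), len(as_gbk)))   # 2 6 4

# Decoding with the codec that produced the bytes recovers the same text.
assert as_utf8.decode('utf-8') == text
assert as_gbk.decode('gbk') == text
```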

    The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

    Unicode strings can be declared in your code using the u prefix to strings. E.g.

    >>> my_u = u'my ünicôdé strįng'
    >>> type(my_u)
    <type 'unicode'>
    

    Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

    Gotchas

    Conversion from str to Unicode can happen even when you don't explicitly call unicode().

    The following scenarios cause UnicodeDecodeError exceptions:

    # Explicit conversion without encoding
    unicode('€')
    
    # New style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u"The currency is: {}".format('€')
    
    # Old style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u'The currency is: %s' % '€'
    
    # Append string to Unicode
    # Python will try to convert string to Unicode first
    u'The currency is: ' + '€'         
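A sketch of what happens behind those scenarios and how to avoid it, assuming the euro sign arrived as UTF-8 bytes. The variable names are hypothetical. The explicit `decode('ascii')` stands in for the conversion Python 2 attempts implicitly, so the snippet also runs on Python 3:

```python
euro_bytes = b'\xe2\x82\xac'   # the euro sign encoded as UTF-8 (assumed encoding)

# Python 2 attempts this ASCII decode implicitly in each scenario above.
try:
    euro_bytes.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)   # True

# Decoding explicitly first makes the same operations safe.
euro = euro_bytes.decode('utf-8')
print(repr(u'The currency is: {}'.format(euro)))
print(repr(u'The currency is: %s' % euro))
print(repr(u'The currency is: ' + euro))
```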
    

    Examples

    In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be its Unicode code point value; this is no coincidence). The correct decode() is invoked and conversion to a Python Unicode string is successful:

    In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
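The byte values described for both cases can be checked with a short sketch. The variable names are hypothetical, and the word is written with an escape so the snippet does not depend on the encoding of the source file:

```python
word = u'caf\xe9'   # "cafe" with U+00E9 (e-acute) as the last character

utf8_bytes = word.encode('utf-8')      # e-acute takes two bytes: 0xC3 0xA9
cp1252_bytes = word.encode('cp1252')   # e-acute is one byte: 0xE9

print(repr(utf8_bytes))
print(repr(cp1252_bytes))

# The matching codec recovers the original text in both cases.
assert utf8_bytes.decode('utf-8') == word
assert cp1252_bytes.decode('cp1252') == word

# ASCII cannot represent any byte above 0x7F, so this decode is rejected.
try:
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print('failed at byte offset {}'.format(exc.start))   # offset 3
```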

    The Unicode Sandwich

    It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
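A minimal sketch of such a sandwich, assuming UTF-8 on both sides. The function names `shout` and `handle` are hypothetical and only serve to mark the three layers:

```python
def shout(text):
    """Middle of the sandwich: works on text only, never on bytes."""
    return text.upper()


def handle(raw_bytes, in_encoding='utf-8', out_encoding='utf-8'):
    text = raw_bytes.decode(in_encoding)   # decode on the way in
    result = shout(text)                   # pure text processing
    return result.encode(out_encoding)     # encode on the way out


print(repr(handle(b'z\xc3\xbcrich')))
```

Because the middle layer never touches bytes, changing the input or output encoding only affects the two boundary lines.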

    Input / Decode

    Source code

    If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

    u'Zürich'
    

    To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:

    # encoding: utf-8
    

    This is only necessary when you have non-ASCII in your source code.

    Files

    Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

    import io
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        my_unicode_string = my_file.read()
    

    my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.
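To see what a wrong encoding value looks like in practice, here is a small sketch. The scratch file and its cp1252 content are assumptions made for the example:

```python
import io
import os
import tempfile

# Hypothetical file written as cp1252, where e-acute is the single byte 0xE9.
path = os.path.join(tempfile.mkdtemp(), 'cp1252_file.txt')
with io.open(path, 'w', encoding='cp1252') as out_file:
    out_file.write(u'caf\xe9')

# Reading with the wrong encoding: a lone 0xE9 is not valid UTF-8.
try:
    with io.open(path, 'r', encoding='utf-8') as in_file:
        in_file.read()
    wrong_ok = True
except UnicodeDecodeError:
    wrong_ok = False

# Reading with the encoding the file was written in succeeds.
with io.open(path, 'r', encoding='cp1252') as in_file:
    right = in_file.read()

print('{} {!r}'.format(wrong_ok, right))
```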

    CSV Files

    The Python 2.7 CSV module does not support non-ASCII characters
