as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File "/usr/local/bin/wok", line 4, in
reload
hacks

Without seeing the source, it's difficult to know the root cause, so I'll have to speak generally.
UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII bytes to a Unicode string without specifying the encoding of the original string.
In brief, Unicode strings are an entirely separate type of Python string that does not carry any encoding. They hold only Unicode code points and can therefore hold any code point from across the entire range. A str contains encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. A str is decoded to Unicode, and Unicode is encoded to a str. Files and text data are always transferred as encoded strings.
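As a minimal sketch of that decode/encode relationship (the sample text and variable names are mine, and b'' literals are used so the snippet runs on both Python 2.7 and 3.x):

```python
# Two code points (the Chinese word for "Chinese language"), no encoding attached
text = u'\u4e2d\u6587'

# Encoding turns the code points into bytes; the bytes depend on the codec
utf8_bytes = text.encode('utf-8')   # 3 bytes per character here
gbk_bytes = text.encode('gbk')      # 2 bytes per character here

# Decoding with the matching codec recovers the same Unicode string
from_utf8 = utf8_bytes.decode('utf-8')
from_gbk = gbk_bytes.decode('gbk')
```

The same text produces different bytes under each codec, which is why the byte string alone is not enough: you also need to know which encoding produced it.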
The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.
Unicode strings can be declared in your code using the u prefix on string literals. E.g.
>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>
Unicode strings may also come from files, databases and network modules. When a module hands you Unicode strings, you don't need to worry about the encoding.
Conversion from str to Unicode can happen even when you don't explicitly call unicode().
The following scenarios cause UnicodeDecodeError exceptions:
# Explicit conversion without encoding
unicode('€')
# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')
# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'
# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'
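For contrast, here is a sketch of the same three operations with the byte string decoded explicitly first. It assumes the bytes are UTF-8 (the euro sign is written as escaped bytes so the snippet does not depend on the source file's encoding) and runs on Python 2.7 and 3.x:

```python
# The euro sign as UTF-8 bytes; UTF-8 is an assumption about the data's origin
euro_bytes = b'\xe2\x82\xac'

# Decode once, explicitly, naming the encoding
euro = euro_bytes.decode('utf-8')

# All three operations now combine Unicode with Unicode, so no implicit
# ASCII decode takes place
formatted = u'The currency is: {}'.format(euro)
percent = u'The currency is: %s' % euro
joined = u'The currency is: ' + euro
```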
In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; it's no coincidence). The correct decode() is invoked and conversion to a Python Unicode string is successful:
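The byte values described here can be checked directly. A small sketch (variable names are mine; runs on Python 2.7 and 3.x):

```python
cafe = u'caf\xe9'  # "café" as a Unicode string; é is code point U+00E9

as_utf8 = cafe.encode('utf-8')      # é becomes the two bytes 0xC3 0xA9
as_cp1252 = cafe.encode('cp1252')   # é becomes the single byte 0xE9

# Decoding each byte string with its own codec gives back the same text
back_from_utf8 = as_utf8.decode('utf-8')
back_from_cp1252 = as_cp1252.decode('cp1252')
```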
In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
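The failure can be reproduced and inspected in a few lines. This is only an illustrative sketch: first_bad_byte is a hypothetical helper of mine, relying on the start, end and object attributes that UnicodeDecodeError carries. It runs on Python 2.7 and 3.x:

```python
def first_bad_byte(raw):
    """Return (index, byte) of the first non-ASCII byte, or None if all ASCII."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as exc:
        # exc.start/exc.end delimit the offending bytes inside exc.object
        return exc.start, exc.object[exc.start:exc.end]
    return None

utf8_cafe = b'caf\xc3\xa9'   # "café" encoded as UTF-8
result = first_bad_byte(utf8_cafe)
```

Here the decode stops at index 3, the first byte above 0x7F, which matches the position reported in the exception message.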
It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicode, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
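A minimal sketch of the sandwich, using a hypothetical shout() function of mine as the "filling" (runs on Python 2.7 and 3.x):

```python
def shout(raw, encoding='utf-8'):
    """Upper-case encoded text, touching bytes only at the edges."""
    text = raw.decode(encoding)    # bottom slice: bytes -> Unicode on the way in
    text = text.upper()            # filling: all the real work is done on Unicode
    return text.encode(encoding)   # top slice: Unicode -> bytes on the way out

utf8_result = shout(b'z\xc3\xbcrich')               # UTF-8 input
cp1252_result = shout(b'z\xfcrich', 'cp1252')       # same text, Cp1252 input
```

The middle line never needs to know which encoding was used, so the same function handles both inputs.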
If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.
u'Zürich'
To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:
# encoding: utf-8
This is only necessary when you have non-ASCII in your source code.
Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()
my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.
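To see how the choice of encoding plays out, here is a sketch that writes a small UTF-8 file to a temporary path and reads it back three ways. The read_text helper and the sample content are mine, for illustration only (runs on Python 2.7 and 3.x):

```python
import io
import os
import tempfile

def read_text(path, encoding):
    """Read a whole file as a Unicode string using the given encoding."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()

# Create a throwaway UTF-8 file containing "Zürich"
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'Z\xfcrich')

try:
    good = read_text(path, 'utf-8')       # correct encoding
    garbled = read_text(path, 'cp1252')   # wrong encoding, but no exception
    try:
        read_text(path, 'ascii')          # wrong encoding, raises
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True
finally:
    os.remove(path)
```

Note that a wrong encoding does not always raise: reading the file as Cp1252 succeeds but yields garbled text, so a clean read() is not proof that the encoding was right.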
The Python 2.7 CSV module does not support non-ASCII characters