Question
Handling non-ASCII characters in Python is really confusing. Can anyone explain?
I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.
I have a list of characters:
ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')
For each token I get, I replace any of those characters in the token with a space by calling
for punc in ignorelist:
    token = token.replace(punc, ' ')
Notice that there is a non-ASCII character at the end of ignorelist: u'—'
Every time my code encounters that character, it crashes and says:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
I tried to declare the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still doesn't work. Does anyone know why? Thanks!
Answer 1:
You are using Python 2.x, which tries to convert automatically between unicode objects and plain str objects, but that conversion often fails with non-ASCII characters.
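To see where the error message comes from, here is a minimal sketch of the conversion Python 2 attempts behind the scenes when a byte string meets a unicode string. The explicit decode('ascii') call is my stand-in for that implicit step, and raw_token is a made-up example value, not something from the question.

```python
# -*- coding: utf-8 -*-
# The UTF-8 encoding of the em dash (U+2014) is three bytes, starting with 0xe2.
raw_token = b'foo\xe2\x80\x94bar'

# Stand-in for the implicit conversion: decode the byte string as ASCII.
try:
    raw_token.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The byte 0xe2 is outside the ASCII range, so the decode fails at the first byte of the dash, which matches the error in the question.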
You shouldn't mix unicode and str objects. You can either stick to unicode:
ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')
if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')
or use only plain str objects (note the last one):
ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change
Because you manually encode u'—' into a str, Python doesn't need to attempt the conversion itself.
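As a rough illustration of the byte-string route, the sketch below shows what the encoded dash looks like and that replacing it in a byte string works without any implicit conversion. The token value is invented for the example, and I assume the input is UTF-8.

```python
# -*- coding: utf-8 -*-
# Encoding the em dash by hand gives its three UTF-8 bytes.
em_dash_bytes = u'\u2014'.encode('utf-8')

# A byte string such as you would get from reading a UTF-8 file in binary mode.
token = b'foo\xe2\x80\x94bar'

# Both operands are byte strings, so no implicit decode is triggered.
cleaned = token.replace(em_dash_bytes, b' ')

print(repr(cleaned))
```

Note that the three-byte sequence is replaced by a single space, so the result is shorter in bytes than the input.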
I suggest you use unicode throughout your program to avoid this kind of error. If that would be too much work, you can use the latter method. However, take care when you call functions in the standard library or third-party modules, which may expect or return the other string type.
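One way to keep everything in unicode is to decode once at the boundary and only work with text after that. The sketch below is just one possible shape for this; the helper names to_text and blank_punctuation and the IGNORED constant are hypothetical, and UTF-8 input is an assumption.

```python
# -*- coding: utf-8 -*-
# Same characters as ignorelist, written as one unicode string.
IGNORED = u'!-_(),.:;"\'?#@$^&*+={}[]\\|<>/\u2014'


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings first."""
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2
        return value.decode(encoding)
    return value


def blank_punctuation(token):
    """Replace every character listed in IGNORED with a space."""
    text = to_text(token)
    for punc in IGNORED:
        text = text.replace(punc, u' ')
    return text


result = blank_punctuation(b'well\xe2\x80\x94done!')
print(repr(result))
```

Because the token is decoded before the loop runs, every replace call sees unicode on both sides.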
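The declaration # -*- coding: utf-8 -*- only tells Python that your source code is written in UTF-8 (without it, you get a SyntaxError). It does not change how str and unicode values are converted at runtime.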
Answer 2:
Your file input is not being decoded as UTF-8. When you hit that unicode character, the comparison fails because Python treats your input as ASCII.
Try reading the file with this instead:
import codecs
f = codecs.open("test", "r", "utf-8")
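For a version you can run end to end, here is a small sketch that swaps in io.open, which also decodes while reading and is the same function as the built-in open on Python 3. The temporary file and its contents are made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Write a small UTF-8 file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'well\u2014done\n')

# io.open decodes while reading, so every line is already unicode text.
with io.open(path, 'r', encoding='utf-8') as f:
    tokens = [line.strip().replace(u'\u2014', u' ') for line in f]

print(tokens)
```

Since each line arrives as unicode, replacing u'—' no longer forces Python to guess an encoding.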
Source: https://stackoverflow.com/questions/15737048/handle-non-ascii-code-string-in-python