Handling non-ASCII strings in Python

Posted by 穿精又带淫゛_ on 2019-12-08 06:09:39

Question


Handling non-ASCII characters in Python is really confusing. Can anyone explain?

I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.

I have a list of characters:

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

For each token I get, I replace every character from that list with a space by calling:

    for punc in ignorelist:
        token = token.replace(punc, ' ')

Notice that there is a non-ASCII character at the end of ignorelist: u'—'

Every time my code encounters that character, it crashes and says:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried to declare the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still does not work. Does anyone know why? Thanks!


Answer 1:


You are using Python 2.x, which tries to convert automatically between unicode objects and plain str objects (byte strings). That implicit conversion uses the ASCII codec, so it fails as soon as non-ASCII characters are involved.

You shouldn't mix unicode and str objects. You can either stick to unicode:
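The failure can be reproduced in isolation. The sketch below is only an illustration: the explicit .decode('ascii') call stands in for the conversion Python 2 attempts implicitly, and raw_token is a made-up byte string containing the UTF-8 bytes of the em dash.

```python
# Hypothetical token as it might arrive from a file: raw bytes, not text.
# \xe2\x80\x94 is the UTF-8 encoding of the em dash (U+2014).
raw_token = b'foo\xe2\x80\x94bar'

try:
    # Stand-in for the implicit str -> unicode coercion in Python 2.
    raw_token.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message names byte 0xe2, the first byte of the encoded em dash, which matches the error in the question.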

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')

if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')

or use only plain str objects (note the last entry):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change

Because you manually encode u'—' into a str, Python no longer needs to attempt the conversion itself.
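A minimal sketch of the byte-string variant follows. It assumes the tokens are UTF-8 encoded bytes, uses a shortened ignore list for brevity, and writes the entries as b'' literals so the same code also runs on Python 3. The names IGNORELIST_BYTES and replace_punctuation_bytes are made up for this illustration.

```python
# Shortened, illustrative ignore list: every entry is a byte string.
# The em dash is encoded to its UTF-8 bytes up front.
IGNORELIST_BYTES = (b'!', b'-', b',', b'.', u'\u2014'.encode('utf-8'))


def replace_punctuation_bytes(token):
    """Replace each listed byte sequence in a byte-string token with a space."""
    for punc in IGNORELIST_BYTES:
        token = token.replace(punc, b' ')
    return token


result = replace_punctuation_bytes(b'foo\xe2\x80\x94bar, baz.')
print(repr(result))
```

Note that the three-byte em dash collapses to a single space byte, so byte offsets and lengths change after the replacement.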

I suggest you use unicode throughout your program to avoid this kind of error. If that would be too much work, you can use the latter method. However, take care when you call functions in the standard library or third-party modules, since they may expect or return the other string type.
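The unicode-everywhere approach can be wrapped in a small helper. This is a sketch under assumptions, not the answerer's code: the function name replace_punctuation, the encoding parameter, and the text_type alias are introduced here for illustration, and byte input is assumed to be UTF-8 unless stated otherwise.

```python
# text_type is unicode on Python 2 and str on Python 3 (illustrative alias).
text_type = type(u'')

IGNORELIST = (
    u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'',
    u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}',
    u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'\u2014',
)


def replace_punctuation(token, encoding='utf-8'):
    """Return token as text with every ignored character replaced by a space."""
    if not isinstance(token, text_type):
        # Decode once at the boundary; everything after this is text.
        token = token.decode(encoding)
    for punc in IGNORELIST:
        token = token.replace(punc, u' ')
    return token


print(replace_punctuation(b'foo\xe2\x80\x94bar'))
print(replace_punctuation(u'a,b.c'))
```

Decoding at the point where data enters the program means the rest of the code only ever sees text, so no implicit conversion is triggered.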

# -*- coding: utf-8 -*- only tells Python that your source file is written in UTF-8 (without it you would get a SyntaxError); it does not change how strings are converted at run time.
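As a small illustration of what the declaration does cover, the sketch below relies on it only to let the interpreter read the literal em dash in the source. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
em_dash = u'—'                     # literal typed directly in the source file
encoded = em_dash.encode('utf-8')  # byte-string form of the same character

print(em_dash == u'\u2014')
print(len(em_dash), len(encoded))
```

The text form is one character while the encoded form is three bytes beginning with 0xe2, which is the byte reported in the error message.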




Answer 2:


Your file input is not being decoded as UTF-8. So when you hit that unicode character, the comparison fails because Python treats your input as ASCII.

Try reading the file like this instead:

import codecs
f = codecs.open("test", "r", "utf-8")
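A self-contained sketch of decoding at read time is shown below. It swaps in io.open, which also accepts an encoding argument and is available on Python 2.6+ and Python 3, in place of codecs.open. The temporary file and its contents are made up so that the example can run on its own.

```python
import io
import os
import tempfile

# Create a small UTF-8 file (hypothetical sample data for this sketch).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(u'foo\u2014bar\n'.encode('utf-8'))

# Reading with an explicit encoding yields text, not raw bytes.
with io.open(path, 'r', encoding='utf-8') as f:
    line = f.readline()

os.remove(path)

cleaned = line.strip().replace(u'\u2014', u' ')
print(cleaned)
```

Because line is already text, replacing u'\u2014' involves no implicit conversion.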


Source: https://stackoverflow.com/questions/15737048/handle-non-ascii-code-string-in-python
