Reliable way of handling non-ASCII characters in Python?

Question

I have a column in a spreadsheet whose header contains non-ASCII characters, thus:

'ï»¿Campaign'

If I pop this string into the interpreter, I get:

'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

The string is one of the keys in the rows of a csv.DictReader().

When I try to populate a new dict with the value of this key:

spends['ï»¿Campaign'] = 2

I get:

KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'

Obviously then I can just update my program to access this key thus:

spends['\xef\xbb\xbfCampaign']

But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any and all non-ASCII characters that may arise?

Answer 1:

In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
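The decode-early, encode-late pattern can be sketched with io.open(), which behaves the same on Python 2 and Python 3. The helper names read_text and write_text are hypothetical, and the sketch assumes the input file is UTF-8, possibly with a leading BOM (the "utf-8-sig" codec drops the BOM if it is present).

```python
import io


def read_text(path, encoding="utf-8-sig"):
    """Decode on input: return the file's contents as Unicode text.

    "utf-8-sig" decodes UTF-8 and silently drops a leading BOM.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding="utf-8"):
    """Encode on output: write Unicode text as encoded bytes."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)
```

Between these two calls the program only ever handles Unicode text, so dictionary keys such as u'Campaign' compare equal no matter how the file was encoded on disk.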

Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create an analogous wrapper for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.

Answer 2:
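One possible shape for such a csv.DictReader analog is sketched below. The function name read_unicode_rows is hypothetical, and the sketch assumes a UTF-8 file in which every row has the same number of fields as the header. On Python 2 it feeds bytes to csv and decodes each key and value afterwards; on Python 3 it lets io.open() do the decoding.

```python
import csv
import io
import sys


def read_unicode_rows(path, encoding="utf-8-sig"):
    """Yield each CSV row as a dict with Unicode keys and values."""
    if sys.version_info[0] == 2:
        # Python 2: csv only handles bytestrings, so read bytes and
        # decode afterwards. "utf-8-sig" also strips the BOM that is
        # glued to the first header name.
        with open(path, "rb") as f:
            for row in csv.DictReader(f):
                yield dict(
                    (key.decode(encoding), value.decode(encoding))
                    for key, value in row.items()
                )
    else:
        # Python 3: csv handles text, so decode while reading.
        with io.open(path, "r", encoding=encoding, newline="") as f:
            for row in csv.DictReader(f):
                yield dict(row)
```

With this in place, the lookup from the question becomes row[u'Campaign'] instead of row['\xef\xbb\xbfCampaign'].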

Your specific problem is the first three bytes of the file, "\xef\xbb\xbf". That's the UTF-8 encoding of the byte order mark (BOM), which is often prepended to text files to indicate they're encoded using UTF-8. You should strip these bytes. See Removing BOM from gzip'ed CSV in Python.
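A minimal sketch of stripping those bytes by hand, using the codecs.BOM_UTF8 constant from the standard library. The helper name strip_utf8_bom is hypothetical, and it assumes the input is a bytestring (str on Python 2, bytes on Python 3).

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Applied to the first line of the file before it reaches csv, this turns '\xef\xbb\xbfCampaign' into 'Campaign'; input without a BOM is returned unchanged.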

Second, you're decoding with the wrong codec. "ï»¿" is what you get if you decode those bytes using the Windows-1252 character set. That's why the bytes look different if you use these characters in a source file: saved as UTF-8, each of the three mis-decoded characters takes two bytes, which gives '\xc3\xaf\xc2\xbb\xc2\xbf'. See the Python 2 Unicode howto.

Source: https://stackoverflow.com/questions/31276483/reliable-way-of-handling-non-ascii-characters-in-python

Tags
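The round trip that produces both byte sequences from the question can be reproduced in a few lines. The variable names are illustrative only; the byte values are the ones quoted in the question.

```python
# The header bytes as they are stored in the file: UTF-8 BOM + "Campaign".
raw = b"\xef\xbb\xbfCampaign"

# Wrong codec: Windows-1252 maps each BOM byte to its own character.
as_cp1252 = raw.decode("cp1252")        # u'\xef\xbb\xbfCampaign', shown as ï»¿Campaign

# Right codec: the three bytes collapse into one character, U+FEFF.
as_utf8 = raw.decode("utf-8")           # u'\ufeffCampaign'

# Typing the mis-decoded text into a UTF-8 source file re-encodes each
# character as two bytes, which is the key the interpreter echoed back.
reencoded = as_cp1252.encode("utf-8")   # b'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

# "utf-8-sig" decodes correctly and drops the BOM in one step.
clean = raw.decode("utf-8-sig")         # u'Campaign'
```

Because reencoded and raw are different bytestrings, a dict keyed by one cannot be looked up with the other, which is consistent with the KeyError in the question.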

python

python-2.7

unicode

character-encoding

non-ascii-characters