Why does chardet say my UTF-8-encoded string (originally decoded from ISO-8859-1) is ASCII?

Question

I'm trying to convert ASCII characters to UTF-8. The little example below still reports ASCII:

import chardet
# chunk: a byte string read earlier (not shown)
chunk = chunk.decode('ISO-8859-1').encode('UTF-8')
print(chardet.detect(chunk[0:2000]))

It returns:

{'confidence': 1.0, 'encoding': 'ascii'}

How come?

Answer 1:

Quoting from Python's documentation:

UTF-8 has several convenient properties:

It can handle any Unicode code point.

A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.

A string of ASCII text is also a valid UTF-8 text.

All ASCII texts are also valid UTF-8 texts. (UTF-8 is a superset of ASCII)

To make it clear, check out this console session:

>>> s = 'test'
>>> s.encode('ascii') == s.encode('utf-8')
True
>>>

However, not every UTF-8-encoded string is a valid ASCII string:

>>> foreign_string = u"éâô"
>>> foreign_string.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\xb4'
>>> foreign_string.encode('ascii')  # This fails: these characters cannot be represented in ASCII

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    foreign_string.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>

So chardet is still right. Only when the data contains at least one non-ASCII character can chardet tell that it is not ASCII-encoded.
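To tie this back to the question, here is a minimal sketch of the same decode/encode round trip. It is written for Python 3 (the question uses Python 2), and the helper name latin1_to_utf8 and the sample byte strings are made up for illustration:

```python
# Python 3 sketch; the sample data and helper name are hypothetical.
ascii_only = b"plain text"   # every byte is below 0x80
with_accent = b"caf\xe9"     # 0xE9 is 'é' in ISO-8859-1

def latin1_to_utf8(data):
    """Re-encode ISO-8859-1 bytes as UTF-8 (same steps as in the question)."""
    return data.decode("iso-8859-1").encode("utf-8")

converted_ascii = latin1_to_utf8(ascii_only)
converted_accent = latin1_to_utf8(with_accent)

print(converted_ascii == ascii_only)  # True: pure ASCII bytes are unchanged
print(converted_accent)               # b'caf\xc3\xa9': 'é' became two bytes
```

When the input holds only ASCII bytes, the conversion produces exactly the same bytes, so a detector has nothing to distinguish the result from plain ASCII.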

Hope this simple explanation helps!

Answer 2:

UTF-8 is a superset of ASCII. This means that every valid ASCII file (one that uses only the first 128 characters, not the extended characters) is also a valid UTF-8 file. Since the encoding is not stored explicitly but guessed each time, the detector defaults to the simpler character set. However, if you were to encode anything beyond the basic 128 characters (foreign text and such) in UTF-8, the detector would very likely guess UTF-8.

Answer 3:

This is the reason why you got ASCII:

https://github.com/erikrose/chardet/blob/master/chardet/universaldetector.py#L135

If every character in the sequence is an ASCII symbol, chardet considers the string's encoding to be ASCII.
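The idea can be sketched roughly as follows. This is a simplified, hypothetical guess_encoding function written for Python 3, not chardet's actual code; it only illustrates the "all bytes are ASCII, so report ASCII" rule:

```python
def guess_encoding(data):
    """Hypothetical, simplified guess; NOT chardet's real implementation."""
    # If no byte has the high bit set, nothing distinguishes the data from ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Otherwise, check whether the bytes form valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

print(guess_encoding(b"test"))                   # ascii
print(guess_encoding("éâô".encode("utf-8")))     # utf-8
```

A real detector weighs many more candidate encodings, but the first check is enough to explain the result in the question.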

N.B.

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.
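That property is easy to check directly. A small sketch, assuming Python 3 (the variable names are only illustrative):

```python
# For each of the first 128 code points, UTF-8 yields one octet with the same value.
single_octet = all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))

# ...and that octet is identical to the ASCII encoding of the same character.
same_as_ascii = all(
    chr(i).encode("utf-8") == chr(i).encode("ascii") for i in range(128)
)

# The first code point past ASCII already needs two octets in UTF-8.
first_non_ascii = chr(128).encode("utf-8")

print(single_octet, same_as_ascii)  # True True
print(len(first_non_ascii))         # 2
```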

Source: https://stackoverflow.com/questions/19652939/why-does-chardet-say-my-utf-8-encoded-string-originally-decoded-from-iso-8859-1

Tags

python

encoding

utf-8

ascii

decoding