Python: Converting from ISO-8859-1/latin1 to UTF-8

Anonymous (unverified), submitted 2019-12-03 02:05:01

Question:

>>> apple = "\xC4pple"
>>> apple
'\xc4pple'
>>> apple.encode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

What should I do?

Answer 1:

Try decoding it first, then encoding:

apple.decode('iso-8859-1').encode('utf8') 


Answer 2:

This is a common problem, so here's a relatively thorough illustration.

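For non-unicode strings (i.e. those without the u prefix, unlike u'\xc4pple'), one must first decode from the encoding the bytes are actually in (iso8859-1/latin1 here) to unicode, then encode to a character set that can represent the characters you wish; in this case I'd recommend UTF-8. Without an explicit decode, Python 2 falls back on its default encoding, which is ASCII unless modified with the enigmatic sys.setdefaultencoding function; that is where the 'ascii' codec error in the question comes from.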

First, here is a handy utility function that'll help illuminate the patterns of Python 2.7 string and unicode:

>>> def tell_me_about(s): return (type(s), s) 

A plain string

>>> v = "\xC4pple" # iso-8859-1 aka latin1 encoded string

>>> tell_me_about(v)
(<type 'str'>, '\xc4pple')

>>> v
'\xc4pple'        # representation in memory

>>> print v
?pple             # print writes the raw bytes to the terminal unchanged;
                  # a UTF-8 terminal cannot decode the lone byte 0xc4,
                  # so it is shown as "?".

Decoding an iso8859-1 string - convert plain string to unicode
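As a rough illustration of the decoding step, here is a minimal sketch. It assumes Python 3, where the plain string above has to be written as a bytes literal and the result of decoding is a str rather than a Python 2 unicode object. The variable names are only for illustration.

```python
# Python 3 sketch: bytes -> text using the ISO-8859-1 (latin1) codec.
raw = b"\xc4pple"                # the question's bytes, as a bytes literal

text = raw.decode("iso-8859-1")  # decoding yields text (str in Python 3)

# latin1 maps every byte to the code point with the same number,
# so byte 0xC4 becomes U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS).
first_code_point = ord(text[0])

print(type(text).__name__, ascii(text), hex(first_code_point))
```

Because of that one-to-one mapping, decoding with latin1 never raises an error, even when the bytes were really in some other encoding.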

Encoding to UTF

>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'    # convert iso-8859-1 to unicode to utf-8

>>> tell_me_about(u8)
(<type 'str'>, '\xc3\x84pple')

>>> u16 = v.decode('iso-8859-1').encode('utf-16')
>>> tell_me_about(u16)
(<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')

>>> tell_me_about(u8.decode('utf8'))
(<type 'unicode'>, u'\xc4pple')

>>> tell_me_about(u16.decode('utf16'))
(<type 'unicode'>, u'\xc4pple')

Relationship between unicode and UTF and latin1
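To make the relationship concrete, here is a small sketch, again assuming Python 3 (so the text is a str and each encoded form is a bytes object). It shows one piece of text producing three different byte sequences, all of which decode back to the same text. The names are illustrative only.

```python
# Python 3 sketch: one text value, three different byte encodings.
text = "\u00c4pple"                  # "Äpple" as text

latin1 = text.encode("iso-8859-1")   # one byte per character
utf8 = text.encode("utf-8")          # two bytes for the first character
utf16 = text.encode("utf-16")        # 2-byte byte-order mark + 2 bytes per character

# Same text, different bytes and different lengths.
print(len(latin1), len(utf8), len(utf16))

# Decoding each with its own codec gives back identical text.
same_text = (
    latin1.decode("iso-8859-1")
    == utf8.decode("utf-8")
    == utf16.decode("utf-16")
)
print(same_text)
```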

Unicode Exceptions

>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

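These errors arise because calling .encode on a str makes Python 2 first decode it implicitly with the ascii codec, which fails on any byte above 0x7f. One would get around them by first converting from the specific encoding (latin-1, utf8, utf16) to unicode, e.g. u8.decode('utf8').encode('latin1').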
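The decode-then-encode pattern can be wrapped in a small helper. The sketch below assumes Python 3, and transcode is a hypothetical name of my own, not a standard library function. It also shows that naming the wrong source codec still raises an exception, so the source encoding has to be known rather than guessed.

```python
# Python 3 sketch; `transcode` is a hypothetical helper, not a library API.
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode `data` from `source`, then encode the resulting text as `target`."""
    return data.decode(source).encode(target)


utf8 = b"\xc3\x84pple"
latin1 = transcode(utf8, "utf-8", "iso-8859-1")
print(latin1)

# Naming the wrong source codec fails instead of silently guessing:
# the latin1 byte 0xc4 followed by "p" is not a valid UTF-8 sequence.
try:
    transcode(b"\xc4pple", "utf-8", "iso-8859-1")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)
```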

So perhaps one could draw the following principles and generalizations:

  • a str is a sequence of bytes, which may be in any one of a number of encodings such as Latin-1, UTF-8, and UTF-16
  • a unicode is a sequence of code points (abstract characters) that can be encoded to any number of encodings, most commonly UTF-8 and latin-1 (iso8859-1)
  • the print command has its own logic for encoding: unicode objects are encoded with sys.stdout.encoding (taken from the terminal's locale, commonly UTF-8), while str objects are written as raw bytes
  • One must decode a str to unicode before converting to another encoding.

Of course, all of this changes in Python 3.x.
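As a brief sketch of what changes, assuming Python 3: bytes and text are separate types there, bytes can only be decoded, str can only be encoded, and the two are never mixed implicitly. So the implicit ascii step behind the errors above no longer happens.

```python
# Python 3 sketch: bytes and str are strictly separate types.
raw = b"\xc4pple"                 # bytes (what Python 2 called str)
text = raw.decode("iso-8859-1")   # str (what Python 2 called unicode)

bytes_can_encode = hasattr(raw, "encode")  # bytes has .decode only
str_can_decode = hasattr(text, "decode")   # str has .encode only

# Mixing the two raises TypeError instead of attempting an ascii decode.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(bytes_can_encode, str_can_decode, mixed_ok)
```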

Hope that is illuminating.

Further reading

See the very illustrative rants on this topic by Armin Ronacher.



Answer 3:

Decode to Unicode, encode the results to UTF8.

apple.decode('latin1').encode('utf8')



Answer 4:

For Python 3:

bytes(apple,'iso-8859-1').decode('utf-8') 

I used this for text that was really utf-8 but had been incorrectly decoded as iso-8859-1, so that words displayed garbled. This code produces the correct version.
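Here is a small sketch of that repair, assuming Python 3 and using the "Äpple" example from the question as stand-in data (it is not the text this answer originally dealt with). Re-encoding the garbled str as iso-8859-1 recovers the original bytes, which can then be decoded correctly as utf-8.

```python
# Python 3 sketch: undo a wrong latin1 decode of UTF-8 bytes.
# "Äpple" in UTF-8 is b"\xc3\x84pple"; read as latin1 it becomes this str:
garbled = "\u00c3\u0084pple"

original_bytes = garbled.encode("iso-8859-1")  # same as bytes(garbled, "iso-8859-1")
fixed = original_bytes.decode("utf-8")

print(ascii(garbled), "->", ascii(fixed))
```

This only works when the garbled text really came from utf-8 bytes read as iso-8859-1. For other combinations the final decode may raise UnicodeDecodeError.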



Answer 5:

concept = concept.encode('ascii', 'ignore')
concept = MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())

I do this. I am not sure whether it is a good approach, but it works every time!


