Python - Unicode to ASCII conversion

夕颜 2020-11-29 08:30

I am unable to convert the following Unicode to ASCII without losing data:

u'ABRA\xc3O JOS\xc9'

I tried encode and d

2 answers
  • 2020-11-29 09:05

    I needed to calculate the MD5 hash of a Unicode string received in an HTTP request. MD5 raised a UnicodeEncodeError, and Python's built-in encoding methods didn't help, because they replace the non-ASCII characters in the string with the corresponding hex escapes and thereby change the MD5 hash. So I came up with the following code, which keeps the string intact while converting it from unicode:

    unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
    

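    This turns the unicode object into a plain byte string and keeps all the data intact. Note that in Python 2 chr() only accepts values below 256, so this works only for characters in the Latin-1 range, where it produces the same bytes as unicode_string.encode('latin-1').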
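
    For comparison, here is a minimal sketch of the more usual route: encode the text with an explicitly chosen codec, then hash the resulting bytes. It is written for Python 3, which is an assumption (the code above is Python 2), and the helper name md5_hex is made up for illustration. It also shows that the digest depends on the codec, so the sender and the receiver have to agree on one.

```python
import hashlib


def md5_hex(text, encoding="utf-8"):
    """Return the MD5 hex digest of text encoded with the given codec.

    Hypothetical helper, not part of the original answer.
    """
    # hashlib works on bytes, so the text must be encoded explicitly first.
    return hashlib.md5(text.encode(encoding)).hexdigest()


s = "ABRA\xc3O JOS\xc9"

# Python 3 counterpart of the chr(ord(x)) trick: one byte per code point,
# which only works while every code point is below 256.
per_char = bytes(ord(ch) for ch in s)
assert per_char == s.encode("latin-1")

# Different codecs give different bytes and therefore different digests.
latin1_digest = md5_hex(s, "latin-1")
utf8_digest = md5_hex(s, "utf-8")
print(latin1_digest != utf8_digest)
```

    If the other side hashes the UTF-8 bytes of the request, the Latin-1 digest will not match, so the choice of codec is part of the protocol rather than an implementation detail.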

  • 2020-11-29 09:29

    The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

    >>> print s.encode('ascii', errors='backslashreplace')
    ABRA\xc3O JOS\xc9
    >>> print s.encode('ascii', errors='xmlcharrefreplace')
    ABRA&#195;O JOS&#201;
    >>> print s.encode('unicode-escape')
    ABRA\xc3O JOS\xc9
    >>> print s.encode('punycode')
    ABRAO JOS-jta5e
    

    All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
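
    As a rough illustration of "can be reversed", the sketch below pairs each of the four encodings with one possible decoder and checks the round trip. It assumes Python 3 (the session above is Python 2), and the dictionary round_trips and its lambdas are names invented for this example. The backslash and XML variants are only unambiguous when the original text does not already contain backslash escapes or character references of its own.

```python
import html

s = "ABRA\xc3O JOS\xc9"

# Each pair is (encode text to ASCII-only bytes, decode those bytes back).
round_trips = {
    "backslashreplace": (
        lambda t: t.encode("ascii", errors="backslashreplace"),
        lambda b: b.decode("unicode_escape"),
    ),
    "xmlcharrefreplace": (
        lambda t: t.encode("ascii", errors="xmlcharrefreplace"),
        lambda b: html.unescape(b.decode("ascii")),
    ),
    "unicode_escape": (
        lambda t: t.encode("unicode_escape"),
        lambda b: b.decode("unicode_escape"),
    ),
    "punycode": (
        lambda t: t.encode("punycode"),
        lambda b: b.decode("punycode"),
    ),
}

results = {}
for name, (to_ascii, from_ascii) in round_trips.items():
    encoded = to_ascii(s)
    decoded = from_ascii(encoded)
    results[name] = (encoded, decoded)
    # Show the ASCII form and whether decoding recovered the original text.
    print(name, encoded, decoded == s)
```

    In every case the reverse step is something other than a plain decode('ascii'), which is the point made above.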

    See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.


    As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

    >>> s.encode('utf-8')
    'ABRA\xc3\x83O JOS\xc3\x89'
    >>> s.encode('cp1252')
    'ABRA\xc3O JOS\xc9'
    >>> s.encode('iso-8859-15')
    'ABRA\xc3O JOS\xc9'
    

    The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes them, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is something else, say, the open function or a web browser that you're serving a page to, things are more complicated, and there's no easy answer without a lot more information.
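
    To show why the choice of character set matters, here is a small sketch (Python 3 assumed, variable names invented for the example) of what happens when the consumer guesses a different 8-bit encoding than the one the producer used.

```python
s = "ABRA\xc3O JOS\xc9"

utf8_bytes = s.encode("utf-8")
cp1252_bytes = s.encode("cp1252")

# Decoding with the codec that produced the bytes recovers the text.
assert utf8_bytes.decode("utf-8") == s
assert cp1252_bytes.decode("cp1252") == s

# Decoding UTF-8 bytes as cp1252 succeeds silently but yields mojibake:
# each two-byte UTF-8 sequence turns into two unrelated characters.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake == s)

# Decoding cp1252 bytes as UTF-8 fails outright for this string, because
# the lone byte 0xC3 is not followed by a valid continuation byte.
try:
    cp1252_bytes.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False
print(strict_utf8_ok)
```

    The silent case is the dangerous one: nothing raises, so the mismatch only shows up later as garbled names in the output.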
