How to decode an ASCII string with backslash x (\x) codes

暖寄归人 2021-02-15 14:24

I am trying to decode a Brazilian Portuguese text:

'Demais Subfun\xc3\xa7\xc3\xb5es 12'

It should be

1 Answer
  •  夕颜 (original poster)
    2021-02-15 15:13

    You have binary data that is not ASCII encoded. The \xhh escape sequences indicate that your data is encoded with a different codec. What you are seeing is the representation Python produces with the repr() function: a string that can be reused as a Python literal to re-create the exact same value. This representation is very useful when debugging a program.

    In other words, the \xhh escape sequences represent individual bytes, and the hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5 that do not map to printable ASCII characters, so Python uses the \xhh notation instead.
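
To look at those bytes directly, here is a minimal sketch. It assumes Python 3, where the value has to be written as a bytes literal (b'...'); the names data and non_ascii are only illustrative.

```python
# Assumption: Python 3, so the raw data is written as a bytes literal.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Iterating over a bytes object yields integers.
# Keep only the values outside the 7-bit ASCII range.
non_ascii = [byte for byte in data if byte > 0x7F]

print([format(byte, '02X') for byte in non_ascii])  # ['C3', 'A7', 'C3', 'B5']

# repr() shows printable ASCII bytes as characters and the rest as \xhh.
print(repr(data))
```

The second print should show the same \xhh notation as in the question, since repr() only falls back to escapes for bytes it cannot show as printable ASCII.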

    What you have is UTF-8 data, so decode it as such (the session below is Python 2):

    >>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
    u'Demais Subfun\xe7\xf5es 12'
    >>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
    Demais Subfunções 12
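
The session above relies on Python 2, where a plain string literal holds bytes. A rough Python 3 equivalent might look like the sketch below. The second half covers a case the answer does not discuss and is an assumption about how the data may have arrived: a str in which each \xhh ended up as a separate character.

```python
# Assumption: Python 3, where bytes and text are separate types.
raw = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'
text = raw.decode('utf-8')
print(text)  # Demais Subfunções 12

# Hypothetical case: the value arrived as a str whose characters are
# really byte values (each \xhh became one code point). Latin-1 maps
# code points 0-255 to the same byte values, so encoding with it
# recovers the original bytes, which can then be decoded as UTF-8.
mangled = 'Demais Subfun\xc3\xa7\xc3\xb5es 12'
repaired = mangled.encode('latin-1').decode('utf-8')
print(repaired)  # Demais Subfunções 12
```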
    

    The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
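
The mapping from each byte pair to its code point can be worked out by hand. The sketch below assumes Python 3; decode_two_byte is a hypothetical helper written for illustration, not a library function. It only handles the 2-byte form of UTF-8 (110xxxxx 10xxxxxx) and skips other validation such as overlong encodings.

```python
import unicodedata

def decode_two_byte(lead, cont):
    """Decode one 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) by hand."""
    if lead & 0xE0 != 0xC0:
        raise ValueError("not a 2-byte lead byte")
    if cont & 0xC0 != 0x80:
        raise ValueError("not a continuation byte")
    # 5 payload bits from the lead byte, 6 from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)

for lead, cont in [(0xC3, 0xA7), (0xC3, 0xB5)]:
    code_point = decode_two_byte(lead, cont)
    print('U+%04X' % code_point, unicodedata.name(chr(code_point)))
# U+00E7 LATIN SMALL LETTER C WITH CEDILLA
# U+00F5 LATIN SMALL LETTER O WITH TILDE
```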

    ASCII happens to be a subset of the UTF-8 codec, which is why all the other letters can be represented as such in the Python repr() output.
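
A quick way to check the subset relationship, again assuming Python 3 (the variable name is illustrative):

```python
# Assumption: Python 3. Pure-ASCII text yields identical bytes
# whether it is encoded as ASCII or as UTF-8.
ascii_text = 'Demais Subfun'
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')
print(same_bytes)  # True

# Non-ASCII text has no ASCII encoding at all.
try:
    'ções'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```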
