Python: get a character code in a different encoding?

情书的邮戳 2021-02-04 06:05

Given a character code as an integer in one encoding, how can you get the character code in, say, UTF-8, again as an integer?

3 Answers
  • 2021-02-04 06:38

    Here's an example of how the encode/decode dance works:

    >>> s = b'd\x06'               # perhaps start with bytes encoded in utf-16 (little-endian, no BOM)
    >>> map(ord, s)                # show those bytes as integers
    [100, 6]
    >>> u = s.decode('utf-16-le')  # turn the bytes into unicode; the explicit byte order avoids platform dependence
    >>> print u                  # show what the character looks like
    ٤
    >>> print ord(u)             # show the unicode code point as an integer
    1636
    >>> t = u.encode('utf-8')    # turn the unicode into bytes with a different encoding
    >>> map(ord, t)              # show that encoding as integers
    [217, 164]
    

    Hope this helps :-)

    If you need to construct the unicode directly from an integer, use unichr:

    >>> u = unichr(1636)
    >>> print u
    ٤
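
    The sessions above are Python 2. As a rough sketch of the same round trip in Python 3 (where iterating over bytes already yields integers and chr replaces unichr), something like the following should work. The variable names are only illustrative.

```python
# Python 3 sketch of the encode/decode round trip shown above.
raw = b'd\x06'                            # two bytes: UTF-16, little-endian, no BOM
byte_values = list(raw)                   # iterating over bytes yields integers
char = raw.decode('utf-16-le')            # bytes -> str (Unicode text)
code_point = ord(char)                    # Unicode code point as an integer
utf8_values = list(char.encode('utf-8'))  # the same character as UTF-8 byte values

print(byte_values, code_point, utf8_values)

# chr() is the Python 3 counterpart of Python 2's unichr()
assert chr(code_point) == char
```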
    
  • 2021-02-04 06:42

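    UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a one-byte string, decode it with the source encoding, and use ord() to get the code point. This works for single-byte source encodings, where the code fits in the range 0 to 255.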

    >>> ord(chr(145).decode('koi8-r'))
    9618
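
    In Python 3, chr() returns text rather than a byte string, so the byte has to be built with bytes([...]) instead. A minimal sketch, assuming a single-byte source encoding; code_point_from_byte is a hypothetical helper name, not a standard function:

```python
def code_point_from_byte(code, encoding):
    """Return the Unicode code point of one byte value in a single-byte encoding."""
    # bytes([code]) builds a one-byte bytes object; it raises ValueError
    # if code is outside the range 0..255.
    return ord(bytes([code]).decode(encoding))


print(code_point_from_byte(145, 'koi8-r'))
```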
    
  • 2021-02-04 06:55

    You can only map an "integer number" from one encoding to another if they are both single-byte encodings.

    Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):

    >>> s = u'€'
    >>> s.encode('iso-8859-15')
    '\xa4'
    >>> s.encode('cp1252')
    '\x80'
    >>> ord(s.encode('cp1252'))
    128
    >>> ord(s.encode('iso-8859-15'))
    164
    

    Note that ord is being used here to get the ordinal of the encoded byte. Using ord on the original unicode string gives its Unicode code point instead:

    >>> ord(s)
    8364
    

    The reverse operation to ord can be done using either chr (for byte values in the range 0 to 255, returning a byte string) or unichr (for code points in the range 0 to sys.maxunicode, returning a unicode string):

    >>> print chr(65)
    A
    >>> print unichr(8364)
    €
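
    The single-byte mapping described above can be wrapped in a small function. This is only a sketch in Python 3 (the sessions above are Python 2), and recode_byte is a made-up name for illustration. It decodes the byte with the source encoding and re-encodes the character with the target encoding.

```python
def recode_byte(code, source, target):
    """Map a byte value from one single-byte encoding to another."""
    char = bytes([code]).decode(source)   # byte value -> character
    encoded = char.encode(target)         # raises UnicodeEncodeError if the target lacks it
    if len(encoded) != 1:
        raise ValueError('%r needs %d bytes in %s' % (char, len(encoded), target))
    return encoded[0]                     # indexing bytes yields an integer


print(recode_byte(164, 'iso-8859-15', 'cp1252'))
```

    If the target turns out to be a multi-byte encoding for that character, the sketch raises ValueError rather than returning a misleading single integer.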
    

    For multi-byte encodings, a simple "integer number" mapping is usually not possible.

    Here's the same example as above, but using "iso-8859-15" and "utf-8":

    >>> s = u'€'
    >>> s.encode('iso-8859-15')
    '\xa4'
    >>> s.encode('utf-8')
    '\xe2\x82\xac'
    >>> [ord(c) for c in s.encode('iso-8859-15')]
    [164]
    >>> [ord(c) for c in s.encode('utf-8')]
    [226, 130, 172]
    

    The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
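
    If a single integer is still wanted for a multi-byte result, one possible convention (an assumption here, not something the question specifies) is to pack the encoded bytes into one big-endian number. A Python 3 sketch, with utf8_codes as a hypothetical helper name:

```python
def utf8_codes(code_point):
    """Return the UTF-8 byte values for a Unicode code point."""
    return list(chr(code_point).encode('utf-8'))


euro = utf8_codes(8364)                       # the three byte values for the euro sign
packed = int.from_bytes(bytes(euro), 'big')   # assumption: pack the bytes big-endian
print(euro, hex(packed))
```

    For code points in the range 0-127 the list has a single element equal to the code point itself, which is the trivial ASCII-compatible case mentioned above.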
