Python Latin Characters and Unicode

抹茶落季 2020-12-22 03:08

I have a tree structure in which keywords may contain some latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list un

2 Answers
  • 2020-12-22 03:50

    You don't have unicode objects; you have byte strings containing UTF-8 encoded text. Printing such byte strings to your terminal works only if the terminal is configured to handle UTF-8 text.

    When a list is converted to a string, each element is shown as its representation, that is, the result of the repr() function. The representation of a string object uses escape codes for any bytes outside the printable ASCII range (newlines are shown as \n, for example), so your UTF-8 bytes appear as \xhh escape sequences.
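    The list-versus-element difference can be sketched as follows. This is only an illustration, and it assumes Python 3, where bytes plays the role of the Python 2 byte string discussed in this answer; the names keywords, as_list and as_text are made up for the example.

```python
# Python 3 sketch: bytes objects stand in for Python 2 byte strings.
# The first element is "université" encoded as UTF-8.
keywords = ["universit\u00e9".encode("utf-8"), b"foo"]

# str() on a list uses repr() for each element, so the two bytes that
# encode the accented letter show up as \xc3\xa9 escapes.
as_list = str(keywords)

# Decoding a single element yields real text with the accented letter.
as_text = keywords[0].decode("utf-8")

print(as_list)
print(len(as_text))
```

    The escapes appear only in the list's representation; the data itself is unchanged.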

    If you were using unicode objects, the representation would still use \xhh escapes, but only for codepoints in the Latin-1 range outside ASCII; all other codepoints are shown with \uhhhh or \Uhhhhhhhh escapes, depending on their value. When printing, Python automatically encodes such values to the correct encoding for your terminal:

    >>> u'université'
    u'universit\xe9'
    >>> len(u'université')
    10
    >>> print u'université'
    université
    

    Compare this to byte strings:

    >>> 'université'
    'universit\xc3\xa9'
    >>> len('université')
    11
    >>> 'université'.decode('utf8')
    u'universit\xe9'
    >>> print 'université'
    université
    

    Note that the length also reflects that the é codepoint is encoded as two bytes. Incidentally, it was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the Python session, because it is configured to use UTF-8; Python detected this and decoded the bytes when I defined the u'..' unicode literal.
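    The length difference can be checked directly. The sketch below again assumes Python 3 (str for text, bytes for encoded data) rather than the Python 2 session shown above, and its variable names are illustrative only.

```python
# Text: ten codepoints, the last one being U+00E9 (e with acute accent).
text = "universit\u00e9"

# Encoded data: UTF-8 turns U+00E9 into the two bytes 0xC3 0xA9.
raw = text.encode("utf-8")

text_length = len(text)   # counts codepoints
byte_length = len(raw)    # counts bytes

# Decoding with the same codec recovers the original text.
round_trip = raw.decode("utf-8")

print(text_length, byte_length, round_trip == text)
```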

    I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:

    • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

    • The Python Unicode HOWTO

    • Pragmatic Unicode by Ned Batchelder

  • 2020-12-22 04:11

    When you print a list, you get the repr of the items it contains, which for strings is different from their contents:

    >>> a = ['foo', 'bär']
    >>> print(a[0])
    foo
    >>> print(repr(a[0]))
    'foo'
    >>> print(a[1])
    bär
    >>> print(repr(a[1]))
    'b\xc3\xa4r'
    

    The output of repr is meant to be programmer-friendly, not user-friendly, hence the quotes and the hex escapes. To print a list in a user-friendly way, build the output string yourself, for example:

    >>> print '[', ', '.join(a), ']'
    [ foo, bär ]
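
    The same idea can be wrapped in a small helper. This is a hedged sketch for Python 3, not part of the original answer: format_keywords is a hypothetical name, and it assumes any bytes items are UTF-8 encoded unless another encoding is passed in.

```python
def format_keywords(items, encoding="utf-8"):
    """Return a bracketed, comma-separated listing of the items as text.

    Hypothetical helper: bytes items are decoded with `encoding`
    (assumed to be UTF-8 by default); str items are used unchanged.
    """
    decoded = [
        item.decode(encoding) if isinstance(item, bytes) else item
        for item in items
    ]
    # Join the text itself rather than relying on the list's repr().
    return "[ " + ", ".join(decoded) + " ]"


# The second item is "bär" encoded as UTF-8 bytes.
listing = format_keywords([b"foo", "b\u00e4r".encode("utf-8")])
```

    Decoding before joining keeps the escapes out of the output, because the result is built from text rather than from each item's representation.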
    