问题
I have a tree structure in which keywords may contain some latin characters. I have a function which loops through all leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:
print "Adding: " + self.keyword
leaf_list.append(self.keyword)
print leaf_list
If the keyword in this case is université
, then my output is:
Adding: université
['universit\xc3\xa9']
It appears that the print function properly shows the latin character, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard latin characters, not the decoded version of them.
回答1:
You don't have unicode objects, but byte strings with UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When converting a list to string, the list contents are shown as representations; the result of the repr()
function. The representation of a string object uses escape codes for any bytes outside of the printable ASCII range; newlines are replaced by \n
for example. Your UTF-8 bytes are represented by \xhh
escape sequences.
If you were using Unicode objects, the representation would use \xhh
escapes still, but for Unicode codepoints in the Latin-1 range (outside ASCII) only (the rest are shown with \uhhhh
and \Uhhhhhhhh
escapes depending on their codepoint); when printing Python automatically encodes such values to the correct encoding for your terminal:
>>> u'université'
u'universit\xe9'
>>> len(u'université')
10
>>> print u'université'
université
Compare this to byte strings:
>>> 'université'
'universit\xc3\xa9'
>>> len('université')
11
>>> 'université'.decode('utf8')
u'universit\xe9'
>>> print 'université'
université
Note that the length reflects that the é
codepoint is encoded to two bytes as well. It was my terminal that presented Python with the \xc3\xa9
bytes when pasting the é
character into the Python session, by the way, as it is configured to use UTF-8, and Python has detected this and decoded the bytes when I defined a u'..'
Unicode object literal.
I strongly recommend you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
回答2:
When you print a list, you get the repr
of the items it contains, which for strings is different from their contents:
>>> a = ['foo', 'bär']
>>> print(a[0])
foo
>>> print(repr(a[0]))
'foo'
>>> print(a[1])
bär
>>> print(repr(a[1]))
'b\xc3\xa4r'
The output of repr
is supposed to be programmer-friendly, not user-friendly, hence the quotes and the hex codes. To print a list in a user-friendly way, write your own loop. E.g.
>>> print '[', ', '.join(a), ']'
[ foo, bär ]
来源:https://stackoverflow.com/questions/23220245/python-latin-characters-and-unicode