Test whether a string is Unicode, which UTF encoding it uses, and get its length in bytes?

Submitted by 独自空忆成欢 on 2019-11-27 00:57:17

Question


I need to test whether a string is Unicode, and then whether it is UTF-8. After that, I need the string's length in bytes, including the BOM if it has one. How can this be done in Python?

Also, for didactic purposes, what does a byte list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.

Later edit: pprint does that pretty well.


Answer 1:


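# Python 2: `string` is a str, i.e. a sequence of bytes
try:
    # decode() raises UnicodeDecodeError (a subclass of UnicodeError)
    # when the bytes are not valid UTF-8
    string.decode('utf-8')
    # len() of a byte string counts every byte, including a BOM if present
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"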

In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".

In Python 3, the names change: bytes is a sequence of bytes and str is a sequence of characters.
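For comparison, here is a minimal Python 3 sketch of the same check. The helper name describe_utf8 is hypothetical and not part of the answer; it assumes the input is already a bytes object. It also prints the byte list for the U+00E9 example, which is one way to see how a UTF-8 string is represented.

```python
def describe_utf8(data: bytes) -> str:
    """Report whether a byte string is valid UTF-8 and how long it is in bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    # len() of a bytes object counts bytes, not characters
    return "UTF-8, length %d bytes" % len(data)


text = "\u00e9"                  # the single character U+00E9
encoded = text.encode("utf-8")   # encoding a str gives bytes
print(list(encoded))             # [195, 169], i.e. 0xC3 0xA9
print(describe_utf8(encoded))    # UTF-8, length 2 bytes
print(describe_utf8(b"\xe9"))    # not UTF-8 (a lone 0xE9 byte is invalid)
```

Note that pure ASCII input also passes this check, since ASCII is a subset of UTF-8.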




Answer 2:


To check whether it is Unicode:

>>> a = u'F'
>>> isinstance(a, unicode)
True

To check whether it is UTF-8 or ASCII:

>>> import chardet  # third-party package; the result is a heuristic guess
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'
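If a third-party detector is not available, one rough alternative is to look for a byte order mark using the constants in the standard codecs module. This is only a sketch in Python 3: the function name sniff_bom and the returned encoding labels are assumptions made for illustration, and a BOM is merely a hint, since most UTF-8 data carries no BOM at all.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding label, BOM length in bytes), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(codecs.BOM_UTF8 + b"AA"))  # ('utf-8-sig', 3)
print(sniff_bom(b"AA"))                    # (None, 0)
```

When no BOM is found, the data may still be UTF-8; attempting to decode it is the more reliable test.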



Answer 3:


I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.

For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:

print len(my_unicode_string.encode('utf-8'))

Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.
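As a rough Python 3 illustration of the same measurement, the sketch below also covers the BOM case from the question. The helper name utf8_byte_length and its with_bom parameter are hypothetical; the only library feature relied on is the standard "utf-8-sig" codec, which prepends the 3-byte UTF-8 BOM when encoding.

```python
def utf8_byte_length(text: str, with_bom: bool = False) -> int:
    """Length in bytes of `text` once encoded as UTF-8."""
    # "utf-8-sig" writes the BOM (EF BB BF) before the encoded text
    codec = "utf-8-sig" if with_bom else "utf-8"
    return len(text.encode(codec))


print(utf8_byte_length("\u00e9"))                 # 2
print(utf8_byte_length("\u00e9", with_bom=True))  # 5
```

If the data is already a bytes object read from a file, len() on it already includes any BOM bytes it contains.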



Source: https://stackoverflow.com/questions/12053107/test-a-string-if-its-unicode-which-utf-standard-is-and-get-its-length-in-bytes
