Question
Consider this.
# -*- coding: utf-8 -*-
import re

data = "cdbsb \xe2\x80\xa6 abc"
print data
# prints: cdbsb … abc
print re.findall(ur"[\u2026]", data)
Why can't re find this Unicode character? I have already checked that \xe2\x80\xa6 === … === U+2026.
Answer 1:
My guess is that the issue is that data is a byte string. Your console encoding is probably utf-8, so when you print the string, the console interprets its bytes as utf-8 and displays the result (you can check the encoding with sys.stdout.encoding). That is why you see the character …. But most probably re does not do this decoding for you.
If you decode data from utf-8 into a unicode string, you get the desired result when using re.findall. Example -
>>> import re
>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print re.findall(ur"[\u2026]", data.decode('utf-8') )
[u'\u2026']
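For readers on Python 3, the same distinction can be sketched as below. This is only an illustration and not part of the original answer: it assumes Python 3, where the byte string needs a b prefix and the pattern is an ordinary str, and the helper name find_ellipsis is hypothetical.

```python
import re

def find_ellipsis(raw, encoding="utf-8"):
    """Decode raw bytes to text, then search the text for U+2026.

    find_ellipsis is a hypothetical helper used only for this sketch.
    """
    text = raw.decode(encoding)  # bytes -> str (Unicode)
    return re.findall("[\u2026]", text)

data = b"cdbsb \xe2\x80\xa6 abc"  # the UTF-8 bytes from the question

matches = find_ellipsis(data)
print(matches)

# Python 3 refuses to apply a str pattern to bytes instead of
# silently finding nothing, so the mismatch shows up as an error.
try:
    re.findall("[\u2026]", data)
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True
print(mixed_types_rejected)
```

The design point is the same as in the answer: decode once at the boundary, then keep both the pattern and the data as Unicode text.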
Answer 2:
data is of type str and holds the UTF-8 bytes of the character, written as hex escapes (these bytes are not ASCII), whereas the search term is of type unicode. print encodes unicode objects with sys.stdout.encoding, while a str is written to the console byte for byte. When I try to print data as it is, the output differs from data.decode('utf-8'). I am using Python 2.7.
import re
import sys
data = "cdbsb \xe2\x80\xa6 abc"
search = ur"[\u2026]"
print sys.stdout.encoding
## windows-1254
print data, type(data)
## cdbsb … abc <type 'str'>
print data.decode(sys.stdout.encoding)
## cdbsb … abc
print data.decode('utf-8')
## cdbsb … abc
print search, type(search)
## […] <type 'unicode'>
print re.findall(search, data.decode('utf-8'))
## [u'\u2026']
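Why the console encoding matters can be seen by decoding the same three bytes in two ways. A minimal sketch, assuming Python 3 and using Python's cp1254 codec to stand in for the windows-1254 console reported above; the variable names are illustrative only.

```python
raw = b"\xe2\x80\xa6"  # the three bytes from the question

as_utf8 = raw.decode("utf-8")     # read as UTF-8: one character, U+2026
as_cp1254 = raw.decode("cp1254")  # read as cp1254: one character per byte

print(len(as_utf8), len(as_cp1254))
print(as_utf8 == "\u2026")
print(as_cp1254 == as_utf8)
```

Only the UTF-8 reading yields the single character that the pattern [\u2026] is looking for, which is why the answer decodes with 'utf-8' before calling re.findall.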
Answer 3:
If you go through the link provided by nhahtdh, Solving Unicode Problems in Python 2.7, you can see that the original string was in bytes and we were searching for Unicode, so it should never have worked.
encode(): gets you from Unicode → bytes
decode(): gets you from bytes → Unicode
Following these we can solve it in 2 ways.
# -*- coding: utf-8 -*-
import re

# Way 1: decode the bytes to unicode, then search with a unicode pattern
data = "cdbsb \xe2\x80\xa6 abc".decode("utf-8")
print data
print re.findall(ur"[\u2026]", data)
# encode the unicode match back to UTF-8 bytes for printing
print re.findall(ur"[\u2026]", data)[0].encode("utf-8")

# Way 2: leave the data as bytes and search with a byte pattern
data1 = "cdbsb \xe2\x80\xa6 abc"
print data1
print re.findall(r"\xe2\x80\xa6", data1)[0]
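The encode()/decode() directions described above can be checked with a round trip, and the bytes-only approach has a direct Python 3 counterpart. This is a hedged sketch, assuming Python 3 (where a bytes pattern is written with the rb prefix); it is not the answerer's code, and the variable names are illustrative.

```python
import re

text = "cdbsb \u2026 abc"            # str (Unicode)

raw = text.encode("utf-8")           # encode(): Unicode -> bytes
round_trip = raw.decode("utf-8")     # decode(): bytes -> Unicode

print(raw)
print(round_trip == text)

# Searching without decoding: both the pattern and the data stay bytes.
byte_hits = re.findall(rb"\xe2\x80\xa6", raw)
print(byte_hits)
```

Either approach works as long as the pattern and the data are of the same type; the failure in the question comes from mixing the two.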
Answer 4:
An alternative solution:
>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print data
cdbsb … abc
>>> if u"\u2026".encode('utf8') in data: print True
...
True
>>> if u"\u2026" in data.decode('utf8'): print True
...
True
Source: https://stackoverflow.com/questions/33031009/unicode-search-not-working