Unicode search not working

Posted by 我只是一个虾纸丫 on 2019-12-09 23:20:17

Question


Consider this.

# -*- coding: utf-8 -*-
import re

data = "cdbsb \xe2\x80\xa6 abc"
print data
# prints: cdbsb … abc
#               ^
print re.findall(ur"[\u2026]", data)

Why can't re find this Unicode character? I have already checked that

\xe2\x80\xa6 === … === U+2026

Answer 1:


My guess is that the issue arises because data is a byte string. Your console encoding is probably UTF-8, so when you print the string, the console interprets the bytes as UTF-8 and displays the character … (you can check the encoding with sys.stdout.encoding).

But re most probably does not do this decoding for you.

If you decode data from UTF-8 into a unicode string, you get the desired result from re.findall. Example:

>>> import re
>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print re.findall(ur"[\u2026]", data.decode('utf-8'))
[u'\u2026']
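
As a minimal sketch of the same decode-then-search step, the helper below wraps it in a function. The name find_ellipsis is hypothetical and not part of the answer, and the code assumes the input really is UTF-8. It uses b and u prefixes instead of ur so that it also runs on Python 3, where ur literals are not accepted.

```python
# -*- coding: utf-8 -*-
import re


def find_ellipsis(raw_bytes, encoding="utf-8"):
    """Decode raw bytes to text first, then search for U+2026.

    find_ellipsis is a hypothetical helper; the default encoding is an
    assumption about how the bytes were produced.
    """
    text = raw_bytes.decode(encoding)
    return re.findall(u"[\u2026]", text)


data = b"cdbsb \xe2\x80\xa6 abc"
matches = find_ellipsis(data)
print(repr(matches))
```

If the bytes are not valid in the given encoding, decode raises UnicodeDecodeError, so the caller finds out about a wrong guess instead of silently getting no match.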



Answer 2:


data is of type str and holds raw bytes (written here as hex escapes), whereas the search pattern is of type unicode. By default, print encodes unicode strings using sys.stdout.encoding. When I print data as it is, the output differs from that of data.decode('utf-8'). I am using Python 2.7.

import re
import sys

data = "cdbsb \xe2\x80\xa6 abc"
search = ur"[\u2026]"

print sys.stdout.encoding
## windows-1254

print data, type(data)
## cdbsb … abc <type 'str'>

print data.decode(sys.stdout.encoding)
## cdbsb … abc

print data.decode('utf-8')
## cdbsb … abc

print search, type(search)
## […] <type 'unicode'>

print re.findall(search, data.decode('utf-8'))
## [u'\u2026']
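
To make the role of the encoding concrete, here is a minimal sketch that decodes the same three bytes under two codecs. It assumes Python's codec name "cp1254" for windows-1254. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
raw = b"\xe2\x80\xa6"

# Interpreted as UTF-8, the three bytes form one code point, U+2026.
as_utf8 = raw.decode("utf-8")

# Interpreted as windows-1254, each byte becomes its own character.
as_cp1254 = raw.decode("cp1254")

print("%d character(s) as UTF-8" % len(as_utf8))
print("%d character(s) as cp1254" % len(as_cp1254))
```

Only the UTF-8 interpretation yields the single character that the pattern [\u2026] is looking for.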



Answer 3:


If you go through the link provided by nhahtdh, "Solving Unicode Problems in Python 2.7", you can see that the original string was in bytes while we were searching for unicode. So it should never have worked.

encode(): Gets you from Unicode → bytes

decode(): Gets you from bytes → Unicode
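
The two directions can be checked with a short round trip. This is only a sketch, assuming UTF-8 as the encoding. The u and b prefixes keep it runnable on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u"\u2026"                   # one Unicode code point

encoded = text.encode("utf-8")     # Unicode -> bytes
decoded = encoded.decode("utf-8")  # bytes -> Unicode

print(repr(encoded))
print(decoded == text)
```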

Following these, we can solve it in two ways.

# -*- coding: utf-8 -*-
import re

# Way 1: decode the bytes to unicode, then search with a unicode pattern
data = "cdbsb \xe2\x80\xa6 abc".decode("utf-8")
print data
print re.findall(ur"[\u2026]", data)
print re.findall(ur"[\u2026]", data)[0].encode("utf-8")  # re-encode the unicode match to bytes for printing

# Way 2: leave the data as bytes and search for the byte sequence
data1 = "cdbsb \xe2\x80\xa6 abc"
print data1
print re.findall(r"\xe2\x80\xa6", data1)[0]
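
For the bytes-only route, the byte pattern does not have to be typed by hand. The sketch below derives it from the character by encoding it and passing the result through re.escape. This is an illustrative variation rather than the answer's own code, and it assumes the data is UTF-8.

```python
# -*- coding: utf-8 -*-
import re

data1 = b"cdbsb \xe2\x80\xa6 abc"  # stays as bytes

# Build a bytes pattern from the character instead of hard-coding the escapes.
needle = u"\u2026".encode("utf-8")
pattern = re.escape(needle)

byte_matches = re.findall(pattern, data1)
print(repr(byte_matches))
```

Pattern and subject are both bytes here, so no decoding step is needed, but each match is a byte string and has to be decoded before it is treated as text.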



Answer 4:


An alternative solution:

>>> data = "cdbsb \xe2\x80\xa6 abc"
>>> print data 
cdbsb … abc
>>> if u"\u2026".encode('utf8') in data: print True
... 
True
>>> if u"\u2026" in data.decode('utf8'): print True
... 
True


Source: https://stackoverflow.com/questions/33031009/unicode-search-not-working
