How to properly print a list of unicode characters in python?

Submitted by 寵の児 on 2020-08-07 05:21:09

Question


I am trying to search for emoticons in python strings. So I have, for example,

em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 💰💰🚀'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")

yes, the emoticon is there

and if I search for em_test in

'This is a test string 💰💰🚀'

I can actually find it.

So I have made a csv file with all the emoticons I want defined by their unicode. The CSV looks like this:

\U0001F600

\U0001F601

\U0001F602

\U0001F923

and when I import it and print it, I actually do not get the emoticons but rather just the text representation:

['\\U0001F600',
 '\\U0001F601',
 '\\U0001F602',
 '\\U0001F923',
...
]

and hence I cannot use this to search for these emoticons in another string... I know that the double backslash \\ is just the representation of a single backslash, but somehow it is not being interpreted as a Unicode escape... I do not know what I'm missing.

Any suggestions?


Answer 1:


You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.

Just for fun, I'll also use unicodedata to get the names of those emojis.

import unicodedata as ud

emojis = [
    '\\U0001F600',
    '\\U0001F601',
    '\\U0001F602',
    '\\U0001F923',
]

for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)

output

\U0001F600 GRINNING FACE 😀
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY 😂
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣

This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.

You can make the decoding a little more robust by using

u.encode('Latin1').decode('unicode-escape')

but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.
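A minimal, self-contained sketch of that binary-mode approach (the file name "emojis.csv" is made up for illustration; the first block just creates a sample file so the example runs on its own):

```python
# Write a sample file with one escape sequence per line (text mode).
with open('emojis.csv', 'w') as f:
    f.write('\\U0001F600\n\\U0001F923\n')

# Read it back in binary mode: each line is already bytes, so no
# .encode() step is needed before decoding the escape sequences.
with open('emojis.csv', 'rb') as f:
    emojis = [line.strip().decode('unicode-escape') for line in f]

print(emojis)  # → ['😀', '🤣']
```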




Answer 2:


1. Keeping your CSV as-is

It's a somewhat bloated solution, but using ast.literal_eval works:

import ast

s = '\\U0001F600'

x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)

I get 0x1f600 (which is the correct character code) and the emoticon character (😀). (I had to copy/paste the character from my console into this answer, but that's a console issue on my end; otherwise it works.)

Just surround the sequence with quotes so that ast parses the input as a string literal.
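Applied to a whole list of escape sequences, as they would come back from the CSV, the same idea might look like this (the rows here are hard-coded stand-ins for csv.reader output):

```python
import ast

# Hypothetical rows as csv.reader would return them: literal backslash-U text.
rows = ['\\U0001F600', '\\U0001F601', '\\U0001F923']

# Wrap each sequence in quotes so literal_eval parses it as a string literal.
emojis = [ast.literal_eval('"{}"'.format(r)) for r in rows]
print(emojis)  # → ['😀', '😁', '🤣']
```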

2. Using character codes directly

Maybe you'd be better off storing the character codes themselves instead of the \U escape format:

print(chr(0x1F600))

does exactly the same thing (so ast is slightly overkill).

your csv could contain:

0x1F600
0x1F601
0x1F602
0x1F923

then chr(int(row[0], 16)) would do the trick when reading it. For example, with one code per row in the CSV:

import csv

with open("codes.csv") as f:
    cr = csv.reader(f)
    codes = [int(row[0], 16) for row in cr]
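Once the codes are loaded, converting them back to characters and reusing the membership test from the question could look like this (the codes list stands in for what the CSV read would produce; the test string is made up):

```python
codes = [0x1F600, 0x1F601, 0x1F923]   # as read from the hypothetical CSV
em_test = [chr(c) for c in codes]     # convert code points to characters

test = 'This is a test string 🤣'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")   # prints this branch: 🤣 is 0x1F923
else:
    print("no, the emoticon is not there")
```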


Source: https://stackoverflow.com/questions/47263783/how-to-properly-print-a-list-of-unicode-characters-in-python
