Question
I have a list of emoticon codes stored as plain text in a file named UTF32.red.codes.
The content of the file is:
\U0001F600
\U0001F601
\U0001F602
\U0001F603
\U0001F604
\U0001F605
\U0001F606
\U0001F609
\U0001F60A
\U0001F60B
Based on a related question, my idea is to build a regular expression from the content of the file in order to match emoticons. This is my minimal working example:
import re
with open('UTF32.red.codes','r') as emof:
    codes = [emo.strip() for emo in emof]
emojis = re.compile(u"(%s)" % "|".join(codes))
string = u'string to check \U0001F601'
found = emojis.findall(string)
print found
found is always empty. Where am I going wrong? I am using Python 2.7.
Answer 1:
Your code will work fine in Python 3 (just change print found to print(found)). However, it won't work in Python 2.7, because its re module has a known bug (see this thread and this issue).
If you still need a Python 2 version of the code, use the regex module, which can be installed with pip2 install regex. Import it with import regex, then replace every re. call with regex. (i.e. regex.compile and regex.findall), and that's it. It should then work.
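For reference, here is a minimal sketch of the Python 3 variant described above. It assumes Python 3.3 or later, where the re module understands \UXXXXXXXX escapes inside a pattern. The io.StringIO object is only a stand-in for the real file, so the snippet runs on its own.

```python
import io
import re

# Stand-in for open('UTF32.red.codes', 'r') (an assumption, to keep this runnable):
# each line holds the literal ten-character text \U0001F60x, not the emoji itself.
emof = io.StringIO("\\U0001F600\n\\U0001F601\n\\U0001F602\n")
codes = [emo.strip() for emo in emof]

# The pattern text is (\U0001F600|\U0001F601|\U0001F602); Python 3's re
# interprets each \U escape as the corresponding code point.
emojis = re.compile("(%s)" % "|".join(codes))

string = 'string to check \U0001F601'
found = emojis.findall(string)
print(found)
```

With the real file, replacing the io.StringIO line with the original with open(...) block should behave the same way.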
Answer 2:
This code works with Python 2.7:
import re
with open('UTF32.red.codes','rb') as emof:
    codes = [emo.decode('unicode-escape').strip() for emo in emof]
emojis = re.compile(u"(%s)" % "|".join(map(re.escape,codes)))
search = ur'string to check \U0001F601'
found = emojis.findall(search)
print found
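The same idea (decode the escape text into real characters first, then escape them for the regex) can be sketched for Python 3 as well. The helper name build_emoji_regex and the sample lines below are hypothetical, introduced only for illustration. In practice the lines would come from the file.

```python
import re

def build_emoji_regex(lines):
    """Hypothetical helper: turn lines such as \\U0001F600 into one alternation regex."""
    chars = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        # 'unicode_escape' converts the ASCII text \U0001F600 into the character U+1F600
        chars.append(text.encode('ascii').decode('unicode_escape'))
    # re.escape guards against any entry that happens to be a regex metacharacter
    return re.compile("(%s)" % "|".join(map(re.escape, chars)))

# Sample data standing in for the file content (an assumption)
sample_lines = ["\\U0001F600\n", "\\U0001F601\n", "\\U0001F609\n"]
emojis = build_emoji_regex(sample_lines)
found = emojis.findall('string to check \U0001F601 and \U0001F609')
print(found)
```

Because the pattern is built from real characters rather than escape text, this approach does not depend on the regex engine understanding \U escapes.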
Source: https://stackoverflow.com/questions/34681364/how-to-build-a-regular-vocabulary-of-emoticons-in-python