How to build a regular vocabulary of emoticons in python?

Posted by 点点圈 on 2019-12-23 13:02:45

Question


I have a list of emoticon codes in a plain-text file, UTF32.red.codes. The content of the file is:

\U0001F600
\U0001F601
\U0001F602
\U0001F603 
\U0001F604
\U0001F605
\U0001F606
\U0001F609
\U0001F60A
\U0001F60B

Based on a related question, my idea is to build a regular expression from the content of the file in order to match emoticons. This is my minimal working example:

import re

with open('UTF32.red.codes','r') as emof:
   codes = [emo.strip() for emo in emof]
   emojis = re.compile(u"(%s)" % "|".join(codes))

string = u'string to check \U0001F601'
found = emojis.findall(string)

print found

found is always empty. Where am I going wrong? I am using Python 2.7.


Answer 1:


Your code works as written in Python 3 (just change print found to print(found)), because since Python 3.3 the re module resolves \U escapes that appear inside a pattern. In Python 2.7, however, it won't work, as its re module has a known bug (see this thread and this issue).
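As a minimal Python 3 sketch of the same idea: the list named lines is a hypothetical stand-in for the lines read from UTF32.red.codes, and the helper name build_emoji_pattern is an assumption made for illustration, not part of the original code. It relies on re (Python 3.3 or later) interpreting the \UXXXXXXXX text inside the pattern.

```python
import re

def build_emoji_pattern(lines):
    """Join escape codes such as \\U0001F600 into one alternation pattern."""
    # Drop surrounding whitespace and skip empty lines
    codes = [line.strip() for line in lines if line.strip()]
    # The codes are still literal backslash text here; re resolves them itself
    return re.compile("(%s)" % "|".join(codes))

# Hypothetical stand-in for the file content (raw strings keep the backslashes)
lines = [r"\U0001F600", r"\U0001F601", r"\U0001F602", r"\U0001F603 "]

emojis = build_emoji_pattern(lines)
string = "string to check \U0001F601"
found = emojis.findall(string)
print(found)
```

With a real file, the same function could be fed the open file object in place of the list.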

If you still need a Python 2 version of the code, use the regex module, which can be installed with pip2 install regex. Import it with import regex, then replace every re. prefix with regex. (i.e. regex.compile and regex.findall), and that's it. It should work.




Answer 2:


This code works with Python 2.7:

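import re

# Read as bytes so each line can be decoded from its escape form (e.g. \U0001F600)
with open('UTF32.red.codes','rb') as emof:
    # 'unicode-escape' turns the literal text \U0001F600 into the actual character
    codes = [emo.decode('unicode-escape').strip() for emo in emof]
    # re.escape keeps any special characters in the codes from being read as regex syntax
    emojis = re.compile(u"(%s)" % "|".join(map(re.escape,codes)))

search = ur'string to check \U0001F601'
found = emojis.findall(search)

print found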
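For comparison, the same decode-then-escape approach might look like this in Python 3. This is a sketch under assumptions: the helper name load_codes and the in-memory lines list (standing in for the content of UTF32.red.codes) are hypothetical. Python 3 strings have no decode method, so each code is encoded to ASCII bytes first.

```python
import re

def load_codes(lines):
    """Turn literal escape text such as \\U0001F600 into the real characters."""
    codes = []
    for line in lines:
        text = line.strip()
        if text:
            # ASCII escape text -> bytes -> the character that the escape names
            codes.append(text.encode("ascii").decode("unicode_escape"))
    return codes

# Hypothetical stand-in for the file content (raw strings keep the backslashes)
lines = [r"\U0001F600", r"\U0001F601 ", r"\U0001F602"]

codes = load_codes(lines)
# The pattern is built from real characters, escaped defensively with re.escape
emojis = re.compile("(%s)" % "|".join(map(re.escape, codes)))

search = "string to check \U0001F601 and \U0001F602"
found = emojis.findall(search)
print(found)
```

Decoding first means the pattern no longer depends on how the regex engine treats \U escapes.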


Source: https://stackoverflow.com/questions/34681364/how-to-build-a-regular-vocabulary-of-emoticons-in-python
