How to extract emojis and flags from strings in Python?

Question

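import emoji

def emoji_lis(string):
    # Check each code point against the emoji package's table.
    # Note: UNICODE_EMOJI is the name used by older releases of the package.
    _entities = []
    for pos, c in enumerate(string):
        if c in emoji.UNICODE_EMOJI:
            print("Matched!!", c, c.encode('ascii', "backslashreplace"))
            _entities.append({
                "location": pos,
                "emoji": c
            })
    return _entities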

emoji_lis("👧🏿 مدیحہ🇵🇰  así, se 😌 ds 💕👭")

Matched!! 👧 \U0001f467
Matched!! 🏿 \U0001f3ff
Matched!! 😌 \U0001f60c
Matched!! 💕 \U0001f495
Matched!! 👭 \U0001f46d

My code works for all other emojis, but how can I detect country flags such as 🇵🇰?

Answer 1:

Here is an article about how Unicode encodes country flags. They are represented as sequences of two regional indicator symbols (code points U+1F1E6 to U+1F1FF), although not every possible combination of two symbols corresponds to a country (and therefore a flag). You could either assume that no "bad" combinations will occur, or maintain (or import) a set of the valid pairs of symbols (currently around 260).
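As a minimal sketch of that encoding, each regional indicator is the corresponding ASCII letter shifted by a fixed offset, so a two-letter country code maps directly to a flag and back. The function names below are hypothetical helpers, not part of any library, and the code does not check whether a pair is an assigned flag.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
REGIONAL_INDICATOR_Z = 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER Z


def is_regional_indicator(ch):
    """Return True if ch is a single regional indicator code point."""
    return len(ch) == 1 and REGIONAL_INDICATOR_A <= ord(ch) <= REGIONAL_INDICATOR_Z


def flag_from_country_code(code):
    """Map a two-letter code such as 'PK' to its pair of regional indicators."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code)


def country_code_from_flag(flag):
    """Inverse mapping; does not check that the pair is an assigned flag."""
    if len(flag) != 2 or not all(is_regional_indicator(ch) for ch in flag):
        raise ValueError("expected two regional indicator symbols")
    return "".join(chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag)


print(flag_from_country_code("PK"))
print(country_code_from_flag(flag_from_country_code("PK")))
```

A set of valid pairs could then be built by applying flag_from_country_code to a list of country codes you trust.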

Then there are regional flags. These are represented as a black flag code point (U+1F3F4) followed by a sequence of tag characters (code points in the range U+E0020 to U+E007E) spelling the region identifier (for example, "gbwls" for the flag of Wales), plus a "cancel tag" code point (U+E007F).
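To illustrate the tag sequence, here is a small sketch that builds and decodes such a flag. It relies on each tag character being the ASCII character shifted by 0xE0000; the function names are hypothetical, and nothing here checks that the identifier corresponds to a flag that fonts actually support.

```python
BLACK_FLAG = "\U0001F3F4"
TAG_BASE = 0xE0000          # tag code point = TAG_BASE + ASCII value
CANCEL_TAG = "\U000E007F"


def subdivision_flag(region_id):
    """Build the tag sequence for a region identifier such as 'gbwls'."""
    region_id = region_id.lower()
    if not region_id or not all(ch.isascii() and ch.isalnum() for ch in region_id):
        raise ValueError("expected ASCII letters or digits")
    tags = "".join(chr(TAG_BASE + ord(ch)) for ch in region_id)
    return BLACK_FLAG + tags + CANCEL_TAG


def region_id_from_flag(flag):
    """Recover the identifier from a tag sequence, or return None."""
    if len(flag) < 3 or not flag.startswith(BLACK_FLAG) or not flag.endswith(CANCEL_TAG):
        return None
    body = flag[1:-1]
    if not all(0xE0020 <= ord(ch) <= 0xE007E for ch in body):
        return None
    return "".join(chr(ord(ch) - TAG_BASE) for ch in body)


wales = subdivision_flag("gbwls")
print(wales, [hex(ord(ch)) for ch in wales])
print(region_id_from_flag(wales))
```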

And, besides all that, you also have regular emojis that look like flags. The aforementioned black flag (U+1F3F4) is one of them, but there is also the triangular flag (U+1F6A9), etc. Most of these you should already be able to detect, since they are just like other emojis.

However, we are not quite done here. There is the issue of composite emojis, which affects some flags but also many other emojis. In your example, you can see that the first emoji in the input string is matched as a "base" girl emoji followed by a brown patch. This is because the dark-skinned girl emoji is made up of two code points, girl (U+1F467) and dark skin tone (U+1F3FF). In many other cases, you need the two code points plus a zero-width joiner (U+200D) in between to specify that you want them merged. And sometimes you also need to add a variation selector (typically 16, U+FE0F) to indicate that a character should be rendered as an emoji. You can read more about this in this article.

In the case of flags, you have for example the rainbow flag (U+1F3F3, U+FE0F, U+200D, U+1F308), which reads "white flag, variation selector 16 (to use the white flag emoji, not text), zero-width joiner, rainbow"; or the pirate flag (U+1F3F4, U+200D, U+2620, U+FE0F), which reads "black flag, zero-width joiner, skull and crossbones, variation selector 16 (to use the skull and crossbones emoji, not text)".
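One way to see these compositions for yourself is to list the code points of a sequence with the standard unicodedata module. This is only an inspection sketch; describe is a hypothetical helper name, and the exact names printed depend on the Unicode version shipped with your Python.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


samples = [
    ("first emoji in the question", "\U0001F467\U0001F3FF"),
    ("rainbow flag", "\U0001F3F3\uFE0F\u200D\U0001F308"),
    ("pirate flag", "\U0001F3F4\u200D\u2620\uFE0F"),
]

for label, sequence in samples:
    # len() counts code points, which is why a per-character loop splits these.
    print(label, sequence, "code points:", len(sequence))
    for code_point, name in describe(sequence):
        print("   ", code_point, name)
```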

Now, there are different ways you can deal with all this, but in your current approach you are iterating one code point at a time, so you will not be able to detect complex emojis. One option is to keep a big set of all interesting sequences (flags, some composite emojis, etc.) and look for them in the input. Another is to check whether the current character is a regional indicator symbol and, if so, try to read the next code point to form a flag (and settle for individual simple emojis for the rest). I do not know for sure which solution is best for your case (in terms of the complexity/benefit trade-off), but you should be aware of the nuances of emoji encoding and the pitfalls you may find.
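The second option (peek at the next code point when the current one is a regional indicator) could be sketched roughly as follows. The function names are hypothetical, the code only handles regional indicator pairs, and it assumes every adjacent pair is a valid flag.

```python
def is_regional_indicator(ch):
    """True for code points U+1F1E6 to U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def find_flag_pairs(text):
    """Collect pairs of adjacent regional indicators with their positions."""
    found = []
    pos = 0
    while pos < len(text):
        if (is_regional_indicator(text[pos])
                and pos + 1 < len(text)
                and is_regional_indicator(text[pos + 1])):
            found.append({"location": pos, "emoji": text[pos:pos + 2]})
            pos += 2  # skip the second half of the pair
        else:
            pos += 1
    return found


print(find_flag_pairs("👧🏿 مدیحہ🇵🇰  así, se 😌 ds 💕👭"))
```

The same loop could fall back to the single-code-point check from the question in the else branch, so simple emojis and flag pairs end up in one list.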

Answer 2:

I don't think there's a library anywhere to do this. However, it can more or less be done with a function:

\U0001F1E6\U0001F1E8 is the first Unicode flag and \U0001F1FF\U0001F1FC is the last, so a range check between them covers almost all of them. There are 3 more (the England, Scotland and Wales flags) that need to be handled separately.

Here's a function that checks whether the input is a flag:

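# Subdivision flags are tag sequences, not regional indicator pairs.
SUBDIVISION_FLAGS = [
    "\U0001F3F4\U000e0067\U000e0062\U000e0065\U000e006e\U000e0067\U000e007f",  # England (gbeng)
    "\U0001F3F4\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f",  # Scotland (gbsct)
    "\U0001F3F4\U000e0067\U000e0062\U000e0077\U000e006c\U000e0073\U000e007f",  # Wales (gbwls)
]

def is_flag_emoji(c):
    # Lexicographic string comparison against the first and last flag pairs.
    return "\U0001F1E6\U0001F1E8" <= c <= "\U0001F1FF\U0001F1FC" or c in SUBDIVISION_FLAGS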

Testing:

>>> is_flag_emoji('a')
False
>>> is_flag_emoji('😌')
False
>>> is_flag_emoji("""🇦🇮""")
True

So you could accordingly change your if statement to if c in emoji.UNICODE_EMOJI or is_flag_emoji(c):.

There is an issue with this, though: since a lot of flags are made by joining multiple characters, and the loop looks at one character at a time, you probably won't be able to identify the whole flag.

>>> s
'🇾🇪 here is more text 🇦🇩 and more'
>>> emoji_lis(s)
Matched!! 🇾 b'\\U0001f1fe'
Matched!! 🇪 b'\\U0001f1ea'
Matched!! 🇩 b'\\U0001f1e9'
[{'location': 0, 'emoji': '🇾'}, {'location': 1, 'emoji': '🇪'}, {'location': 22, 'emoji': '🇩'}]
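One possible way around this, sketched here as an assumption rather than a tested replacement for the function above, is to match whole sequences with a regular expression instead of testing one character at a time. FLAG_PATTERN and flag_lis are hypothetical names; the pattern accepts any two adjacent regional indicators and any black-flag tag sequence, without checking that they are assigned flags.

```python
import re

FLAG_PATTERN = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"                     # two regional indicators
    "|\U0001F3F4[\U000E0020-\U000E007E]+\U000E007F"  # black flag + tags + cancel tag
)


def flag_lis(string):
    """Return each matched flag sequence with the index of its first code point."""
    return [
        {"location": m.start(), "emoji": m.group()}
        for m in FLAG_PATTERN.finditer(string)
    ]


print(flag_lis("🇾🇪 here is more text 🇦🇩 and more"))
```

With this, each flag comes back as a single entry instead of being split into its halves.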

Source: https://stackoverflow.com/questions/49276977/how-to-extract-emojis-and-flags-from-strings-in-python

Tags

python

string

emoji

data-cleaning