Emoji value range

独厮守ぢ 2020-12-05 15:17

I am trying to strip all emoji characters from a string (like a sanitizer), but I cannot find a complete set of emoji code point values.

What is the complete set of emoji characters?

5 Answers
  • 2020-12-05 15:52

    Emoji ranges are updated with every new version of Unicode Emoji. The ranges below are correct for version 13.0.

    Here is my gist for an advanced version of this code.

    def is_contains_emoji(p_string_in_unicode):
        """
        Return True if any character of the text falls inside one of the
        Unicode emoji ranges below, False otherwise.

        Instead of looking up every character in an emoji dictionary, this only
        does range comparisons, which is much faster for a large text.
        It only tells whether the text contains an emoji; it does not return the emojis found.
        """
        # Inclusive (min, max) code point ranges. They are deliberately broad
        # and also cover some non-emoji symbols.
        emoji_ranges = (
            (ord(u'\U0001F300'), ord(u'\U0001FAD6')),  # 127744 to 129750
            (126980, 127569),
            (169, 174),
            (8205, 12953),
        )
        if not p_string_in_unicode:
            return False
        for a_char in p_string_in_unicode:
            char_code = ord(a_char)
            if any(low <= char_code <= high for low, high in emoji_ranges):
                return True
        return False
    
  • 2020-12-05 15:55

    The Unicode standard's Unicode® Technical Report #51 includes a list of emoji (emoji-data.txt):

    ...
    21A9 ;  text ;  L1 ;    none ;  j   # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
    21AA ;  text ;  L1 ;    none ;  j   # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
    231A ;  emoji ; L1 ;    none ;  j   # V1.1 (⌚) WATCH
    231B ;  emoji ; L1 ;    none ;  j   # V1.1 (⌛) HOURGLASS
    ...
    

    I believe you would want to remove each character listed in this document that has a Default_Emoji_Style of emoji.

    There is no way, other than reference to a definition list like this, to identify the emoji characters in Unicode. As the reference to the FAQ says, they are spread throughout different blocks.
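A minimal sketch of that approach, assuming the data file keeps the semicolon-separated layout of the excerpt above (code point first, default style second, comment after `#`). The names `emoji_code_points`, `strip_listed`, and `SAMPLE` are made up for illustration, and the sample holds only the four lines quoted here, not the full file.

```python
def emoji_code_points(lines):
    """Collect code points whose second field (default style) is 'emoji'."""
    result = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "emoji":
            continue  # skips "..." lines, blank lines, and 'text' entries
        try:
            result.add(int(fields[0], 16))
        except ValueError:
            # Entries that are not a single hex code point are ignored here.
            continue
    return result


def strip_listed(text, code_points):
    """Remove every character whose code point is in the given set."""
    return "".join(ch for ch in text if ord(ch) not in code_points)


SAMPLE = """\
...
21A9 ;  text ;  L1 ;    none ;  j   # V1.1 (↩) LEFTWARDS ARROW WITH HOOK
21AA ;  text ;  L1 ;    none ;  j   # V1.1 (↪) RIGHTWARDS ARROW WITH HOOK
231A ;  emoji ; L1 ;    none ;  j   # V1.1 (⌚) WATCH
231B ;  emoji ; L1 ;    none ;  j   # V1.1 (⌛) HOURGLASS
...
"""

emoji = emoji_code_points(SAMPLE.splitlines())
print(sorted(hex(cp) for cp in emoji))
print(strip_listed("a\u231ab\u21a9", emoji))
```

With the real file you would pass its lines instead of `SAMPLE`; newer releases of the data file may use a different column layout, so the field positions should be checked first.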

  • 2020-12-05 15:57

    I have composed a list based on Joe's and Doctor.Who's answers:

    U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0
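One possible way to use this list as a sanitizer is to turn it into a regular-expression character class. The sketch below assumes Python 3 and the standard `re` module; `build_emoji_pattern` and `remove_emoji` are hypothetical helper names, and the list is copied verbatim from above.

```python
import re

RANGE_LIST = (
    "U+00A9, U+00AE, U+203C, U+2049, U+20E3, U+2122, U+2139, U+2194-2199, "
    "U+21A9-21AA, U+231A, U+231B, U+2328, U+23CF, U+23E9-23F3, U+23F8-23FA, "
    "U+24C2, U+25AA, U+25AB, U+25B6, U+25C0, U+25FB-25FE, U+2600-27EF, "
    "U+2934, U+2935, U+2B00-2BFF, U+3030, U+303D, U+3297, U+3299, "
    "U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, "
    "U+1F910-1F96B, U+1F980-1F9E0"
)


def build_emoji_pattern(range_list):
    """Compile 'U+XXXX' / 'U+XXXX-YYYY' items into one character class."""
    parts = []
    for item in range_list.split(","):
        item = item.strip()
        if not item:
            continue
        bounds = item[2:].split("-")  # drop the leading "U+"
        low, high = int(bounds[0], 16), int(bounds[-1], 16)
        # \UXXXXXXXX escapes are understood by the re module itself.
        if low == high:
            parts.append("\\U{:08X}".format(low))
        else:
            parts.append("\\U{:08X}-\\U{:08X}".format(low, high))
    return re.compile("[" + "".join(parts) + "]+")


EMOJI_PATTERN = build_emoji_pattern(RANGE_LIST)


def remove_emoji(text):
    """Delete every run of characters that falls inside the listed ranges."""
    return EMOJI_PATTERN.sub("", text)


print(repr(remove_emoji("Hi \U0001F600 there \u00a9")))
```

The list is a snapshot: code points added in later emoji versions, such as U+1FAD6, fall outside these ranges and pass through untouched.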
    
  • 2020-12-05 16:05

    If you only deal with English characters and emoji, I think it is doable. First convert your string to UTF-16 code units; any code unit whose value is 0xD800 or greater (for emoji at U+1F300 and above, the high surrogate is actually >= 0xD83C) should then belong to an emoji.

    This is because "The Unicode standard permanently reserves the code point values between 0xD800 to 0xDFFF for UTF-16 encoding of the high and low surrogates", and of course English characters (and many other characters) do not fall in this range.

    But because emoji code points start from U+1F300, their UTF-16 code units actually fall in this range.

    Check here for a quick reference of emoji UTF-16 values if you would rather not compute them yourself.
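To make the surrogate idea concrete, here is a small sketch that encodes a string as UTF-16 and looks for code units in the reserved 0xD800 to 0xDFFF range. The helper names `utf16_code_units` and `has_surrogate` are assumptions for illustration only.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def has_surrogate(text):
    """True if any UTF-16 code unit lies in the surrogate range."""
    return any(0xD800 <= unit <= 0xDFFF for unit in utf16_code_units(text))


# U+1F300 is encoded as a high/low surrogate pair.
print([hex(unit) for unit in utf16_code_units("\U0001F300")])  # ['0xd83c', '0xdf00']
print(has_surrogate("hello"))           # False
print(has_surrogate("ok \U0001F600"))   # True
print(has_surrogate("\u00a9"))          # False
```

The last call shows the limit of this check: emoji below U+10000, such as U+00A9, are single code units and are not detected, while any non-emoji character above U+FFFF also produces surrogates.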

  • 2020-12-05 16:10
    unicode-range: U+0080-02AF, U+0300-03FF, U+0600-06FF, U+0C00-0C7F, U+1DC0-1DFF, U+1E00-1EFF, U+2000-209F, U+20D0-214F, U+2190-23FF, U+2460-25FF, U+2600-27EF, U+2900-29FF, U+2B00-2BFF, U+2C60-2C7F, U+2E00-2E7F, U+3000-303F, U+A490-A4CF, U+E000-F8FF, U+FE00-FE0F, U+FE30-FE4F, U+1F000-1F02F, U+1F0A0-1F0FF, U+1F100-1F64F, U+1F680-1F6FF, U+1F910-1F96B, U+1F980-1F9E0;
    