remove unicode emoji using re in python

北海茫月 2020-11-28 08:04

I tried to remove the emoji from the Unicode text of a tweet and print the result in Python 2.7, using:

myre = re.compile(u'[\u1F300-\u1F5FF\u1F600-\u1F64F\…


        
1 answer
  • 2020-11-28 09:05

    You are not using the correct notation for non-BMP Unicode code points; those need the \UXXXXXXXX escape, with a capital U and 8 hex digits (for example \U0001F300):

    myre = re.compile(u'['
        u'\U0001F300-\U0001F5FF'
        u'\U0001F600-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+', 
        re.UNICODE)
    

    This can be reduced to:

    myre = re.compile(u'['
        u'\U0001F300-\U0001F64F'
        u'\U0001F680-\U0001F6FF'
        u'\u2600-\u26FF\u2700-\u27BF]+', 
        re.UNICODE)
    

    as your first two ranges are adjacent.

    Your version was specifying (with added spaces for readability):

    [\u1F30 0-\u1F5F F\u1F60 0-\u1F64 F\u1F68 0-\u1F6F F \u2600-\u26FF\u2700-\u27BF]+
    

    That's because the \uxxxx escape sequence always takes only 4 hex digits, not 5.

    The largest of those ranges is 0-\u1F6F (so from the digit 0 through to Ὧ), which covers a very large swathe of the Unicode standard, including every ASCII letter and digit.
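
The effect of the 4-digit rule can be checked directly. The following is a minimal sketch, assuming Python 3 or a wide Python 2 build; the names `four_digit`, `eight_digit` and `broken` are only illustrative, and the pattern is the one from the question as spelled out above.

```python
from __future__ import print_function
import re

four_digit = u'\u1F300'      # parsed as U+1F30 followed by the literal digit '0'
eight_digit = u'\U0001F300'  # one code point, U+1F300

print(len(four_digit))                  # 2
print(four_digit == u'\u1F30' + u'0')   # True
print(len(eight_digit))                 # 1 on Python 3 or a wide build

# The character class from the question, as Python actually reads it.
broken = re.compile(
    u'[\u1F300-\u1F5FF\u1F600-\u1F64F\u1F680-\u1F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

# Letters and digits fall inside the accidental range 0-\u1F6F, so ordinary
# text is stripped; only the two spaces (which sort below '0') survive.
print(repr(broken.sub(u'', u'Hello World 2020')))
```

So the broken pattern does not merely miss emoji; it deletes most ordinary text as well.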

    The corrected expression works, provided you use a wide (UCS-4) Python build; Python 3.3 and later always behave like a wide build in this respect:

    >>> import re
    >>> myre = re.compile(u'['
    ...     u'\U0001F300-\U0001F64F'
    ...     u'\U0001F680-\U0001F6FF'
    ...     u'\u2600-\u26FF\u2700-\u27BF]+', 
    ...     re.UNICODE)
    >>> myre.sub('', u'Some example text with a sleepy face: \U0001f62a')
    u'Some example text with a sleepy face: '
    

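    On a narrow (UCS-2) build, non-BMP code points are stored as surrogate pairs, so the equivalent pattern has to match those pairs explicitly: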

    myre = re.compile(u'('
        u'\ud83c[\udf00-\udfff]|'
        u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
        u'[\u2600-\u26FF\u2700-\u27BF])+', 
        re.UNICODE)
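
The surrogate values in that pattern follow from the standard UTF-16 encoding rule. As a rough sketch (the helper name `to_surrogate_pair` is made up for illustration), the endpoints of the wide ranges can be converted like this:

```python
from __future__ import print_function


def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a supplementary-plane code point')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# Endpoints of the wide ranges, plus the point where the high surrogate changes.
for cp in (0x1F300, 0x1F3FF, 0x1F400, 0x1F64F, 0x1F680, 0x1F6FF):
    high, low = to_surrogate_pair(cp)
    print('U+%05X -> \\u%04x \\u%04x' % (cp, high, low))
```

U+1F300 through U+1F3FF share the high surrogate \ud83c, and U+1F400 onwards use \ud83d, which is why the narrow pattern needs two alternatives to cover what the wide pattern expresses as plain ranges.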
    

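    You can combine the two in your script with an exception handler; on a narrow build the wide pattern turns into surrogate pairs that form an invalid character range, so compiling it raises re.error: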

    try:
        # Wide UCS-4 build
        myre = re.compile(u'['
            u'\U0001F300-\U0001F64F'
            u'\U0001F680-\U0001F6FF'
            u'\u2600-\u26FF\u2700-\u27BF]+', 
            re.UNICODE)
    except re.error:
        # Narrow UCS-2 build
        myre = re.compile(u'('
            u'\ud83c[\udf00-\udfff]|'
            u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
            u'[\u2600-\u26FF\u2700-\u27BF])+', 
            re.UNICODE)
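
An alternative to relying on the exception is to test the build explicitly. This is only a sketch of that variation, not part of the approach above: it assumes `sys.maxunicode` is greater than 0xFFFF on wide builds (and on Python 3.3+), and the names `EMOJI_PATTERN`, `EMOJI_RE` and `strip_emoji` are hypothetical.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build, or Python 3.3+: non-BMP code points are single characters.
    EMOJI_PATTERN = (u'['
                     u'\U0001F300-\U0001F64F'
                     u'\U0001F680-\U0001F6FF'
                     u'\u2600-\u26FF\u2700-\u27BF]+')
else:
    # Narrow build: match the surrogate pairs explicitly.
    EMOJI_PATTERN = (u'(?:'
                     u'\ud83c[\udf00-\udfff]|'
                     u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                     u'[\u2600-\u26FF\u2700-\u27BF])+')

EMOJI_RE = re.compile(EMOJI_PATTERN, re.UNICODE)


def strip_emoji(text):
    """Remove every run of characters matched by EMOJI_RE."""
    return EMOJI_RE.sub(u'', text)


cleaned = strip_emoji(u'Some example text with a sleepy face: \U0001f62a')
```

Only the pattern for the running build is ever compiled, so neither branch depends on an error being raised.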
    

    Of course, the regex is already out of date, as it doesn't cover emoji defined in newer Unicode releases; it appears to cover emoji defined up to Unicode 8.0 (U+1F91D HANDSHAKE, for example, was added in Unicode 9.0 and is not matched).
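
The gap is easy to see by running the pattern over one emoji inside its ranges and one outside. A small sketch, assuming Python 3 or a wide Python 2 build (the variable names `sleepy` and `handshake` are illustrative):

```python
from __future__ import print_function
import re

myre = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+',
    re.UNICODE)

sleepy = u'\U0001F62A'     # inside U+1F300-U+1F64F, so it is removed
handshake = u'\U0001F91D'  # U+1F91D HANDSHAKE, outside every range above

print(myre.sub(u'', u'a' + sleepy + u'b') == u'ab')                       # True
print(myre.sub(u'', u'a' + handshake + u'b') == u'a' + handshake + u'b')  # True
```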

    If you need a more up-to-date regex, take one from a package that actively tracks new emoji, such as the emoji package used here, which specifically supports generating such a regex:

    import emoji
    
    def remove_emoji(text):
        return emoji.get_emoji_regexp().sub(u'', text)
    

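    At the time of writing, the package is up to date for Unicode 11.0 and has the infrastructure in place to adopt future releases quickly. All your project has to do is upgrade when a new release comes out. Later major versions of the package appear to have dropped get_emoji_regexp() in favour of emoji.replace_emoji(text, replace=''), so check the API of the version you have installed.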
