Match unicode emoji in python regex

时光说笑 2021-01-18 02:49

I need to extract the text between a number and an emoticon in a string.

Example text:

blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2          


        
3 answers
  • 2021-01-18 03:15

    This may or may not work, depending on your needs. If you know the emoji ahead of time, though, it will probably work; you just need a list of the emoticons to expect.

    Anyway, without more information, this is what I'd do.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import re
    
    my_regex = re.compile(r'\d\s*([^☺️|^                                                                    
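
    The idea of matching against a known list can be sketched as follows. This is only a minimal illustration, assuming the emoji to expect are known in advance: `KNOWN_EMOJI`, `PATTERN` and `extract_between` are hypothetical names that do not come from the answer, and the sample string is rebuilt from the question using U+1F645 as an assumed second symbol.

```python
import re

# Hypothetical list of emoji expected in the input; extend it as needed.
KNOWN_EMOJI = ["\u263a\ufe0f", "\u263a", "\U0001f645"]

# Build an alternation of the escaped emoji, longest first, so that the
# two-code-point form (U+263A U+FE0F) is tried before the bare U+263A.
_emoji_alternation = "|".join(
    re.escape(e) for e in sorted(KNOWN_EMOJI, key=len, reverse=True)
)

# A digit, optional whitespace, then the shortest run of text that is
# followed by optional whitespace and one of the known emoji.
PATTERN = re.compile(r"\d\s*(.*?)\s*(?:" + _emoji_alternation + ")")


def extract_between(text):
    """Return the pieces of text found between a digit and the next known emoji."""
    return PATTERN.findall(text)


# Assumed sample input (escaped form of the question's example text).
sample = "blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv"
print(extract_between(sample))
```

    Any emoji missing from the list is simply not treated as a delimiter, which is the main limitation of this approach.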
  • 2021-01-18 03:30

    Since there are many emoji with different Unicode values, you have to specify them explicitly in your regex, or, if they fall within a specific range, you can use a character class. In this case your second symbol is not a standard emoji, it's just a Unicode character, but since its code point is greater than \u263a (the Unicode representation of ☺️) you can put it in a range with \u263a:

    In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2                                                                     
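
    A possible shape for such a range-based pattern is sketched below. It is not the answer's original code: it assumes the second symbol is U+1F645 (the value that appears in the escaped output in the next answer), and the name `between_number_and_symbol` is made up for illustration.

```python
import re

# Assumption: the second symbol is U+1F645; the sample string is written
# with escapes so that the code points are unambiguous.
s = "blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv"

# One character class spanning U+263A through U+1F645. The re module
# understands \uXXXX and \UXXXXXXXX escapes in str patterns (Python 3.3+).
# Note that this range is very wide and also covers many non-emoji
# characters, so it only suits input without other text in that range.
between_number_and_symbol = re.compile(r"\d+(.*?)[\u263a-\U0001f645]")

matches = between_number_and_symbol.findall(s)
print(matches)
```

    Surrounding whitespace is kept in the captured groups; call `str.strip()` on each match if it is not wanted.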
  • 2021-01-18 03:31

    Here's my stab at a solution. I'm not sure it will work in all circumstances. The trick is to convert all Unicode emoji into plain escaped text, which can be done by following this post. Then you can match the emoji just like any normal text. Note that it won't work if the literal string \u or \U appears in the text you are searching.

    Example: copy your string into a file, let's call it emo. In a terminal:

    Chip chip@ 03:24:33@ ~: cat emo | python stackoverflow.py
    blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv\n
    ------------------------
    [' extract1  ', ' extract2 ']
    

    where the stackoverflow.py file is:

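    # Python 3 version; assumes the input is decoded as UTF-8.
    import fileinput
    import re

    # Read stdin (or any files named on the command line) line by line.
    for line in fileinput.input():
        # Escape non-ASCII characters, e.g. U+263A becomes the text \u263a.
        teststring = line.encode('unicode-escape').decode('ascii')
        print(teststring)
        print("------------------------")
        # Capture the text between "<whitespace><digit>" and the next \u or \U escape.
        m = re.findall(r'(?<=\s\d)(.*?)(?=\\[uU])', teststring)
        print(m)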
    