I need to extract the text between a number and an emoticon in a string.
Example text:
blah xzuyguhbc ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2
This may or may not work, depending on your needs. If you know the emoji ahead of time, though, it will probably work; you just need a list of the emoticons to expect.
Anyway, without more information, this is what I'd do:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
my_regex = re.compile(r'\d\s*([^☺️|^
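To illustrate the "known list of emoji" idea, here is a minimal Python 3 sketch. It is not the snippet above completed; the names KNOWN_EMOJI and extract_before_known_emoji are made up for this example, and the list holds only the two symbols that show up in this thread, so you would extend it with whatever emoji you actually expect.

```python
import re

# Hypothetical list of the emoji expected in the input (an assumption;
# extend it to cover your own data).
KNOWN_EMOJI = ["\u263a\ufe0f", "\u263a", "\U0001f645"]

# Put longer sequences first so that U+263A followed by the variation
# selector U+FE0F is tried before the bare U+263A.
_alternatives = "|".join(
    re.escape(e) for e in sorted(KNOWN_EMOJI, key=len, reverse=True)
)

# A digit, optional whitespace, then the shortest run of text that is
# followed (after optional whitespace) by one of the known emoji.
PATTERN = re.compile(r"\d\s*(.*?)\s*(?:" + _alternatives + ")")


def extract_before_known_emoji(text):
    """Return every piece of text found between a digit and a known emoji."""
    return PATTERN.findall(text)
```

Anything that is not in the list is treated as ordinary text, so an unexpected emoji will simply end up inside the captured group.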
Since there are many emoji with different Unicode values, you have to specify them explicitly in your regex, or, if they fall within a specific range, you can use a character class. In this case your second symbol is not a standard emoji, it's just a Unicode character, but since its code point is greater than \u263a (the Unicode representation of ☺️), you can put it in a range with \u263a:
In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1 ☺️ jbjhcb 6 extract2
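As a rough Python 3 sketch of the character-class idea (not a reconstruction of the session above): the upper bound \U0001f645 is an assumption, taken from the escaped sample that appears in another answer in this thread, and extract_with_range is a name invented for this example.

```python
import re

# Character class spanning U+263A up to U+1F645. The upper bound is an
# assumption; pick the highest code point you expect in your own data.
# Python 3's re module understands \uXXXX and \UXXXXXXXX in str patterns.
RANGE_PATTERN = re.compile(
    r"\d\s*([^\u263a-\U0001f645]+?)\s*[\u263a-\U0001f645]"
)


def extract_with_range(text):
    """Return the text between a digit and the next character in the range."""
    return RANGE_PATTERN.findall(text)
```

Keep in mind that a range this wide also covers many characters that are not emoji (CJK ideographs, for instance), so it is only safe when the surrounding text stays below U+263A.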
Here's my stab at a solution. I'm not sure it will work in all circumstances. The trick is to convert all Unicode emoji into normal text. This could be done by following this post. Then you can match the emoji just like any normal text. Note that it won't work if the literal strings \u or \U appear in the text you are searching.
Example: copy your string into a file, let's call it emo.
In terminal:
Chip chip@ 03:24:33@ ~: cat emo | python stackoverflow.py
blah xzuyguhbc ibcbb bqw 2 extract1 \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv\n
------------------------
[' extract1 ', ' extract2 ']
Where the stackoverflow.py file is:
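# Python 2 script: reads lines from stdin (or from files named on the command line)
import fileinput
import re

for line in fileinput.input():
    # Decode the raw bytes, then turn every non-ASCII character into a \uXXXX or \UXXXXXXXX escape
    teststring = unicode(line, 'utf-8')
    teststring = teststring.encode('unicode-escape')
    print teststring
    print "------------------------"
    # Capture whatever sits between "<whitespace><digit>" and the next \u or \U escape
    m = re.findall(r'(?<=\s\d)(.*?)(?=\\[uU])', teststring)
    print m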
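The script above is Python 2. For Python 3, the same trick might look roughly like the sketch below. The function name extract_via_unicode_escape is made up for this example, and it takes a str directly instead of reading from fileinput.

```python
import re


def extract_via_unicode_escape(text):
    """Escape non-ASCII characters, then match the escapes as plain text."""
    # In Python 3, str.encode('unicode-escape') returns bytes, so decode it
    # back to str. The result is pure ASCII, with each emoji spelled out as
    # a \uXXXX or \UXXXXXXXX sequence.
    escaped = text.encode("unicode-escape").decode("ascii")
    # Capture whatever sits between "<whitespace><digit>" and the next
    # literal \u or \U.
    return re.findall(r"(?<=\s\d)(.*?)(?=\\[uU])", escaped)
```

As in the original, the surrounding spaces are kept in each match, so call .strip() on the results if you don't want them.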