Capturing emoticons using regular expressions in Python


I would like to have a regex pattern to match the smileys ":)" and ":(". It should also capture repeated smileys like ":) :)" and ":) :(", but filter out invalid syntax like \

4 answers

栀梦

2020-12-09 22:30

Try (?::|;|=)(?:-)?(?:\)|\(|D|P). I haven't tested it extensively, but it does seem to match the right ones and nothing more...

In [15]: import re

In [16]: s = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"

In [17]: re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',s)
Out[17]: [':)', '=)', ':(', ':-(', ':)', ':)', ':(', ':(', ':P', ';)']
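
The same alternation can probably be written more compactly with character classes. The sketch below is an assumed equivalent rewrite, not the pattern from the answer itself; the name `EMOTICON` is made up for illustration, and the test string is the one from the session above.

```python
import re

# Assumed equivalent of (?::|;|=)(?:-)?(?:\)|\(|D|P):
# eyes [:;=], an optional nose "-", then a mouth [()DP].
EMOTICON = re.compile(r"[:;=]-?[()DP]")

s = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"
matches = EMOTICON.findall(s)
print(matches)
```

Single-character alternatives such as `:|;|=` and a character class `[:;=]` should behave the same way here; the class form is just shorter to read.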


半阙折子戏

2020-12-09 22:39
Maybe something like:
```
re.match('[:;][)(](?![)(])', str)
```
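
To illustrate what the negative lookahead `(?![)(])` does, here is a small sketch. It uses `findall` instead of `re.match` (which only tests at the start of the string) so that repeated smileys are scanned too; the sample inputs are my own.

```python
import re

# Eyes, mouth, then a negative lookahead: the next character
# must not be another "(" or ")".
pattern = re.compile(r"[:;][)(](?![)(])")

single = pattern.findall(":)")
repeated = pattern.findall(":) :(")
doubled_mouth = pattern.findall(":))")   # rejected: ":)" is followed by ")"
print(single, repeated, doubled_mouth)
```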
别那么骄傲

2020-12-09 22:40
I got the answer I was looking for from the comments and answers posted here.
```
re.match("^(:[)(])*$",str)
```
Thanks to all.
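
One detail that may be worth checking: because of the `*`, this pattern also accepts the empty string, and it rejects smileys separated by spaces. Below is a rough sketch of a variant, assuming that empty input should fail and that whitespace between smileys is acceptable. Both are my assumptions, based on the ":) :)" example in the question; `spaced` and `is_match` are hypothetical names.

```python
import re

accepted = re.compile(r"^(:[)(])*$")        # pattern from the answer above
# Hypothetical variant: at least one smiley, optional whitespace between smileys.
spaced = re.compile(r"^:[)(](\s*:[)(])*$")

def is_match(pattern, text):
    """Return True if the pattern matches the text from its start."""
    return pattern.match(text) is not None

print(is_match(accepted, ""))        # the empty string passes because of "*"
print(is_match(accepted, ":) :("))
print(is_match(spaced, ""))
print(is_match(spaced, ":) :("))
print(is_match(spaced, ":) )"))
```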
一个人的身影

2020-12-09 22:46
I think it finally "clicked" exactly what you're asking about here. Take a look at the below:
```
import re

smiley_pattern = r'^(:\(|:\))+$' # matches only runs of the smileys ":)" and ":("

def test_match(s):
    # Report whether the whole string consists of one or more smileys
    print('Value: %s; Result: %s' % (
        s,
        'Matches!' if re.match(smiley_pattern, s) else 'Doesn\'t match.'
    ))

should_match = [
    ':)',   # Single smile
    ':(',   # Single frown
    ':):)', # Two smiles
    ':(:(', # Two frowns
    ':):(', # Mix of a smile and a frown
]
should_not_match = [
    '',         # Empty string
    ':(foo',    # Extraneous characters appended
    'foo:(',    # Extraneous characters prepended
    ':( :(',    # Space between frowns
    ':( (',     # Extraneous characters and space appended
    ':(('       # Extraneous duplicate of final character appended
]

print('The following should all match:')
for x in should_match:
    test_match(x)

print('')   # Newline for output clarity

print('The following should all not match:')
for x in should_not_match:
    test_match(x)
```
The problem with your original code is that your regex is wrong: (:\(). Let's break it down.

The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and are used to apply regex operators on groups of characters at once. So, you're really saying:
- ( begin a group
  - :\( ... do regex stuff ...
- ) end the group
The : isn't a regex reserved character, so it's just a colon. The \ is, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says
- ( begin a group
  - : a colon character
  - \( a left parenthesis character
- ) end the group
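
To see why that pattern is too permissive, here is a small check (my own illustration, not code from the question). `re.match` only anchors at the start of the string, so anything may follow the frown, and ":)" is never matched at all.

```python
import re

original = r"(:\()"   # a group containing a colon and a literal "("

frown = re.match(original, ":(")
trailing = re.match(original, ":(foo")   # the match need not reach the end of the string
smile = re.match(original, ":)")
print(bool(frown), bool(trailing), bool(smile))
```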
The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.

^ and $ mean "the beginning of the line" and "the end of the line" respectively. Now we have ...
- ^ beginning of line
  - (:\(|:\))+ ... do regex stuff ...
- $ end of line
... so it only matches things that comprise the entire line, not simply occur in the middle of the string.
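
A quick way to see what the anchors add, sketched with `re.search` on a string that has text around the smiley (the sample string and variable names are made up):

```python
import re

unanchored = re.compile(r"(:\(|:\))+")
anchored = re.compile(r"^(:\(|:\))+$")

text = "foo:(bar"
found_inside = unanchored.search(text)   # finds the frown in the middle
found_whole = anchored.search(text)      # fails: the string is not only smileys
print(found_inside.group(0) if found_inside else None)
print(found_whole)
```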

We know that ( and ) denote a grouping. + means "one or more of these". Now we have:
- ^ beginning of line
- ( start a group
  - :\(|:\) ... do regex stuff ...
- ) end the group
- + match one or more of this
- $ end of line
Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:
- ^ beginning of line
- ( start a group
  - : a colon character
  - \( a left parenthesis character
- | or
  - : a colon character
  - \) a right parenthesis character
- ) end the group
- + match one or more of this
- $ end of line
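
The breakdown above can also be written directly into the pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch of the same regex in a different layout; `SMILEY_RUN` and the `results` dictionary are names I made up.

```python
import re

# Same pattern as ^(:\(|:\))+$, spelled out piece by piece.
SMILEY_RUN = re.compile(r"""
    ^           # beginning of line
    (           # start a group
        :\(     #   a colon and a left parenthesis
      |         # or
        :\)     #   a colon and a right parenthesis
    )+          # end the group; match one or more of it
    $           # end of line
""", re.VERBOSE)

results = {}
for candidate in [":)", ":):(", "", ":( :(", ":(("]:
    results[candidate] = bool(SMILEY_RUN.match(candidate))
    print(repr(candidate), results[candidate])
```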
I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.