Capturing emoticons using regular expressions in Python

忘了有多久 2020-12-09 21:58

I would like a regex pattern that matches the smileys ":)" and ":(". It should also capture repeated smileys like ":) :)" and ":) :(", but filter out invalid syntax like \

4 Answers
  • 2020-12-09 22:30

    Try (?::|;|=)(?:-)?(?:\)|\(|D|P). I haven't tested it extensively, but it does seem to match the right ones and nothing more:

    In [15]: import re
    
    In [16]: s = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"
    
    In [17]: re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',s)
    Out[17]: [':)', '=)', ':(', ':-(', ':)', ':)', ':(', ':(', ':P', ';)']
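
    The same idea as a standalone script, in case the IPython session is hard to reuse. This is only a sketch: the names EMOTICON and find_emoticons are made up for illustration, and the pattern is rewritten with character classes, which accept the same alternatives as the answer's non-capturing groups.

```python
import re

# Same alternatives as the answer's pattern, written with character classes:
# eyes (: ; =), an optional nose (-), then a mouth ( ) ( D P ).
EMOTICON = re.compile(r'[:;=]-?[)(DP]')

def find_emoticons(text):
    """Return every emoticon found in text, in order of appearance."""
    return EMOTICON.findall(text)

sample = "Just: to :)) =) test :(:-(( ():: :):) :(:( :P ;)!"
print(find_emoticons(sample))
```

    Note that ":))" still yields ":)", so this pattern finds emoticons inside longer runs rather than rejecting them.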
    
  • 2020-12-09 22:39

    Maybe something like:

    re.match(r'[:;][)(](?![)(])', s)  # s is the input string
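
    To see what the negative lookahead (?![)(]) does, here is a small sketch. It uses findall instead of match so that smileys anywhere in the text are reported; that swap, and the names STRICT_SMILEY and find_strict_smileys, are assumptions for illustration and not part of the suggestion above.

```python
import re

# Eyes, a mouth, and then NOT another mouth character.
STRICT_SMILEY = re.compile(r'[:;][)(](?![)(])')

def find_strict_smileys(text):
    """Return smileys that are not followed by an extra parenthesis."""
    return STRICT_SMILEY.findall(text)

print(find_strict_smileys(':) :)) ;( :(('))

# re.match only tries the start of the string, so this finds nothing.
print(re.match(STRICT_SMILEY, 'x :)'))
```

    The lookahead rejects ":))" and ":((" outright instead of matching their first two characters.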
    
  • 2020-12-09 22:40

    I got the answer I was looking for from the comments and answers posted here.

    re.match(r"^(:[)(])*$", s)  # s is the input string
    

    Thanks to all.
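
    A quick check of how this pattern behaves on a few inputs may be useful. The helper name is_only_smileys and the sample strings are hypothetical; the pattern is the one from this answer.

```python
import re

ONLY_SMILEYS = re.compile(r'^(:[)(])*$')

def is_only_smileys(text):
    """True if text consists of zero or more ':)' / ':(' and nothing else."""
    return ONLY_SMILEYS.match(text) is not None

for sample in ['', ':)', ':):(', ':) :)', ':))']:
    print(repr(sample), is_only_smileys(sample))
```

    Because of the *, the empty string is accepted as well; replacing * with + would require at least one smiley. Smileys separated by a space are rejected.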

  • 2020-12-09 22:46

    I think it finally "clicked" exactly what you're asking about here. Take a look at the below:

    import re
    
    smiley_pattern = r'^(:\(|:\))+$' # matches only the smileys ":)" and ":("
    
    def test_match(s):
        # Report whether s consists solely of ":)" / ":(" smileys.
        print('Value: %s; Result: %s' % (
            s,
            'Matches!' if re.match(smiley_pattern, s) else 'Doesn\'t match.'
        ))
    
    should_match = [
        ':)',   # Single smile
        ':(',   # Single frown
        ':):)', # Two smiles
        ':(:(', # Two frowns
        ':):(', # Mix of a smile and a frown
    ]
    should_not_match = [
        '',         # Empty string
        ':(foo',    # Extraneous characters appended
        'foo:(',    # Extraneous characters prepended
        ':( :(',    # Space between frowns
        ':( (',     # Extraneous characters and space appended
        ':(('       # Extraneous duplicate of final character appended
    ]
    
    print('The following should all match:')
    for x in should_match:
        test_match(x)
    
    print('')   # Newline for output clarity
    
    print('None of the following should match:')
    for x in should_not_match:
        test_match(x)
    

    The problem with your original code is that your regex is wrong: (:\(). Let's break it down.

    The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and are used to apply regex operators on groups of characters at once. So, you're really saying:

    • ( begin a group
      • :\( ... do regex stuff ...
    • ) end the group

    The : isn't a regex reserved character, so it's just a colon. The \ is reserved, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says:

    • ( begin a group
      • : a colon character
      • \( a left parenthesis character
    • ) end the group
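
    The breakdown above can be checked directly. This is a minimal sketch; the sample text 'sad :( face' and the variable names are made up for illustration.

```python
import re

# The escaped form: a group containing a colon and a literal "(".
match = re.search(r'(:\()', 'sad :( face')
print(match.group(1))

# re.escape builds an escaped pattern from literal text.
escaped = re.escape(':(')
print(re.search(escaped, 'sad :( face').group(0))

# Without the backslash, the second "(" opens a group that is never closed,
# so the pattern does not even compile.
try:
    re.compile('(:()')
    unescaped_compiles = True
except re.error:
    unescaped_compiles = False
print(unescaped_compiles)
```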

    The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$.

    ^ and $ mean "the beginning of the line" and "the end of the line" respectively (in Python, unless re.MULTILINE is set, that is the beginning and end of the whole string). Now we have ...

    • ^ beginning of line
      • (:\(|:\))+ ... do regex stuff ...
    • $ end of line

    ... so it only matches things that comprise the entire line, not simply occur in the middle of the string.
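
    A short sketch of the difference the anchors make. The names unanchored and anchored and the sample strings are assumptions for illustration.

```python
import re

unanchored = re.compile(r'(:\(|:\))+')
anchored = re.compile(r'^(:\(|:\))+$')

# Without anchors, a smiley in the middle of other text is found.
found_inside = bool(unanchored.search('foo:(bar'))
# With anchors, the same text is rejected.
anchored_inside = bool(anchored.search('foo:(bar'))
# A string made up only of smileys passes.
anchored_whole = bool(anchored.match(':):('))
# $ also matches just before a trailing newline ...
anchored_newline = bool(anchored.match(':)\n'))
# ... whereas fullmatch insists on consuming every character.
full_newline = bool(unanchored.fullmatch(':)\n'))

print(found_inside, anchored_inside, anchored_whole,
      anchored_newline, full_newline)
```

    If a trailing newline must be rejected too, re.fullmatch on the unanchored pattern is one option.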

    We know that ( and ) denote a grouping. + means "one or more of these". Now we have:

    • ^ beginning of line
    • ( start a group
      • :\(|:\) ... do regex stuff ...
    • ) end the group
    • + match one or more of this
    • $ end of line

    Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:

    • ^ beginning of line
    • ( start a group
      • : a colon character
      • \( a left parenthesis character
    • | or
      • : a colon character
      • \) a right parenthesis character
    • ) end the group
    • + match one or more of this
    • $ end of line
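
    The breakdown above can be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a sketch of the same regex, not a different one; the name SMILEY_LINE and the sample strings are made up for illustration.

```python
import re

SMILEY_LINE = re.compile(r"""
    ^           # beginning of line
    (           # start a group
        : \(    #   a colon, then a literal left parenthesis
      |         #   or
        : \)    #   a colon, then a literal right parenthesis
    )+          # end the group; match one or more of it
    $           # end of line
""", re.VERBOSE)

for sample in [':)', ':):(', ':( :(', ':((', '']:
    print(repr(sample), bool(SMILEY_LINE.match(sample)))
```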

    I hope this helps. If not, let me know and I'll be happy to edit my answer with a reply.
