Regex matching emoticons

We are working on a project where we want users to be able to use both emoji syntax (like :smile:, :heart:, :confused:, :stuck_out_tongue:) as well as normal emoticons (like :), <3, :/, :p).

I'm having trouble with the emoticon syntax because sometimes those character sequences will occur in:

normal strings or URLs - http://example.com
within the emoji syntax - :pencil:

How can I find these emoticon character sequences but not when other characters are near them?

The entire regex I'm using for all the emoticons is huge, so here's a trimmed-down version:

(\:\)|\:\(|<3|\:\/|\:-\/|\:\||\:p)

You can play with a demo of it in action here: http://regexr.com/3a8o5

Match emoji first (to take care of the :pencil: example) and then check for a terminating whitespace or newline:

(\:\w+\:|\<[\/\\]?3|[\(\)\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)

This regex matches the following (preferring emoji), returning the match in capture group 1:

:( :) :P :p :O :3 :| :/ :\ :$ :* :@
:-( :-) :-P :-p :-O :-3 :-| :-/ :-\ :-$ :-* :-@
:^( :^) :^P :^p :^O :^3 :^| :^/ :^\ :^$ :^* :^@
): (: $: *:
)-: (-: $-: *-:
)^: (^: $^: *^:
<3 </3 <\3
:smile: :hug: :pencil:

It also supports terminal punctuation as a delimiter in addition to white space.

You can see more details and test it here: https://regex101.com/r/aM3cU7/4
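
As a rough sketch of the same idea (emoji alternative first, then a look-ahead for whitespace, terminal punctuation, or end of input), here is a much smaller pattern run over a sample string. The pattern is a cut-down stand-in for the full regex above, and the names TOKEN and findTokens are made up for this illustration:

```javascript
// Cut-down illustration, not the full regex from the answer.
// The emoji alternative (:\w+:) is listed first so a token such as
// ":pencil:" is consumed as a whole.
const TOKEN = /(:\w+:|<3|:-?[()\/|pP])(?=\s|[!.?]|$)/g;

// Hypothetical helper: returns capture group 1 of every match.
function findTokens(text) {
  return Array.from(text.matchAll(TOKEN), (m) => m[1]);
}

console.log(findTokens("Nice :pencil: work :) see http://example.com <3!"));
// → [ ':pencil:', ':)', '<3' ]
```

The ":/" inside the URL is skipped because the character after it is another "/", which the look-ahead rejects.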

I assume these emoticons will commonly be used with spaces before and after. Then \s might be what you're looking for, as it matches any whitespace character.

Then your regex would become

\s+(\:\)|\:\(|<3|\:\/|\:-\/|\:\||\:p)\s
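
A quick way to see how this behaves, assuming the trimmed-down emoticon list from the question (SPACED and spacedMatches are names invented for this sketch):

```javascript
// The whitespace-delimited pattern from above, with the global flag added.
const SPACED = /\s+(:\)|:\(|<3|:\/|:-\/|:\||:p)\s/g;

// Hypothetical helper: returns the captured emoticon of every match.
const spacedMatches = (text) => Array.from(text.matchAll(SPACED), (m) => m[1]);

console.log(spacedMatches("so :/ it works"));         // → [ ':/' ]
console.log(spacedMatches("see http://example.com")); // → []
console.log(spacedMatches(":) at the start"));        // → [] (no leading whitespace)
console.log(spacedMatches("a :) :( b"));              // → [ ':)' ] (the shared space is consumed)
```

Because the surrounding whitespace is part of the match, an emoticon at the very start or end of the text, or one sharing a single space with the previous emoticon, is not found. That may or may not matter for your input.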

Make a positive look-ahead for a space

([\:\<]-?[)(|\\/pP3D])(?:(?=\s))
 |       |      |         |
 |       |      |         |
 |       |      |         |-> match last separating space
 |       |      |-> match last part of the emot
 |       |-> it may have a `-` or not 
 |-> first part of the emoticon

Since you're using JavaScript, which historically lacked look-behind (look-ahead is supported), you can also avoid look-arounds entirely by capturing the delimiter:

/([\:\<]-?[)|\\/pP3D])(\s|$)/g.exec('hi :) ;D');

And then just drop the last entry of the resulting array, e.g. with splice() (it is the captured delimiter, most probably a space).
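
A minimal sketch of that loop, assuming you want every emoticon in the string rather than just the first. The function name collectEmoticons is made up here, and the "(" is restored in the second character class so it matches the look-ahead version above:

```javascript
// Hypothetical helper: collects every emoticon, ignoring the captured delimiter.
function collectEmoticons(text) {
  // Group 1 is the emoticon; group 2 is the trailing whitespace (or end of input).
  const pattern = /([:<]-?[)(|\\\/pP3D])(\s|$)/g;
  const found = [];
  let match;
  // With the g flag, each exec() call resumes after the previous match.
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[1]); // keep the emoticon, discard the delimiter
  }
  return found;
}

console.log(collectEmoticons("hi :) <3 http://x :D"));
// → [ ':)', '<3', ':D' ]
```

Taking match[1] directly has the same effect as removing the last array entry, since the delimiter lives in group 2.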

You want regex look-arounds regarding spacing. Another answer here suggested a positive look-ahead, though I'd go double-negative:

(?<!\S)(\:\)|\:\(|<3|\:\/|\:-\/|\:\||\:p)(?!\S)

JavaScript engines older than ES2018 don't support (?<!pattern), but look-behind can be mimicked:

test_string.replace(/(\S)?(\:\)|\:\(|<3|\:\/|\:-\/|\:\||\:p)(?!\S)/,
                    function($0, $1) { return $1 ? $0 : replacement_text; });

All I did was prefix your code with (?<!\S) in front and suffix it with (?!\S) in back. The prefix ensures you do not follow a non-whitespace character, so the only valid leading entries are spaces or nothing (start of line). The suffix does the same thing, ensuring you are not followed by a non-whitespace character. See also this more thorough regex walk-through.
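
Here is the mimicked look-behind wrapped into a small function, as a sketch only. The name replaceEmoticons and the "[e]" placeholder are invented for the example, and the g flag is added on the assumption that you want every occurrence replaced:

```javascript
// Trimmed-down emoticon alternatives from the question.
const EMOTICONS = String.raw`:\)|:\(|<3|:\/|:-\/|:\||:p`;

// Hypothetical helper: replaces emoticons that stand alone between whitespace.
function replaceEmoticons(text, replacement) {
  // (\S)? captures a preceding non-whitespace character if one exists;
  // (?!\S) rejects emoticons followed by a non-whitespace character.
  const pattern = new RegExp(`(\\S)?(${EMOTICONS})(?!\\S)`, "g");
  // If something was captured in front, return the match unchanged.
  return text.replace(pattern, (whole, before) => (before ? whole : replacement));
}

console.log(replaceEmoticons("see http://example.com :/ ok <3", "[e]"));
// → "see http://example.com [e] ok [e]"
console.log(replaceEmoticons("x:p", "[e]"));
// → "x:p"
```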

One of the comments to the question itself was suggesting \b (word boundary) markers. I don't recommend these. In fact, this suggestion would do the opposite of what you want; \b:/ will indeed match http:// since there is a word boundary between the p and the :. This kind of reasoning would suggest \B (not a word boundary), e.g. \B:/\B. This is more portable (it works with pretty much all regex parsers while look-arounds do not), and you can choose it in that case, but I prefer the look-arounds.
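
To make the \b versus \B difference concrete, here is a small check (the sample strings are made up for the illustration):

```javascript
const url = "http://example.com";
const chat = "hmm :/ not sure";

console.log(/\b:\//.test(url));     // → true  (boundary between "p" and ":")
console.log(/\b:\//.test(chat));    // → false (no boundary between " " and ":")
console.log(/\B:\/\B/.test(url));   // → false
console.log(/\B:\/\B/.test(chat));  // → true

// Caveat: a trailing \B fails when the emoticon ends in a word character.
console.log(/\B:p\B/.test("lol :p ok")); // → false ("p" followed by " " is a boundary)
```

So \B works for emoticons made purely of punctuation, but ones ending in a letter or digit (:p, <3) would need a different trailing condition, which is one reason the look-arounds are easier to apply uniformly.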

Source: https://stackoverflow.com/questions/28077049/regex-matching-emoticons

Tags

javascript

regex

emoji

emoticons