I want a regex to match a simple hashtag like those on Twitter (e.g. #someword). I also want it to recognize non-standard characters (like those in Spanish, Hebrew or Chinese).
Eventually I found this useful link: twitter-text.js, which is basically how Twitter solves this problem.
With native JS regexes that don't support Unicode character classes, your only option is to explicitly enumerate the characters that can end the tag and match everything else, for example:
> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:,]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]
The [\s.,:,] class should include spaces, punctuation and whatever else can be considered a terminating symbol.
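To pull every tag out of a string rather than just the first, the same lookahead idea can be wrapped in a loop. This is only a rough sketch: `extractHashtags` is a made-up helper name, and the terminator set (`[\s.,:;!?]`) is my own guess at what should end a tag, so adjust it to the punctuation you expect.

```javascript
// Hypothetical helper (not part of the original answer): collect every
// hashtag body by matching lazily up to the next terminator or end of input.
function extractHashtags(text) {
  // Assumed terminator set; extend it with whatever else should end a tag.
  var pattern = /#(.+?)(?=[\s.,:;!?]|$)/g;
  var tags = [];
  var match;
  // exec() with the g flag resumes from the previous match on each call.
  while ((match = pattern.exec(text)) !== null) {
    tags.push(match[1]); // capture group 1 is the tag without the leading #
  }
  return tags;
}
```

Called as `extractHashtags("foo #הַתִּקְוָה. bar #mañana, baz")`, it returns the tag bodies without the `#` or the trailing punctuation.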
#([^#]+)[\s,;]*
Explanation: This regular expression will search for a # followed by one or more non-# characters, followed by 0 or more spaces, commas or semicolons.
var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);
Result:
["#hasta ", "#mañana ", "#babהַ"]
EDIT - Replaced \b for word boundary
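Note that the matches above keep the leading # and any trailing whitespace. If you want just the words, one possible follow-up is to strip both afterwards. This is a sketch only; `cleanHashtags` is a hypothetical name and the trimming rules are an assumption about what counts as a separator.

```javascript
// Hypothetical helper (not part of the original answer): run the same
// pattern, then strip the leading # and trailing separators from each match.
function cleanHashtags(input) {
  // match() returns null when nothing matches, so fall back to an empty array.
  var matches = input.match(/#([^#]+)[\s,;]*/g) || [];
  return matches.map(function (m) {
    return m.replace(/^#/, "").replace(/[\s,;]+$/, "");
  });
}
```

For the input used above, `cleanHashtags("#hasta #mañana #babהַ")` gives the three words with no # and no trailing spaces.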