Regex for a (twitter-like) hashtag that allows non-ASCII characters

遥遥无期 2020-12-03 18:08

I want a regex that matches a simple hashtag like the ones on Twitter (e.g. #someword). It should also recognize non-ASCII characters, such as those used in Spanish, Hebrew, or Chinese.

3 answers
  • 2020-12-03 18:28

    Eventually I found twitter-text.js useful, which is basically how Twitter itself solves this problem.

  • 2020-12-03 18:29

    With native JS regexes that lack Unicode property support (engines without the ES2018 `u` flag and `\p{…}` escapes), your only option is to explicitly enumerate the characters that can end the tag and match everything else, for example:

    > s = "foo #הַתִּקְוָה. bar"
    "foo #הַתִּקְוָה. bar"
    > s.match(/#(.+?)(?=[\s.,:;]|$)/)
    ["#הַתִּקְוָה", "הַתִּקְוָה"]

    The character class [\s.,:;] should list whitespace, punctuation, and anything else you consider a terminating symbol.
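To collect every tag in a string rather than only the first, the same terminator idea can be written as a negated character class with the global flag. This is only a sketch: the helper name `extractHashtags` and the exact terminator set are assumptions for illustration, and you would extend the class with whatever punctuation your input uses.

```javascript
// Hypothetical helper (not from the answer above): returns hashtag bodies
// without the leading "#". The terminator set is an assumption; extend it
// as needed.
function extractHashtags(text) {
  // "#" followed by one or more characters that are NOT whitespace,
  // another "#", or one of the listed punctuation marks.
  var pattern = /#([^\s#.,:;!?]+)/g;
  var tags = [];
  var match;
  // exec() with the g flag resumes from the previous match on each call.
  while ((match = pattern.exec(text)) !== null) {
    tags.push(match[1]);
  }
  return tags;
}

// extractHashtags("foo #mañana, bar") returns ["mañana"]
```

Because the class excludes characters instead of listing allowed ones, it needs no Unicode support from the engine, at the cost of accepting anything you forgot to exclude.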

  • 2020-12-03 18:31

    #([^#]+)[\s,;]*

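    Explanation: This regular expression searches for a # followed by one or more non-# characters, followed by zero or more whitespace characters, commas, or semicolons. Because [^#]+ also matches whitespace, each match runs up to the next #, which is why the first two results below keep their trailing space.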

    var input = "#hasta #mañana #babהַ";
    var matches = input.match(/#([^#]+)[\s,;]*/g);
    

    Result:

    ["#hasta ", "#mañana ", "#babהַ"]
    

    EDIT - Replaced \b for word boundary
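If the target engine supports ES2018 Unicode property escapes (an assumption; older browsers do not), a tag can instead be described by what it may contain: letters, combining marks, digits, and underscore. The sketch below uses a hypothetical helper name, `findHashtags`, and that character set is an illustrative choice, not Twitter's exact rule.

```javascript
// Hypothetical helper: returns each hashtag including its leading "#".
// Requires the "u" flag so that \p{...} property escapes are recognized.
function findHashtags(text) {
  // \p{L} = letters, \p{M} = combining marks (e.g. Hebrew vowel points),
  // \p{N} = numbers; underscore is allowed as well.
  const pattern = /#[\p{L}\p{M}\p{N}_]+/gu;
  // match() with the g flag returns null when nothing matches.
  return text.match(pattern) || [];
}

// findHashtags("#hello world, #foo") returns ["#hello", "#foo"]
```

Unlike a pattern that accepts every non-# character, this one stops at spaces and punctuation, so ordinary words between tags are not swallowed into the match.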
