_Actual_ Twitter format for hashtags? Not your regex, not his code— the actual one?

Submitted by 那年仲夏 on 2019-12-03 05:32:27
jball

Going by Twitter's support documentation, the basic rules seem to be that a hashtag must be preceded by a space and ends at any whitespace or punctuation.


Quote from Twitter's support:

Check your hashtags for the following:

  • Is there any symbol in or after the hashtag?
    • If you write #noican't, your message will be categorized under #noican. Punctuation marks ( , . ; ' ? ! etc.) will end your hashtag wherever punctuation occurs.
  • Is there any letter preceding the #symbol?
    • If you write 23#idoittoo or word#idoittoo, your Tweets will not show in searches for the hashtag #idoittoo. Hashtags will not work with letters or numbers in front of the # symbol. The # symbol must have a space directly in front of it in order for it to show correctly in searches.

Therefore, the initial token is # preceded by a space, and the terminator is any whitespace or punctuation. The "etc" in their list of punctuation (" , . ; ' ? ! etc.") is annoying, but I'll keep digging and see if I can find something authoritative on what else counts as punctuation.
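As a rough illustration of those basic rules only, here is a minimal sketch. The helper name extractBasicHashtags is made up for this example, and treating the start of the text like a preceding space, and the underscore as part of the tag, are my assumptions rather than anything the support article states.

```javascript
// Hypothetical helper: applies only the basic rules from the support article.
// Assumptions: start-of-text counts like a leading space, and "_" is allowed in a tag.
function extractBasicHashtags(text) {
  // "#" must follow whitespace (or start the text); the tag stops at the
  // first character that is not an ASCII letter, digit, or underscore.
  const pattern = /(?:^|\s)#([A-Za-z0-9_]+)/g;
  return Array.from(text.matchAll(pattern), (match) => match[1]);
}

console.log(extractBasicHashtags("If you write #noican't")); // [ 'noican' ]
console.log(extractBasicHashtags("23#idoittoo or word#idoittoo")); // []
```

This reproduces the two behaviours quoted above (the apostrophe ends the tag, and a letter or digit before the # suppresses it), but nothing beyond that.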

After digging around a while, I found some interesting blog articles by Terence Eden (Hashtags and Implicit Knowledge, Hashtag Standards) that provide evidence that Twitter doesn't even have a standard, given that the software it develops on different platforms seems to have different rules of what constitutes a hashtag.

They also link to the Twitter Conformance Library, which has twitter / twitter-text-conformance / autolink.yml. The hashtag section in autolink.yml has many cases matching the above rules, but also some that violate them and are still supposed to be autolinked. Some examples:

- description: "DO NOT Autolink all-numeric hashtags"
  text: "text #1234"
  expected: "text #1234"

- description: "Autolink hashtag preceded by a period"
  text: "text.#hashtag"
  expected: "text.<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">#hashtag</a>"

- description: "Autolink hashtag with full-width hash (U+FF03)"
  text: "＃hashtag"
  expected: "<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">＃hashtag</a>"

Those are just a few examples that don't match the basic rules given in the first support article, and unfortunately the yml is full of other examples as well.
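To make the mismatch concrete, here is a small table-driven sketch that runs the three cases above against a candidate pattern. The pattern is my own illustration, not Twitter's: it is ASCII-only apart from the full-width hash, and the names candidate, extractHashtags, and cases are hypothetical.

```javascript
// Illustrative pattern (an assumption, not taken from twitter-text):
// a boundary that is not a letter, digit, or underscore, then "#" or the
// full-width hash U+FF03, then a tag containing at least one non-digit.
const candidate = /(?:^|[^A-Za-z0-9_])[#\uFF03]([A-Za-z0-9_]*[A-Za-z_][A-Za-z0-9_]*)/g;

function extractHashtags(text) {
  return Array.from(text.matchAll(candidate), (match) => match[1]);
}

// The three conformance examples quoted above, reduced to the expected tags.
const cases = [
  { text: "text #1234", expected: [] },
  { text: "text.#hashtag", expected: ["hashtag"] },
  { text: "\uFF03hashtag", expected: ["hashtag"] },
];

for (const { text, expected } of cases) {
  const actual = extractHashtags(text);
  const ok = JSON.stringify(actual) === JSON.stringify(expected);
  console.log(ok ? "pass" : "FAIL", JSON.stringify(text), JSON.stringify(actual));
}
```

Any pattern you come up with can be dropped into candidate and checked the same way. The real yml has far more cases than these three.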

user1122127

There is in fact an official specification for hashtags. Twitter accepts only a subset of Unicode characters in a hashtag. Here is the regular expression that recognizes all valid hashtags used on Twitter (pulled from their own source code).

To see how it's generated see the source code of twitter-text.

/(#|＃)([a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u024f\u0253-\u0254\u0256-\u0257\u0300-\u036f\u1e00-\u1eff\u0400-\u04ff\u0500-\u0527\u2de0-\u2dff\ua640-\ua69f\u0591-\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05d0-\u05ea\u05f0-\u05f4\ufb12-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufb4f\u0610-\u061a\u0620-\u065f\u066e-\u06d3\u06d5-\u06dc\u06de-\u06e8\u06ea-\u06ef\u06fa-\u06fc\u0750-\u077f\u08a2-\u08ac\u08e4-\u08fe\ufb50-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\u200c-\u200c\u0e01-\u0e3a\u0e40-\u0e4e\u1100-\u11ff\u3130-\u3185\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff\uffa1-\uffdc\u30a1-\u30fa\u30fc-\u30fe\uff66-\uff9f\uff10-\uff19\uff21-\uff3a\uff41-\uff5a\u3041-\u3096\u3099-\u309e\u3400-\u4dbf\u4e00-\u9fff\u20000-\u2a6df\u2a700-\u2b73f\u2b740-\u2b81f\u2f800-\u2fa1f]*[a-z_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u024f\u0253-\u0254\u0256-\u0257\u0300-\u036f\u1e00-\u1eff\u0400-\u04ff\u0500-\u0527\u2de0-\u2dff\ua640-\ua69f\u0591-\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05d0-\u05ea\u05f0-\u05f4\ufb12-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufb4f\u0610-\u061a\u0620-\u065f\u066e-\u06d3\u06d5-\u06dc\u06de-\u06e8\u06ea-\u06ef\u06fa-\u06fc\u0750-\u077f\u08a2-\u08ac\u08e4-\u08fe\ufb50-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\u200c-\u200c\u0e01-\u0e3a\u0e40-\u0e4e\u1100-\u11ff\u3130-\u3185\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff\uffa1-\uffdc\u30a1-\u30fa\u30fc-\u30fe\uff66-\uff9f\uff10-\uff19\uff21-\uff3a\uff41-\uff5a\u3041-\u3096\u3099-\u309e\u3400-\u4dbf\u4e00-\u9fff\u20000-\u2a6df\u2a700-\u2b73f\u2b740-\u2b81f\u2f800-\u2fa1f][a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff\u0100-\u024f\u0253-\u0254\u0256-\u0257\u0300-\u036f\u1e00-\u1eff\u0400-\u04ff\u0500-\u0527\u2de0-\u2dff\ua640-\ua69f\u0591-\u05bf\u05c1-\u05c2\u05c4-\u05c5\u05d0-\u05ea\u05f0-\u05f4\ufb12-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufb4f\u0610-\u061a\u0620-\u065f\u066e-\u06d3\u06d5-\u06dc\u06de-\u06e8\u06ea-\u06ef\u06fa-\u06fc\u0750-\u077f\u08a2-\u08ac\u08e4-\u08fe\ufb50-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\u200c-\u200c\u0e01-\u0e3a\u0e40-\u0e4e\u1100-\u11ff\u3130-\u3185\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff\uffa1-\uffdc\u30a1-\u30fa\u30fc-\u30fe\uff66-\uff9f\uff10-\uff19\uff21-\uff3a\uff41-\uff5a\u3041-\u3096\u3099-\u309e\u3400-\u4dbf\u4e00-\u9fff\u20000-\u2a6df\u2a700-\u2b73f\u2b740-\u2b81f\u2f800-\u2fa1f]*)/gi
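The expression is easier to read once you notice it is the same character class repeated three times: any number of letters or digits, then at least one character that is not a digit, then any number of letters or digits. A rough sketch of composing a pattern of that shape from named pieces follows. The names letters, alnum, hashtagPattern, and extractHashtags are mine, and the character list is deliberately cut down to ASCII plus a few Latin-1 ranges, so this is not a replacement for the full expression.

```javascript
// Deliberately truncated character list (an assumption for readability):
// ASCII letters, underscore, and three Latin-1 ranges from the full regex.
const letters = "a-z_\\u00c0-\\u00d6\\u00d8-\\u00f6\\u00f8-\\u00ff";
const alnum = letters + "0-9";

// Same shape as the long expression: (#|＃)([alnum]*[letters][alnum]*)
const hashtagPattern = new RegExp(
  `(#|\\uFF03)([${alnum}]*[${letters}][${alnum}]*)`,
  "gi"
);

function extractHashtags(text) {
  // Group 1 is the hash character, group 2 is the tag text.
  return Array.from(text.matchAll(hashtagPattern), (match) => match[2]);
}

// "#2019" is skipped because the tag needs at least one non-digit character.
console.log(extractHashtags("#caf\u00e9 #2019 #a\u00f1o_2019 #Hello"));
```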

I found this on dev.twitter.com: "Need help parsing tweet text?"

Take a look on the Twitter text processing library we’re using for auto linking and extraction of usernames, lists & hashtags.

(There are Ruby, Java, and JavaScript libraries.)

They are quite enormous, as Twitter must take every possible case into account.

This is what I use; it's the closest I've gotten:

/#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])/g

link of the hashtag Regex to test
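For anyone who wants to try that expression locally, here is a quick sketch that wraps it in a helper (the name extractTags is made up for this example) and runs a few inputs through it. Note that, unlike the conformance cases quoted earlier, it needs at least two alphanumeric characters, accepts all-numeric tags, and does not check what comes before the #.

```javascript
// The regex from the answer above, unchanged.
const pattern = /#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])/g;

// Hypothetical helper that returns the captured tag text for every match.
function extractTags(text) {
  return Array.from(text.matchAll(pattern), (match) => match[1]);
}

console.log(extractTags("#foo_bar and #ab")); // [ 'foo_bar', 'ab' ]
console.log(extractTags("#a"));               // [] (a single character is too short)
console.log(extractTags("#foo_"));            // [ 'foo' ] (must end in a letter or digit)
console.log(extractTags("#1234 word#tag"));   // [ '1234', 'tag' ]
```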

Based on how the official Twitter client for Mac highlights hashtags, I suspect the rule is any sequence of contiguous letters, numbers, or underscores following a hash. In other words, it's as simple as the regex /#\w+/ (assuming a Unicode-aware regex engine).
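The "Unicode-aware" caveat matters in practice. In JavaScript, for instance, \w only covers ASCII letters, digits, and the underscore, so a closer equivalent there is a class built from Unicode property escapes. The sketch below compares the two. The names asciiPattern, unicodePattern, and findHashtags are made up, and the choice of \p{L}, \p{M}, and \p{N} is my assumption about what "letters and numbers" should mean, not something taken from Twitter.

```javascript
// JavaScript's \w is ASCII-only, so this stops at the first non-ASCII letter.
const asciiPattern = /#\w+/g;

// Assumed Unicode-aware equivalent: letters, combining marks, numbers, underscore.
const unicodePattern = /#[\p{L}\p{M}\p{N}_]+/gu;

function findHashtags(text, pattern) {
  // String.prototype.match with a global regex returns every match, or null.
  return text.match(pattern) || [];
}

// Escapes spell out "#héllo #日本語 #foo_bar!" without depending on file encoding.
const sample = "#h\u00e9llo #\u65e5\u672c\u8a9e #foo_bar!";

console.log(findHashtags(sample, asciiPattern));   // [ '#h', '#foo_bar' ]
console.log(findHashtags(sample, unicodePattern)); // all three tags, without the "!"
```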

The Twitter entity parsing libraries are available here: https://github.com/twitter/twitter-text
