Parsing a tweet to extract hashtags into an array

旧时难觅i 2020-12-03 05:33

I am having a heck of a time taking the information in a tweet including hashtags, and pulling each hashtag into an array using Python. I am embarrassed to even put what I

9 answers
  • 2020-12-03 06:09

    AndiDog's answer will trip over links and similar content, so you may want to filter those out first. After that, use this code:

    import re

    # Python 3: the ur'' prefix is gone; re itself interprets the \uXXXX escapes.
    UTF_CHARS = r'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
    TAG_EXP = r'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
    TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)
    

    It may seem like overkill, but this was converted from twitter-text-java (http://github.com/mzsanford/twitter-text-java). It will handle about 99% of all hashtags the same way Twitter handles them.

    For more converted Twitter regexes, see: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py

    EDIT:
    Check out: http://github.com/BonsaiDen/AtarashiiFormat
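
A quick way to exercise the pattern from the answer above, as a sketch for Python 3. The helper name `find_hashtags` and the sample strings are illustrative assumptions, not part of the original answer. The tag text is the third capture group; the first group only holds whatever precedes the `#`.

```python
import re

UTF_CHARS = r'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = r'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)


def find_hashtags(text):
    """Return hashtag names (without '#') in order of appearance.

    Hypothetical helper: group 3 of TAG_REGEX is the tag text.
    """
    return [match.group(3) for match in TAG_REGEX.finditer(text)]


print(find_hashtags("I love #stackoverflow because #people are very #helpful!"))
# ['stackoverflow', 'people', 'helpful']

# A '#' directly after '/' is not treated as a hashtag.
print(find_hashtags("http://example.org/#comments"))
# []

# Tags made only of digits are rejected: at least one letter or '_' is required.
print(find_hashtags("order #123 shipped"))
# []
```

Because the pattern excludes `/` and alphanumerics from the characters allowed before `#`, URL fragments and things like `abc#def` are skipped without a separate link-filtering step, at least for these inputs.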

  • 2020-12-03 06:12

    A simple regex should do the job:

    >>> import re
    >>> s = "I love #stackoverflow because #people are very #helpful!"
    >>> re.findall(r"#(\w+)", s)
    ['stackoverflow', 'people', 'helpful']
    

    Note, though, that as other answers point out, this can also match non-hashtags, such as a fragment identifier in a URL:

    >>> re.findall(r"#(\w+)", "http://example.org/#comments")
    ['comments']
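
One hedged way to keep the simple regex while avoiding that false positive is to remove links before searching, as another answer suggests. In the sketch below, `URL_PATTERN` and `hashtags_without_urls` are hypothetical names, and the URL pattern is a deliberately crude assumption (anything starting with `http://` or `https://` up to the next whitespace), not a full URL parser.

```python
import re

# Assumption: a "URL" is http:// or https:// followed by non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')


def hashtags_without_urls(text):
    """Strip URL-like substrings, then collect '#word' tags from what is left."""
    cleaned = URL_PATTERN.sub(' ', text)
    return re.findall(r'#(\w+)', cleaned)


print(hashtags_without_urls("http://example.org/#comments"))
# []
print(hashtags_without_urls("Read http://example.org/#comments about #python"))
# ['python']
```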
    

    So another simple solution would be the following (removes duplicates as a bonus):

    >>> def extract_hash_tags(s):
    ...    return set(part[1:] for part in s.split() if part.startswith('#'))
    ...
    >>> extract_hash_tags("#test http://example.org/#comments #test")
    {'test'}
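
Two caveats with the whitespace-splitting version: a token such as `#helpful!` keeps its trailing punctuation, and a set loses the order of the tags. A minimal variant that addresses both might look like the sketch below. It assumes Python 3.7+ (dicts preserve insertion order) and uses `\w+` as the definition of a tag, which is an assumption rather than Twitter's actual rule.

```python
import re


def extract_hash_tags(text):
    """Return unique hashtags (without '#') in first-seen order."""
    seen = {}  # dict used as an ordered set
    for token in text.split():
        if not token.startswith('#'):
            continue
        # Keep only the leading run of word characters after '#'.
        match = re.match(r'#(\w+)', token)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


print(extract_hash_tags("#test http://example.org/#comments #test"))
# ['test']
print(extract_hash_tags("I love #stackoverflow because #people are very #helpful!"))
# ['stackoverflow', 'people', 'helpful']
```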
    
  • 2020-12-03 06:19
    hashtags = [word for word in tweet.split() if word[0] == "#"]
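
For reference, here is the one-liner above run on a sample string (the value of `tweet` is an illustrative assumption). Note that each result keeps its leading `#`, and any punctuation attached to the token comes along too. Indexing `word[0]` is safe here because `str.split()` with no arguments never returns empty strings.

```python
tweet = "I love #stackoverflow because #people are very #helpful!"

# Keep every whitespace-separated token whose first character is '#'.
hashtags = [word for word in tweet.split() if word[0] == "#"]

print(hashtags)
# ['#stackoverflow', '#people', '#helpful!']
```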
    