Python Regex expression for extracting hashtags from text

此生再无相见时 submitted on 2019-12-25 09:17:08

Question


I'm processing some tweets I mined during the election and I need a way to extract hashtags from the tweet text while accounting for punctuation, non-Unicode characters, etc., and still retaining the hashtag in the output list.

For example, the original text from a tweet looks like:

I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn

and when turned into a string in Python (or even put into a code block on this site), the special characters near the end are changed, producing this:

"I'm with HER! #NeverTrump #DumpTrump #imwithherdY\xd8\xa7dY\xd8, @ Williamsburg, Brooklyn"

Now I would like to parse the string into a list like this:

['#NeverTrump','#DumpTrump', '#imwithher']

I'm currently using this expression where str is the above string:

tokenizedTweet = re.findall(r'(?i)\#\w+', str, flags=re.UNICODE)

However, I'm getting this as output:

['#NeverTrump', '#DumpTrump', '#imwithherdY\xd8']

How would I account for 'dY\xd8' in my regex to exclude it? I'm also open to other solutions not involving regex.


Answer 1:


Yah, about the solution not involving regex. ;)

# -*- coding: utf-8 -*-
import string

hashtags = []

a = "I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn"

# keep only printable ASCII characters, which drops the emoji
a = ''.join(filter(lambda x: x in string.printable, a))

print(a)

# split on spaces and keep the tokens that start with '#'
for token in a.split(' '):
    if token.startswith('#'):
        # remove any comma attached to the hashtag
        hashtags.append(token.strip(','))

print(hashtags)

and tada: ['#NeverTrump', '#DumpTrump', '#imwithher']
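
For anyone who would rather stay with a regex, here is a minimal sketch of the same idea. It assumes the tweet is a correctly decoded Unicode string (as in the snippet above) and that a hashtag consists only of ASCII letters, digits and underscores. The names `HASHTAG_RE` and `extract_hashtags` are made up for this example. If the text has already been mangled into ASCII letters such as 'dY', a character class cannot tell that debris apart from the hashtag itself.

```python
import re

# Explicit ASCII character class instead of \w, so matching stops
# at the first non-ASCII character (for example an emoji).
HASHTAG_RE = re.compile(r'#[A-Za-z0-9_]+')


def extract_hashtags(text):
    """Return every '#' followed by ASCII word characters, in order."""
    return HASHTAG_RE.findall(text)


tweet = "I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn"
print(extract_hashtags(tweet))
```

Unlike splitting on spaces, this also stops at trailing punctuation such as '!' or '.', not just commas.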



Source: https://stackoverflow.com/questions/40622037/python-regex-expression-for-extracting-hashtags-from-text
