Question
I'm processing some tweets I mined during the election, and I need a way to extract hashtags from the tweet text while accounting for punctuation, non-ASCII characters, etc., and still keeping the leading '#' on each hashtag in the output list.
For example, the original text from a tweet looks like:
I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn
When it is turned into a string in Python (or even put into a code block on this site), the special characters near the end are changed, producing this:
"I'm with HER! #NeverTrump #DumpTrump #imwithherdY\xd8\xa7dY\xd8, @ Williamsburg, Brooklyn"
Now I would like to parse the string into a list like this:
['#NeverTrump','#DumpTrump', '#imwithher']
I'm currently using this expression where str is the above string:
tokenizedTweet = re.findall(r'(?i)\#\w+', str, flags=re.UNICODE)
However, I'm getting this as output:
['#NeverTrump', '#DumpTrump', '#imwithherdY\xd8']
How would I account for 'dY\xd8' in my regex to exclude it? I'm also open to other solutions not involving regex.
Answer 1:
Yeah, here is a solution that doesn't involve regex. ;)
# -*- coding: utf-8 -*-
import string
tweets = []  # collects the hashtags found in the tweet text
a = "I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn"
# keep only printable ASCII characters, which drops the emoji
a = ''.join(filter(lambda x: x in string.printable, a))
print(a)
# keep the space-separated tokens that start with '#'
for tweet in a.split(' '):
    if tweet.startswith('#'):
        tweets.append(tweet.strip(','))
print(tweets)
and tada: ['#NeverTrump', '#DumpTrump', '#imwithher']
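For comparison, here is a minimal regex sketch. It assumes Python 3 and that the tweet text has already been decoded correctly into a str (so the flag emoji is still an emoji, not the mangled bytes shown in the question). The helper name extract_hashtags is made up for illustration. Under those assumptions the plain \w pattern from the question should already stop at the emoji, because emoji are symbols rather than word characters:

```python
import re

def extract_hashtags(text):
    """Return the hashtags (leading '#' kept) found in an already-decoded str."""
    # \w matches letters, digits and underscore; the regional-indicator
    # characters that make up the flag emoji are symbols, so the match ends
    # right before them.
    return re.findall(r'#\w+', text)

# The flag emoji is written with escapes so the example does not depend on
# the encoding of the source file.
tweet = "I'm with HER! #NeverTrump #DumpTrump #imwithher\U0001F1FA\U0001F1F8 @ Williamsburg, Brooklyn"
hashtags = extract_hashtags(tweet)
print(hashtags)  # ['#NeverTrump', '#DumpTrump', '#imwithher']
```

This will not help if the string is already mangled as in the question, since the stray 'dY' there consists of ordinary letters that \w matches. In that case the text probably needs to be decoded as UTF-8 when it is read.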
Source: https://stackoverflow.com/questions/40622037/python-regex-expression-for-extracting-hashtags-from-text