Python Regex expression for extracting hashtags from text

Question

I'm processing some tweets I mined during the election, and I need a way to extract hashtags from the tweet text that accounts for punctuation, non-Unicode characters, etc., while still retaining the hashtag in the output list.

For example, the original text from a tweet looks like:

I'm with HER! #NeverTrump #DumpTrump #imwithherðŸ‡ºðŸ‡¸ @ Williamsburg, Brooklyn

and when turned into a string in Python (or even put into a code block on this site), the special characters near the end are changed, producing this:

"I'm with HER! #NeverTrump #DumpTrump #imwithherdY\xd8\xa7dY\xd8, @ Williamsburg, Brooklyn"

Now I would like to parse the string into a list like this:

['#NeverTrump','#DumpTrump', '#imwithher']

I'm currently using this expression, where str is the above string:

tokenizedTweet = re.findall(r'(?i)\#\w+', str, flags=re.UNICODE)

However, I'm getting this as output:

['#NeverTrump', '#DumpTrump', '#imwithherdY\xd8']

How would I account for 'dY\xd8' in my regex to exclude it? I'm also open to other solutions not involving regex.

Answer 1:

Yah, about the solution not involving regex. ;)

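# -*- coding: utf-8 -*-
import string

hashtags = []

a = "I'm with HER! #NeverTrump #DumpTrump #imwithherðŸ‡ºðŸ‡¸ @ Williamsburg, Brooklyn"

# keep only printable ASCII characters, which drops the garbled emoji characters
a = ''.join(filter(lambda x: x in string.printable, a))

print(a)

# split on spaces and keep the tokens that start with '#'
for token in a.split(' '):
    if token.startswith('#'):
        # strip a trailing comma that may be attached to the hashtag
        hashtags.append(token.strip(','))

print(hashtags)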

and tada: ['#NeverTrump', '#DumpTrump', '#imwithher']
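
If you would still rather use a regex, as in the question, here is a minimal sketch, assuming Python 3 and the tweet text exactly as it appears in the snippet above. The helper name extract_hashtags is made up for illustration. The idea is that re.ASCII makes \w match only [A-Za-z0-9_], so the non-ASCII debris ends the match instead of being absorbed into the hashtag.

```python
import re


def extract_hashtags(text):
    # Hypothetical helper, not part of the original answer.
    # With re.ASCII, \w matches only [A-Za-z0-9_], so characters such as
    # "ð" or "Ÿ" stop the match right after the ASCII part of the tag.
    return re.findall(r'#\w+', text, flags=re.ASCII)


tweet = "I'm with HER! #NeverTrump #DumpTrump #imwithherðŸ‡ºðŸ‡¸ @ Williamsburg, Brooklyn"
print(extract_hashtags(tweet))
# ['#NeverTrump', '#DumpTrump', '#imwithher']
```

Note that this does not help with the 'dY' in the string shown in the question: those two characters are plain ASCII letters, so no character class can tell them apart from the rest of the tag.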

Source: https://stackoverflow.com/questions/40622037/python-regex-expression-for-extracting-hashtags-from-text

Tags

python

regex

twitter