Question
I have a DataFrame df. I want to extract the hashtags from the tweets where Max == 45:
Max Tweets
42 via @VIE_unlike at #fashion
42 Ny trailer #katamaritribute #ps3
45 Saved a baby bluejay from dogs #fb
45 #Niley #Niley #Niley
I'm trying something like this, but it gives an empty DataFrame:
df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]
Is there something in pandas that I can use to do this efficiently?
Answer 1:
You can use pd.Series.str.findall:
In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]:
0 [#fashion]
1 [#katamaritribute, #ps3]
2 [#fb]
3 [#Niley, #Niley, #Niley]
This returns a column of lists.
If you want to filter first and then find, you can do so quite easily using boolean indexing:
In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]:
2 [#fb]
3 [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
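For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question, so the default 0..3 index is an assumption, and .loc is used for the filter, which should be equivalent to the boolean indexing shown above.

```python
import pandas as pd

# Sample data copied from the question; the default integer index is assumed.
df = pd.DataFrame({
    "Max": [42, 42, 45, 45],
    "Tweets": [
        "via @VIE_unlike at #fashion",
        "Ny trailer #katamaritribute #ps3",
        "Saved a baby bluejay from dogs #fb",
        "#Niley #Niley #Niley",
    ],
})

pattern = r"#.*?(?=\s|$)"

# Keep only the rows where Max == 45, then collect every hashtag per tweet.
hashtags = df.loc[df["Max"] == 45, "Tweets"].str.findall(pattern)

print(hashtags.tolist())
# [['#fb'], ['#Niley', '#Niley', '#Niley']]
```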
The regex used here is:
#.*?(?=\s|$)
To understand it, break it down:
#.*? - carries out a non-greedy match for a word starting with #
(?=\s|$) - lookahead for the end of the word or end of the sentence
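To see why the non-greedy quantifier matters, here is a small sketch using the standard re module on one of the sample tweets. The greedy variant is not part of the answer; it is included only for comparison.

```python
import re

tweet = "Ny trailer #katamaritribute #ps3"

# Non-greedy: each match stops at the first whitespace (or end of string).
lazy = re.findall(r"#.*?(?=\s|$)", tweet)

# Greedy (comparison only): .* runs to the end of the string, so both
# hashtags are swallowed into a single match.
greedy = re.findall(r"#.*(?=\s|$)", tweet)

print(lazy)    # ['#katamaritribute', '#ps3']
print(greedy)  # ['#katamaritribute #ps3']
```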
If it's possible that you have # in the middle of a word that is not a hashtag, that would yield false positives, which you wouldn't want. In that case, you can modify your regex to include a lookbehind:
(?:(?<=\s)|(?<=^))#.*?(?=\s|$)
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
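A short sketch of the difference, using two made-up tweets (they are not from the question and serve only as an illustration):

```python
import pandas as pd

# Hypothetical tweets: the first has a '#' inside a word, the second
# starts with a hashtag.
tweets = pd.Series(["issue#42 fixed #bug", "#start of tweet"])

plain = r"#.*?(?=\s|$)"
guarded = r"(?:(?<=\s)|(?<=^))#.*?(?=\s|$)"

without_lookbehind = tweets.str.findall(plain).tolist()
with_lookbehind = tweets.str.findall(guarded).tolist()

print(without_lookbehind)  # [['#42', '#bug'], ['#start']]
print(with_lookbehind)     # [['#bug'], ['#start']]
```

With the lookbehind, '#42' is skipped because the preceding character is a letter, while a hashtag at the very start of the tweet is still matched.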
Source: https://stackoverflow.com/questions/45874879/extract-hashtags-from-columns-of-a-pandas-dataframe