Extract hashtags from columns of a pandas dataframe

Submitted by 一世执手 on 2019-12-12 21:21:44

Question


I have a dataframe df. I want to extract the hashtags from the tweets where Max == 45:

Max    Tweets
42   via @VIE_unlike at #fashion
42   Ny trailer #katamaritribute #ps3
45   Saved a baby bluejay from dogs #fb
45   #Niley #Niley #Niley 

I'm trying something like this, but it returns an empty dataframe:

df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]

Is there something in pandas that I can use to do this efficiently?


Answer 1:


You can use pd.Series.str.findall:

In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]: 
0                  [#fashion]
1    [#katamaritribute, #ps3]
2                       [#fb]
3    [#Niley, #Niley, #Niley]

This returns a column of lists.

If you want to filter first and then find, you can do so quite easily using boolean indexing:

In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]: 
2                       [#fb]
3    [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
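The two calls above can be reproduced end to end with a short script. This is a minimal sketch that rebuilds the sample dataframe from the question; the variable names HASHTAG, all_tags and tags_45 are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (the last tweet keeps its trailing space).
df = pd.DataFrame({
    "Max": [42, 42, 45, 45],
    "Tweets": [
        "via @VIE_unlike at #fashion",
        "Ny trailer #katamaritribute #ps3",
        "Saved a baby bluejay from dogs #fb",
        "#Niley #Niley #Niley ",
    ],
})

# Pattern from the answer: a '#' followed by as few characters as possible,
# up to the next whitespace or the end of the string.
HASHTAG = r"#.*?(?=\s|$)"

# Hashtags for every row: a Series of lists.
all_tags = df["Tweets"].str.findall(HASHTAG)

# Filter with a boolean mask first, then extract from the matching rows only.
tags_45 = df.loc[df["Max"] == 45, "Tweets"].str.findall(HASHTAG)

print(tags_45)
```

Using df.loc with a boolean mask and a column label is equivalent to df.Tweets[df.Max == 45] here; it just keeps the row and column selection in a single indexing step.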

The regex used here is:

#.*?(?=\s|$)

To understand it, break it down:

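  • #.*? - matches a # followed by as few characters as possible (a non-greedy match), so each match stops at the end of its own hashtag
  • (?=\s|$) - a lookahead that requires whitespace or the end of the string to come next, without consuming it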
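To see why the non-greedy quantifier matters, it may help to compare it with a greedy variant on one of the sample tweets. The sketch below uses Python's re module directly; the greedy pattern is not from the answer and is included only for contrast.

```python
import re

tweet = "Ny trailer #katamaritribute #ps3"

# Pattern from the answer: non-greedy, so each match stops at the first
# position where the lookahead (whitespace or end of string) succeeds.
non_greedy = re.findall(r"#.*?(?=\s|$)", tweet)

# Greedy variant (for contrast only): '.*' runs to the end of the string,
# so both hashtags are swallowed into a single match.
greedy = re.findall(r"#.*(?=\s|$)", tweet)

print(non_greedy)  # ['#katamaritribute', '#ps3']
print(greedy)      # ['#katamaritribute #ps3']
```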

If a # can appear in the middle of a word that is not a hashtag, this pattern would yield false positives. In that case, you can modify the regex to include a lookbehind:

(?:(?<=\s)|(?<=^))#.*?(?=\s|$)

The lookbehind asserts that either whitespace or the start of the string must precede the # character.
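The difference between the two patterns shows up on text where a # sits inside a word. The following is a small sketch on made-up strings (they are not from the question), with BASIC and STRICT as illustrative names for the two regexes in the answer.

```python
import pandas as pd

# The two patterns from the answer.
BASIC = r"#.*?(?=\s|$)"
STRICT = r"(?:(?<=\s)|(?<=^))#.*?(?=\s|$)"

# Hypothetical sample text with a '#' in the middle of a word.
s = pd.Series(["#start of#mid #end", "issue#42 is fixed #bugfix"])

basic = s.str.findall(BASIC)    # also picks up '#mid' and '#42'
strict = s.str.findall(STRICT)  # only '#' at the start or after whitespace

print(basic.tolist())
print(strict.tolist())
```

With the lookbehind, "of#mid" and "issue#42" no longer produce matches, while a hashtag at the very start of the string is still found.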



Source: https://stackoverflow.com/questions/45874879/extract-hashtags-from-columns-of-a-pandas-dataframe
