Question
I want to extract hashtags for my sentiment analysis project. However, I'm getting a list of dictionaries containing every hashtag along with its indices in the tweet. I only want the text.
My code:
data = tweepy.Cursor(api.search, q, since=a[i], until=b[i]).items()
tweet_data = []
tweets = pd.DataFrame()
tweets['Tweet_ID'] = map(lambda tweet: tweet['id'], tweet_data)
tweets['Tweet'] = map(lambda tweet: tweet['text'].encode('utf-8'), tweet_data)
tweets['Date'] = map(lambda tweet: time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(tweet['created_at'],'%a %b %d %H:%M:%S +0000 %Y')), tweet_data)
tweets['User'] = map(lambda tweet: tweet['user']['screen_name'], tweet_data)
tweets['Follower_count'] = map(lambda tweet: tweet['user']['followers_count'], tweet_data)
tweets['Hashtags']=map(lambda tweet: tweet['entities']['hashtags'], tweet_data)
Current Output :
df=pd.DataFrame({'Hashtags' : [{u'indices': [53, 65], u'text': u'Predictions'}, {u'indices': [67, 76], u'text': u'FreeTips'}, {u'indices': [78, 89], u'text': u'SoccerTips'}, {u'indices': [90, 103], u'text': u'FootballTips'}, {u'indices': [104, 110], u'text': u'Goals'}]})
Expected Output :
df=pd.DataFrame({'Hashtags' :["u'Predictions'", "u'SoccerTips'", "u'FootballTips'", "u'Goals'"]})
I've tried several methods to flatten/reduce/access a nested dictionary containing a list of dictionaries. Please help.
Error :
As @MSeifert suggested, I tried his method. It produced the following error:
dt=tweet.entities.hashtags
pd.io.json.json_normalize(dt, 'hashtags')
pd.io.json.json_normalize(dt, 'hashtags')['text'].tolist()
Traceback (most recent call last):
File "<ipython-input-166-be11241611d6>", line 1, in <module>
dt=tweet.entities.hashtags
AttributeError: 'dict' object has no attribute 'entities'
I've also tried doing this:
dx = tweets['Hashtags']
for key, value in dx.items():
    print key, value
With the following error:
Traceback (most recent call last):
File "<ipython-input-167-d66c278ec072>", line 2, in <module>
for key, value in dx.items():
File "C:\ANACONDA\lib\site-packages\pandas\core\generic.py", line 2740, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'items'
UPDATE :
I'm able to access the text part of the nested hashtags dictionary:
tweets['Hashtags'][1][1]['text']
Out[209]: u'INDvPAK'
I want to create a loop to append all the hashtags in the row.
Answer 1:
Instead of using the DataFrame constructor, you could use the json_normalize function:
>>> import pandas as pd
>>> d = {'Hashtags' :
... [{u'indices': [53, 65], u'text': u'Predictions'},
... {u'indices': [67, 76], u'text': u'FreeTips'},
... {u'indices': [78, 89], u'text': u'SoccerTips'},
... {u'indices': [90, 103], u'text': u'FootballTips'},
... {u'indices': [104, 110], u'text': u'Goals'}]}
>>> pd.io.json.json_normalize(d, 'Hashtags')
indices text
0 [53, 65] Predictions
1 [67, 76] FreeTips
2 [78, 89] SoccerTips
3 [90, 103] FootballTips
4 [104, 110] Goals
Then you could just use the 'text' column:
>>> pd.io.json.json_normalize(d, 'Hashtags')['text'].tolist()
[u'Predictions', u'FreeTips', u'SoccerTips', u'FootballTips', u'Goals']
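If a DataFrame is not needed for this step, the same result can be sketched with a plain list comprehension. This is only a minimal illustration that reuses the sample data from the question; the variable name hashtag_texts is an assumption, not something from the original code.

```python
# Sample data copied from the question: the hashtag entities of one tweet.
d = {'Hashtags': [
    {'indices': [53, 65], 'text': 'Predictions'},
    {'indices': [67, 76], 'text': 'FreeTips'},
    {'indices': [78, 89], 'text': 'SoccerTips'},
    {'indices': [90, 103], 'text': 'FootballTips'},
    {'indices': [104, 110], 'text': 'Goals'},
]}

# Keep only the 'text' value of each hashtag dictionary.
# hashtag_texts is a hypothetical name chosen for this sketch.
hashtag_texts = [tag['text'] for tag in d['Hashtags']]
print(hashtag_texts)
```

This skips the intermediate DataFrame entirely, which may be enough when only the hashtag strings are needed.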
Answer 2:
Here's the solution:
After a lot of troubleshooting and trying various methods, I finally figured out how to split the nested dictionary. It is a fairly simple loop. I noticed that the hashtag text can be accessed with:
tweets['Hashtags'][1][1]['text']
Out[209]: u'INDvPAK'
This was a valuable insight: I don't need to use u'text' as the key. Plain 'text' works.
Code :
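ht = []
# Outer loop: one iteration per tweet (row).
for s in range(len(tweets['Hashtags'])):
    hasht = []
    # Inner loop: one iteration per hashtag dictionary of that tweet.
    for t in range(len(tweets['Hashtags'][s])):
        hasht.append(tweets['Hashtags'][s][t]['text'])
    ht.append(hasht)
# zip() wraps each row's list in a one-element tuple (Python 2 returns a list here).
tweets['HT'] = zip(ht)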
This is a simple nested for loop. The outer loop iterates over the rows, where each row holds the list of hashtag dictionaries found under tweet['entities']['hashtags']. The inner loop iterates over the dictionaries in that list, each of the form {'indices': [...], 'text': ...}, and collects the 'text' value.
Finally, I used zip() to wrap the list of hashtags for each row/user.
OUTPUT :
([u'SoccerTips', u'FootballTips'],)
This can be split easily.
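The nested loop above could also be written as a list comprehension per tweet, which avoids the index arithmetic and the zip() wrapper. The following is a sketch under stated assumptions: the tweets are plain dicts shaped like the ones in the question, and the helper name hashtag_texts and the sample tweet_data values are hypothetical.

```python
def hashtag_texts(tweet):
    """Return the hashtag texts of one tweet dict (empty list if none).

    Hypothetical helper for this sketch; it assumes the tweet is a dict
    with an optional 'entities' -> 'hashtags' list, as in the question.
    """
    entities = tweet.get('entities', {})
    return [tag['text'] for tag in entities.get('hashtags', [])]


# Hypothetical sample tweets, shaped like the data in the question.
tweet_data = [
    {'id': 1, 'entities': {'hashtags': [
        {'indices': [78, 89], 'text': 'SoccerTips'},
        {'indices': [90, 103], 'text': 'FootballTips'},
    ]}},
    {'id': 2, 'entities': {'hashtags': []}},
]

# One list of hashtag strings per tweet.
hashtags_per_tweet = [hashtag_texts(tweet) for tweet in tweet_data]

# Optionally join each list into a single space-separated string per row.
joined = [' '.join(tags) for tags in hashtags_per_tweet]

print(hashtags_per_tweet)
print(joined)
```

Either list has one entry per tweet, so it can be assigned directly as a DataFrame column. Dictionary access with tweet['entities'] (or .get) is used because the tweets here are dicts, which is also why tweet.entities raised the AttributeError shown in the question.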
Source: https://stackoverflow.com/questions/44700371/how-to-extract-only-texts-in-hashtag-using-tweepy