问题
I have to create a multi index data frame condisering data contained in two different data frames. For each index of the second data frame (Date), for each row of the first data frame, if the value in the column Date of the first data frame is equal to the index of the second data frame then create me a multi index dataframe with each date, the number of tweets published each day and the features of each row.
This is the first data frame with Datas from Twitter:
Date Full text Retweets Likes
333 2018-04-13 RT @Tesla... 2838 0
332 2018-04-13 @timkhiggins... 7722 40733
331 2018-04-13 @TheEconomist.. 1911 18634
This is the second data frame with Datas from Tesla stock market:
Open High Low Close Volume Gap
Date
2018-04-13 283.000000 296.859985 279.519989 294.089996 8569400 11.089996
2018-04-14 303.320000 304.940002 291.619995 291.970001 7286800 -11.349999
2018-04-25 287.760010 288.000000 273.420013 275.010010 8945800 -12.750000
This is what I have tried to do:
for i in TeslaData.index:
for row in sortedTweetsData.iterrows():
if row[1]==i:
NumTweetsByDay+=1
for num in NumTweetsByDay:
idx=pd.MultiIndex.from_product([[i],[NumTweetsBy]])
colum=col
df= pd.DataFrame(row,idx,column)
The output that I am looking for is the following one:
Date Number of Tweets Full text Retweets Likes
2018-04-13 1 RT @Tesla... 2838 0
2 @timkhiggins... 7722 40733
3 @TheEconomist.. 1911 18634
回答1:
If I understand correctly, you want to filter twitter data by date if there is an entry in the stock dataset for the same date.
You can do this with isin():
# convert datatypes first:
sortedTweetsData['Date'] = pd.to_datetime(sortedTweetsData['Date'])
TeslaData.index = pd.to_datetime(TeslaData.index)
# do filtering
df = sortedTweetsData[sortedTweetsData['Date'].isin(TeslaData.index.values)]
next, you can determine how many tweets each group has:
groupsizes = df.groupby(by='Date').size()
and use that to build a list of tuples, to define your multiindex (there is likely a more elegant way to do this):
tups = [(ix, gs + 1) for ix in groupsizes.index.values for gs in range(groupsizes[ix])]
finally:
df.index = pd.MultiIndex.from_tuples(tups, names=['Date', 'Number of Tweets'])
来源:https://stackoverflow.com/questions/55758521/how-can-i-create-a-multiindex-data-frame-with-the-following-datasets