Question
I have the following dataset:
import numpy as np
import pandas as pd
d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2',
'2', '2', '2', '2', '3', '3', '3', '3', '3'],
'session': ['a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd',
'e', 'e', np.nan, 'e', 'f', 'f', 'g', np.nan, 'g'],
'date': ['2018-01-01 00:19:05', '2018-01-01 00:21:07',
'2018-01-01 00:22:07', '2018-01-01 00:22:15','2018-01-01 00:25:09',
'2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-01-01 00:21:16', '2018-01-01 00:35:22',
'2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 00:40:35',
'2018-01-01 01:31:16', '2018-01-03 00:55:22', '2018-01-03 00:58:16',
'2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:31:16']
}
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df.head()
player session date
0 1 a 2018-01-01 00:19:05
1 1 a 2018-01-01 00:21:07
2 1 b 2018-01-01 00:22:07
3 1 NaN 2018-01-01 00:22:15
4 1 b 2018-01-01 00:25:09
So, these are my three columns:
- 'player' - with three players (1,2,3) - dtype = object
- 'session' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
- 'date' (datetime object) tells us the time at which each action was implemented.
The problem in this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to give an id label for the missing values, based on the timeline. The actions missing their id can be labeled if they fall within the temporal range (first action - last action) of a certain session.
Let's say I group by player and session, and compute the time range for each session:
my_agg = df.groupby(['player', 'session']).date.agg(['min', 'max'])
my_agg
min max
player session
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:07 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-01-01 00:21:16 2018-01-01 00:35:22
e 2018-01-01 00:38:16 2018-01-01 01:31:16
3 f 2018-01-03 00:55:22 2018-01-03 00:58:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
At this point I would like to iterate through every player and compare the timestamp of each NaN value against that player's sessions, one by one, to see where it belongs.
Desired output: In the example, the first NaN should be labeled as 'b', the second one as 'e' and the last one as 'g'.
Disclaimer: I asked a similar question a few days ago (see here), and received a very good answer, but this time I must take into account another variable and I am again stuck. Indeed, the first steps in Python are exciting but very challenging.
Answer 1:
Your example is already sorted, but this should produce your desired result even if your inputs are not sorted. If this answer does not satisfy your requirements, please post an additional (or modified) sample dataframe, with the expected output, where it falls short.
df.sort_values(['player', 'date']).ffill()
Yields:
player session date
0 1 a 2018-01-01 00:19:05
1 1 a 2018-01-01 00:21:07
2 1 b 2018-01-01 00:22:07
3 1 b 2018-01-01 00:22:15
4 1 b 2018-01-01 00:25:09
5 1 c 2018-01-01 00:25:11
6 1 c 2018-01-01 00:27:28
7 1 c 2018-01-01 00:29:29
8 1 c 2018-01-01 00:30:35
9 2 d 2018-01-01 00:21:16
10 2 d 2018-01-01 00:35:22
11 2 e 2018-01-01 00:38:16
12 2 e 2018-01-01 00:38:20
13 2 e 2018-01-01 00:40:35
14 2 e 2018-01-01 01:31:16
15 3 f 2018-01-03 00:55:22
16 3 f 2018-01-03 00:58:16
17 3 g 2018-01-03 00:58:21
18 3 g 2018-03-01 01:00:35
19 3 g 2018-03-01 01:31:16
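Note that a forward fill copies the previous label without checking that the timestamp actually falls inside that session's first-to-last range, which is the condition the question describes. A minimal sketch of that range check follows, assuming the same 'player', 'session' and 'date' columns and a unique index. The function name fill_sessions_by_range and the small demo frame are hypothetical, made up for illustration; they are not from the original answer.

```python
import numpy as np
import pandas as pd


def fill_sessions_by_range(df):
    """Label NaN sessions with the session whose [min, max] range contains the row's date."""
    # Time range of every known session per player (groupby drops NaN keys).
    ranges = (df.groupby(['player', 'session'])['date']
                .agg(['min', 'max'])
                .reset_index())
    out = df.copy()
    # Assumes the index is unique, so out.loc[idx, ...] targets a single row.
    for idx, row in out[out['session'].isna()].iterrows():
        candidates = ranges[(ranges['player'] == row['player'])
                            & (ranges['min'] <= row['date'])
                            & (ranges['max'] >= row['date'])]
        # Only fill when exactly one session matches; otherwise leave NaN.
        if len(candidates) == 1:
            out.loc[idx, 'session'] = candidates['session'].iloc[0]
    return out


# Hypothetical demo data: the last row lies outside every session of player 2.
demo = pd.DataFrame({
    'player': ['1', '1', '1', '2', '2', '2'],
    'session': ['b', np.nan, 'b', 'e', 'e', np.nan],
    'date': pd.to_datetime([
        '2018-01-01 00:22:07', '2018-01-01 00:22:15', '2018-01-01 00:25:09',
        '2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 02:00:00',
    ]),
})
filled = fill_sessions_by_range(demo)
print(filled)
```

In this sketch the second row is labeled 'b' because its timestamp lies between the first and last 'b' actions, while the last row stays NaN because no session of player 2 covers it. A plain forward fill would have labeled that last row 'e'.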
Source: https://stackoverflow.com/questions/52104260/pandas-filling-missing-values-iterating-through-a-groupby-object