Pandas: filling missing values iterating through a groupby object

为君一笑 submitted on 2019-12-11 02:42:02

Question


I have the following dataset:

import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', 
'2', '2', '2', '2', '3', '3', '3', '3', '3'],
'session': ['a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 
'e', 'e', np.nan, 'e', 'f', 'f', 'g', np.nan,  'g'],
'date': ['2018-01-01 00:19:05', '2018-01-01 00:21:07', 
'2018-01-01 00:22:07', '2018-01-01 00:22:15','2018-01-01 00:25:09', 
'2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29', 
'2018-01-01 00:30:35', '2018-01-01 00:21:16', '2018-01-01 00:35:22', 
'2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 00:40:35', 
'2018-01-01 01:31:16', '2018-01-03 00:55:22', '2018-01-03 00:58:16', 
'2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:31:16']
}

# create the DataFrame
df = pd.DataFrame(data=d)
# convert the date column from strings to datetime
df['date'] = pd.to_datetime(df['date'])

df.head()

     player session        date
0       1       a 2018-01-01 00:19:05
1       1       a 2018-01-01 00:21:07
2       1       b 2018-01-01 00:22:07
3       1     NaN 2018-01-01 00:22:15
4       1       b 2018-01-01 00:25:09

So, these are my three columns:

  1. 'player' (object): three players (1, 2, 3).
  2. 'session' (object): each session id groups together a set of actions (i.e. the rows in the dataset) that a player performed online.
  3. 'date' (datetime64): the time at which each action was performed.

The problem with this dataset is that I have a timestamp for every action, but some actions are missing their session id. For each player, I want to assign a session id to those rows based on the timeline: an action with a missing id can be labeled if its timestamp falls within the time range (first action to last action) of one of that player's sessions.

Let's say I group by player and session, and compute the time range for each session:

my_agg = df.groupby(['player', 'session']).date.agg(['min', 'max'])
my_agg

                           min                 max
player session                                        
1      a       2018-01-01 00:19:05 2018-01-01 00:21:07
       b       2018-01-01 00:22:07 2018-01-01 00:25:09
       c       2018-01-01 00:25:11 2018-01-01 00:30:35
2      d       2018-01-01 00:21:16 2018-01-01 00:35:22
       e       2018-01-01 00:38:16 2018-01-01 01:31:16
3      f       2018-01-03 00:55:22 2018-01-03 00:58:16
       g       2018-01-03 00:58:21 2018-03-01 01:31:16

At this point I would like to iterate through every player and compare the timestamps of the NaN rows against each session's range to see where they belong.

Desired output: in the example, the first NaN should be labeled 'b', the second 'e' and the last 'g'.
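
The comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a solution taken from the thread: the helper name fill_sessions_by_range is hypothetical, the sample frame is a subset of the rows shown earlier (players 1 and 2), and a NaN row is labeled only if its timestamp lies inside the [min, max] range of a session belonging to the same player.

```python
import numpy as np
import pandas as pd

# Subset of the example rows above: sessions a and b for player 1, session e for player 2.
df = pd.DataFrame({
    'player': ['1', '1', '1', '1', '1', '2', '2', '2', '2'],
    'session': ['a', 'a', 'b', np.nan, 'b', 'e', 'e', np.nan, 'e'],
    'date': pd.to_datetime([
        '2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:07',
        '2018-01-01 00:22:15', '2018-01-01 00:25:09', '2018-01-01 00:38:16',
        '2018-01-01 00:38:20', '2018-01-01 00:40:35', '2018-01-01 01:31:16',
    ]),
})


def fill_sessions_by_range(frame):
    """Hypothetical helper: label NaN sessions that fall inside a session's time range."""
    out = frame.copy()
    # First and last timestamp per (player, session); rows with a NaN session
    # are dropped from the groups by default.
    ranges = out.groupby(['player', 'session'])['date'].agg(['min', 'max'])
    for (player, session), row in ranges.iterrows():
        # Rows of this player that are still unlabeled and lie within the range
        # (both endpoints inclusive).
        mask = (
            out['session'].isna()
            & (out['player'] == player)
            & out['date'].between(row['min'], row['max'])
        )
        out.loc[mask, 'session'] = session
    return out


filled = fill_sessions_by_range(df)
print(filled)
```

Rows whose timestamp falls outside every session range of their player stay NaN, which is the main difference from a plain forward fill. If two sessions of the same player overlapped in time, this sketch would assign the first matching session in the iteration order of the grouped result.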

Disclaimer: I asked a similar question a few days ago (see here), and received a very good answer, but this time I must take into account another variable and I am again stuck. Indeed, the first steps in Python are exciting but very challenging.


Answer 1:


Your example is already sorted, but this should produce your desired result even if your inputs are not sorted. If this answer does not satisfy your requirements, please post an additional (or modified) sample dataframe, with the expected output, for which it fails.

df.sort_values(['player', 'date']).ffill()

Yields:

   player session                date
0       1       a 2018-01-01 00:19:05
1       1       a 2018-01-01 00:21:07
2       1       b 2018-01-01 00:22:07
3       1       b 2018-01-01 00:22:15
4       1       b 2018-01-01 00:25:09
5       1       c 2018-01-01 00:25:11
6       1       c 2018-01-01 00:27:28
7       1       c 2018-01-01 00:29:29
8       1       c 2018-01-01 00:30:35
9       2       d 2018-01-01 00:21:16
10      2       d 2018-01-01 00:35:22
11      2       e 2018-01-01 00:38:16
12      2       e 2018-01-01 00:38:20
13      2       e 2018-01-01 00:40:35
14      2       e 2018-01-01 01:31:16
15      3       f 2018-01-03 00:55:22
16      3       f 2018-01-03 00:58:16
17      3       g 2018-01-03 00:58:21
18      3       g 2018-03-01 01:00:35
19      3       g 2018-03-01 01:31:16
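
One caveat worth illustrating: a forward fill over the whole frame does not look at player boundaries, so if a player's first row after sorting had no session id, it would inherit the last session of the previous player. The sketch below uses a small made-up frame (the name toy and the 00:20:00 timestamp are hypothetical, not part of the question's data) to compare the plain forward fill with a per-player fill via groupby. Neither variant checks whether the timestamp actually lies inside a session's time range.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration: player 2's earliest action has no session id.
toy = pd.DataFrame({
    'player': ['1', '1', '2', '2'],
    'session': ['a', 'a', np.nan, 'd'],
    'date': pd.to_datetime([
        '2018-01-01 00:19:05', '2018-01-01 00:21:07',
        '2018-01-01 00:20:00', '2018-01-01 00:21:16',
    ]),
}).sort_values(['player', 'date'])

# Plain forward fill: session 'a' of player 1 leaks into player 2's first row.
plain = toy.ffill()

# Per-player forward fill: values are only carried forward within one player,
# so player 2's first row stays NaN.
per_player = toy.assign(session=toy.groupby('player')['session'].ffill())

print(plain)
print(per_player)
```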


Source: https://stackoverflow.com/questions/52104260/pandas-filling-missing-values-iterating-through-a-groupby-object
