Pandas fill missing values of a column based on the datetime values of another column

Submitted by 不想你离开。 on 2019-12-02 07:20:55

Question


Python newbie here, this is my first question. I tried to find a solution on similar SO questions, like this one, this one, and also this one, but I think my problem is different.

Here's my situation: I have quite a large dataset with two columns: Date (datetime object) and session_id (integer). Each timestamp refers to the moment when a certain action occurred during an online session.

My problem is that I have all the dates, but I am missing some of the corresponding session_id values. What I would like to do is to fill these missing values using the date column:

  1. If the action occurred between the first and last date of a certain session, I would like to fill the missing value with the id of that session.
  2. If the action occurred outside the time range of every session, I would like to mark it as '0'.
  3. If the event cannot be associated with a single session, because it occurred within the time range of more than one session, I would like to mark it as '-99'.

To give an example of my problem, let's consider the toy dataset below, where I have just three sessions: a, b, c. Sessions a and b registered three events each, session c two. Moreover, I have three missing id values.

   |        DATE         | sess_id |
------------------------------------
 0 | 2018-01-01 00:19:01 | a       |
 1 | 2018-01-01 00:19:05 | b       |
 2 | 2018-01-01 00:21:07 | a       |
 3 | 2018-01-01 00:22:07 | b       |
 4 | 2018-01-01 00:25:09 | c       |
 5 | 2018-01-01 00:25:11 | NaN     |
 6 | 2018-01-01 00:27:28 | c       |
 7 | 2018-01-01 00:29:29 | a       |
 8 | 2018-01-01 00:30:35 | NaN     |
 9 | 2018-01-01 00:31:16 | b       |
10 | 2018-01-01 00:35:22 | NaN     |
...
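
For anyone who wants to reproduce the example, the toy table above could be built roughly as follows. This is only a sketch: the name foo and the column names DATE and sess_id are taken from the code further down, and representing the missing ids with np.nan is an assumption about how the real data looks.

```python
import numpy as np
import pandas as pd

# Toy dataset from the table above; np.nan marks the missing session ids
# (assumption: the real data uses NaN for missing values).
foo = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01 00:19:01", "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:22:07", "2018-01-01 00:25:09", "2018-01-01 00:25:11",
        "2018-01-01 00:27:28", "2018-01-01 00:29:29", "2018-01-01 00:30:35",
        "2018-01-01 00:31:16", "2018-01-01 00:35:22",
    ]),
    "sess_id": ["a", "b", "a", "b", "c", np.nan, "c", "a", np.nan, "b", np.nan],
})

print(foo)
```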

[Image: timeline example]

This is what I would like to obtain:

   |        DATE         | sess_id |
------------------------------------
 0 | 2018-01-01 00:19:01 | a       |
 1 | 2018-01-01 00:19:05 | b       |
 2 | 2018-01-01 00:21:07 | a       |
 3 | 2018-01-01 00:22:07 | b       |
 4 | 2018-01-01 00:25:09 | c       |
 5 | 2018-01-01 00:25:11 | -99     |
 6 | 2018-01-01 00:27:28 | c       |
 7 | 2018-01-01 00:29:29 | a       |
 8 | 2018-01-01 00:30:35 | b       |
 9 | 2018-01-01 00:31:16 | b       |
10 | 2018-01-01 00:35:22 | 0       |
...

In this way I will be able to recover at least some of the events that have no session code. I think that maybe the first thing to do is to compute two new columns showing the first and last time value for each session, something like this:

foo['last'] = foo.groupby('sess_id')['DATE'].transform('max')
foo['first'] = foo.groupby('sess_id')['DATE'].transform('min')

And then use the first and last time values to check whether each event whose session id is unknown falls within that range.


Answer 1:


Your intuition seems fine to me, but you can't apply it this way, since your dataframe foo doesn't have the same size as your groupby dataframe. What you could do is map the values like this:

foo['last'] = foo.sess_id.map(foo.groupby('sess_id').DATE.max())
foo['first'] = foo.sess_id.map(foo.groupby('sess_id').DATE.min())

But I don't think that's necessary; you can just use the groupby dataframe as it is.
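
To make the "groupby dataframe" concrete, here is a small sketch of what it looks like on the toy data from the question. The construction of foo is an assumption (missing ids written as np.nan); the point is that the aggregate has one row per session and that rows without a sess_id are left out of the groups by default.

```python
import numpy as np
import pandas as pd

# Toy data from the question (assumption: missing ids are NaN).
foo = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01 00:19:01", "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:22:07", "2018-01-01 00:25:09", "2018-01-01 00:25:11",
        "2018-01-01 00:27:28", "2018-01-01 00:29:29", "2018-01-01 00:30:35",
        "2018-01-01 00:31:16", "2018-01-01 00:35:22",
    ]),
    "sess_id": ["a", "b", "a", "b", "c", np.nan, "c", "a", np.nan, "b", np.nan],
})

# One row per session, indexed by sess_id, with its first and last timestamp.
# groupby skips rows whose key is NaN, so the three unknown events do not
# influence the ranges.
my_agg = foo.groupby("sess_id")["DATE"].agg(["min", "max"])

print(my_agg)
# Session a spans 00:19:01 to 00:29:29, b spans 00:19:05 to 00:31:16,
# and c spans 00:25:09 to 00:27:28.
```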

A way to solve your problem could be to look for the missing values in the sess_id column, and apply a custom function to the corresponding dates:

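# First and last timestamp of each session (rows with a missing sess_id are skipped by groupby)
my_agg = foo.groupby('sess_id').DATE.agg(['min', 'max'])

def my_custom_function(time):
    # Sessions whose time range strictly contains this timestamp
    current_sessions = my_agg.loc[(my_agg['min'] < time) & (my_agg['max'] > time)]
    count = len(current_sessions)
    if count == 0:
        return 0      # outside the range of every session
    if count > 1:
        return -99    # ambiguous: inside the range of several sessions
    return current_sessions.index[0]

# Fill only the rows whose sess_id is missing
missing = foo.sess_id.isnull()
foo.loc[missing, 'sess_id'] = foo.loc[missing, 'DATE'].apply(my_custom_function)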

Output:

    DATE                    sess_id
0   2018-01-01 00:19:01     a
1   2018-01-01 00:19:05     b
2   2018-01-01 00:21:07     a
3   2018-01-01 00:22:07     b
4   2018-01-01 00:25:09     c
5   2018-01-01 00:25:11     -99
6   2018-01-01 00:27:28     c
7   2018-01-01 00:29:29     a
8   2018-01-01 00:30:35     b
9   2018-01-01 00:31:16     b
10  2018-01-01 00:35:22     0

I think it does what you are looking for, though the output you posted in your question seems to contain typos.
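
Since the question mentions a quite large dataset, a row-by-row apply may become slow. As a possible alternative (not part of the approach above, just a sketch of a different technique), the same three rules can be evaluated with NumPy broadcasting. The toy data and the names bounds, inside and hits are assumptions made for illustration, and the intermediate boolean matrix has one cell per (missing event, session) pair, so it only suits cases where that product fits in memory.

```python
import numpy as np
import pandas as pd

# Toy data from the question (assumption: missing ids are NaN).
foo = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01 00:19:01", "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:22:07", "2018-01-01 00:25:09", "2018-01-01 00:25:11",
        "2018-01-01 00:27:28", "2018-01-01 00:29:29", "2018-01-01 00:30:35",
        "2018-01-01 00:31:16", "2018-01-01 00:35:22",
    ]),
    "sess_id": ["a", "b", "a", "b", "c", np.nan, "c", "a", np.nan, "b", np.nan],
})

# First and last timestamp per session.
bounds = foo.groupby("sess_id")["DATE"].agg(["min", "max"])

missing = foo["sess_id"].isna()
times = foo.loc[missing, "DATE"].to_numpy()[:, None]   # shape (n_missing, 1)
starts = bounds["min"].to_numpy()[None, :]             # shape (1, n_sessions)
ends = bounds["max"].to_numpy()[None, :]

# inside[i, j] is True when missing event i lies strictly within session j.
inside = (starts < times) & (times < ends)
hits = inside.sum(axis=1)                              # sessions matched per event

# Label of the first matching session (only meaningful where hits == 1).
first_hit = bounds.index.to_numpy()[inside.argmax(axis=1)]

filled = np.empty(len(hits), dtype=object)
filled[:] = -99                                        # default: ambiguous
filled[hits == 0] = 0                                  # outside every session
filled[hits == 1] = first_hit[hits == 1]               # exactly one session

foo.loc[missing, "sess_id"] = filled
print(foo)
```

On the toy data this should give the same sess_id column as the apply-based version; the strict inequalities mirror the ones used in my_custom_function.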



Source: https://stackoverflow.com/questions/51984239/pandas-fill-missing-values-of-a-column-based-on-the-datetime-values-of-another-c
