问题
Python newbie here, this is my first question. I tried to find a solution on similar SO questions, like this one, this one, and also this one, but I think my problem is different.
Here's my situation: I have a quite large dataset with two columns: Date (datetime object), and session_id (integer). The timestamps refer to the moment where a certain action occurred during an online session.
My problem is that I have all the dates, but I am missing some of the corresponding session_id values. What I would like to do is to fill these missing values using the date column:
- If the action occurred between the first and last date of a certain session, I would like to fill the missing value with the id of that session.
- I would mark as '0' the session where the action occurred outside the range of any session -
- and mark it as '-99' if it is not possible to associate the event to a single session, because it occurred during the time range of different session.
To give an example of my problem, let's consider the toy dataset below, where I have just three sessions: a, b, c. Session a and b registered three events, session c two. Moreover, I have three missing id values.
| DATE |sess_id|
----------------------------------
0 | 2018-01-01 00:19:01 | a |
1 | 2018-01-01 00:19:05 | b |
2 | 2018-01-01 00:21:07 | a |
3 | 2018-01-01 00:22:07 | b |
4 | 2018-01-01 00:25:09 | c |
5 | 2018-01-01 00:25:11 | Nan |
6 | 2018-01-01 00:27:28 | c |
7 | 2018-01-01 00:29:29 | a |
8 | 2018-01-01 00:30:35 | Nan |
9 | 2018-01-01 00:31:16 | b |
10 | 2018-01-01 00:35:22 | Nan |
...
[Image_Timeline example][1]
This is what I would like to obtain:
| DATE |sess_id|
----------------------------------
0 | 2018-01-01 00:19:01 | a |
1 | 2018-01-01 00:19:05 | b |
2 | 2018-01-01 00:21:07 | a |
3 | 2018-01-01 00:22:07 | b |
4 | 2018-01-01 00:25:09 | c |
5 | 2018-01-01 00:25:11 | -99 |
6 | 2018-01-01 00:27:28 | c |
7 | 2018-01-01 00:29:29 | a |
8 | 2018-01-01 00:30:35 | b |
9 | 2018-01-01 00:31:16 | b |
10 | 2018-01-01 00:35:22 | 0 |
...
In this way I will be able to recover at least some of the events without session code. I think that maybe the first thing to do is to compute two new columns showing the first and last time value for each session, something like that:
foo['last'] = foo.groupby('sess_id')['DATE'].transform(max)
foo['firs'] = foo.groupby('SESSIONCODE')['DATE'].transform(min)
And then use first-last time value to check whether each event whose session id is unknown falls withing that range.
回答1:
Your intuition seems fine by me, but you can't apply it this way since your dataframe foo
doens't have the same size as your groupby
dataframe. What you could do is map the values like this:
foo['last'] = foo.sess_id.map(foo.groupby('sess_id').DATE.max())
foo['first'] = foo.sess_id.map(foo.groupby('sess_id').DATE.min())
But I don't think it's necessary, you can just use the groupby dataframe as such.
A way to solve your problem could be to look for the missing values in sess_id
column, and apply a custom function to the corresponding dates:
def my_custom_function(time):
current_sessions = my_agg.loc[(my_agg['min']<time) & (my_agg['max']>time)]
count = len(current_sessions)
if count == 0:
return 0
if count > 1:
return -99
return current_sessions.index[0]
my_agg = foo.groupby('sess_id').DATE.agg([min,max])
foo.loc[foo.sess_id.isnull(),'sess_id'] = foo.loc[foo.sess_id.isnull(),'DATE'].apply(my_custom_function)
Output:
DATE sess_id
0 2018-01-01 00:19:01 a
1 2018-01-01 00:19:05 b
2 2018-01-01 00:21:07 a
3 2018-01-01 00:22:07 b
4 2018-01-01 00:25:09 c
5 2018-01-01 00:25:11 -99
6 2018-01-01 00:27:28 c
7 2018-01-01 00:29:29 a
8 2018-01-01 00:30:35 b
9 2018-01-01 00:31:16 b
10 2018-01-01 00:35:22 0
I think it performs what you are looking for, though the output you posted in your question seems to contain typos.
来源:https://stackoverflow.com/questions/51984239/pandas-fill-missing-values-of-a-column-based-on-the-datetime-values-of-another-c