Question
From the following sort of dataframe I would like to be able to both sort and rank the id field on date:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 7],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
    'date': ['10/01/2017 15:45:00', '05/01/2017 15:56:00',
             '11/01/2017 15:22:00', '06/01/2017 11:02:00', '05/01/2017 09:37:00',
             '05/01/2017 09:55:00', '05/01/2017 10:08:00', '03/02/2017 08:55:00',
             '03/02/2017 09:15:00', '03/02/2017 09:31:00', '09/01/2017 15:42:00',
             '19/01/2017 16:34:00']})
to effectively rank or index, per id, based on date.
I've used
df.groupby('id')['date'].min()
which lets me extract the first date per id (although I don't know how to use this to filter out the rows), but I won't always need the first date - sometimes it will be the second or third - so I need to generate a new column with an index for each date. The result would look like:
Any ideas on this sorting/ranking/labelling?
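As an aside on the groupby('id')['date'].min() step above, one way to use those per-id minimums to filter rows (a minimal sketch on a made-up subset, assuming the dates have been parsed to datetimes) is transform, which broadcasts each group's minimum back onto that group's rows:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'date': pd.to_datetime(['10/01/2017 15:45:00', '05/01/2017 15:56:00',
                            '11/01/2017 15:22:00'], dayfirst=True)})

# transform('min') returns a Series aligned to df's index, holding each
# row's group minimum, so it can be compared element-wise to the date column
earliest = df[df['date'] == df.groupby('id')['date'].transform('min')]
```

This keeps exactly the rows carrying each id's earliest date, without a merge back from the aggregated frame.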
EDIT
My original model ignored a very prevalent issue. Some ids feasibly have multiple tests performed on them in parallel, so they appear in multiple rows of the database with matching dates (date corresponds to when they were logged). These should be counted as the same date and should not increment date_rank. I've generated a model with an updated date_rank to demonstrate how this would look:
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7],
    'value': [.01, .4, .5, .7, .77, .1, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
    'date': ['10/01/2017 15:45:00', '10/01/2017 15:45:00', '05/01/2017 15:56:00',
             '11/01/2017 15:22:00', '11/01/2017 15:22:00', '06/01/2017 11:02:00',
             '05/01/2017 09:37:00', '05/01/2017 09:37:00', '05/01/2017 09:55:00',
             '05/01/2017 09:55:00', '05/01/2017 10:08:00', '05/01/2017 10:09:00',
             '03/02/2017 08:55:00', '03/02/2017 09:15:00', '03/02/2017 09:31:00',
             '09/01/2017 15:42:00', '19/01/2017 16:34:00']})
And the counter would produce this:
Answer 1:
You can try sorting the date values in descending order and grouping on 'id'.
@praveen's logic is simpler; extending it, you can use astype('category') to convert the values to categories and retrieve the codes (keys) of those categories, but the result will be slightly different from your expected output:
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby('id').apply(
    lambda x: x['date'].astype('category').cat.codes + 1).values
Out:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 2
1 10/01/2017 15:45:00 1 0.40 2
2 05/01/2017 15:56:00 1 0.50 1
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 2
6 05/01/2017 09:37:00 3 0.20 1
7 05/01/2017 09:37:00 3 0.30 1
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 2
10 05/01/2017 10:08:00 5 0.40 1
14 03/02/2017 09:31:00 6 0.80 3
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 1
16 19/01/2017 16:34:00 7 0.10 2
15 09/01/2017 15:42:00 7 0.90 1
But to get your exact output, here I have used a dictionary, mapping each unique date to its position within the group and looking up each row's date in it:
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby('id')['date'].transform(
    lambda x: [{d: i for i, d in enumerate(x.unique())}[y] + 1 for y in x])
Out:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 1
1 10/01/2017 15:45:00 1 0.40 1
2 05/01/2017 15:56:00 1 0.50 2
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 1
6 05/01/2017 09:37:00 3 0.20 2
7 05/01/2017 09:37:00 3 0.30 2
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 1
10 05/01/2017 10:08:00 5 0.40 2
14 03/02/2017 09:31:00 6 0.80 1
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 3
16 19/01/2017 16:34:00 7 0.10 1
15 09/01/2017 15:42:00 7 0.90 2
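One caveat with both snippets above: date holds strings in day-first format, so sort_values compares them lexicographically, which misorders dates across months (for instance '03/02/2017' sorts before '05/01/2017' as a string, even though 3 February is later than 5 January). A minimal sketch of parsing first with pandas' to_datetime:

```python
import pandas as pd

dates = pd.Series(['03/02/2017 08:55:00', '05/01/2017 15:56:00'])

# dayfirst=True tells pandas to read these as DD/MM/YYYY
parsed = pd.to_datetime(dates, dayfirst=True)
```

After this conversion, sorting and ranking operate on real timestamps rather than string order.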
Answer 2:
You can do this with sort_values, groupby and cumcount:
df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
demo
In [1]: df = pd.DataFrame({
...: 'id':[1, 1, 2, 3, 3, 4, 5, 6,6,6,7,7],
...: 'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
...: 'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
...: '11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
...: '05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
...: '03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
...: '19/01/2017 16:34:00']})
...:
In [2]: df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
...:
In [3]: df
Out[3]:
id value date date_rank
0 1 0.01 10/01/2017 15:45:00 1
1 1 0.40 05/01/2017 15:56:00 2
2 2 0.20 11/01/2017 15:22:00 1
3 3 0.30 06/01/2017 11:02:00 1
4 3 0.11 05/01/2017 09:37:00 2
5 4 0.21 05/01/2017 09:55:00 1
6 5 0.40 05/01/2017 10:08:00 1
7 6 0.01 03/02/2017 08:55:00 3
8 6 3.00 03/02/2017 09:15:00 2
9 6 0.50 03/02/2017 09:31:00 1
10 7 0.80 09/01/2017 15:42:00 2
11 7 0.90 19/01/2017 16:34:00 1
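Note that cumcount numbers every row within its group, so on the edited data (where one id can have several rows with an identical timestamp) duplicates would get distinct ranks rather than sharing one. A minimal sketch of that behaviour on a two-row example:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1],
                   'date': ['10/01/2017 15:45:00', '10/01/2017 15:45:00']})

# cumcount assigns a running number to each row within its group,
# even when the dates are identical
df['date_rank'] = df.sort_values(['id', 'date'],
                                 ascending=[True, False]).groupby('id').cumcount() + 1
```

That is why the edited requirement, where tied dates must share a rank, needs a different method.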
Edit
You can do this with the rank method:
df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
demo
In [1]: df['rank'] = df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
In [2]: df
Out[2]:
id value date rank
0 1 0.01 2017-10-01 15:45:00 1
1 1 0.40 2017-10-01 15:45:00 1
2 1 0.50 2017-05-01 15:56:00 2
3 2 0.70 2017-11-01 15:22:00 1
4 2 0.77 2017-11-01 15:22:00 1
5 3 0.10 2017-06-01 11:02:00 1
6 3 0.20 2017-05-01 09:37:00 2
7 3 0.30 2017-05-01 09:37:00 2
8 4 0.11 2017-05-01 09:55:00 1
9 4 0.21 2017-05-01 09:55:00 1
10 5 0.40 2017-05-01 10:08:00 2
11 5 0.01 2017-05-01 10:09:00 1
12 6 3.00 2017-03-02 08:55:00 3
13 6 0.50 2017-03-02 09:15:00 2
14 6 0.80 2017-03-02 09:31:00 1
15 7 0.90 2017-09-01 15:42:00 1
16 7 0.10 2017-01-19 16:34:00 2
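The reason method='dense' fits the edited requirement: tied values share a rank and no ranks are skipped, whereas for example method='first' would number ties in order of appearance. A small comparison sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['10/01/2017', '10/01/2017', '05/01/2017'])

dense = s.rank(ascending=False, method='dense')  # ties share a rank
first = s.rank(ascending=False, method='first')  # ties ranked by position
```

Here dense gives the two identical dates the same rank, while first splits them, which would wrongly increment date_rank for parallel tests.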
Source: https://stackoverflow.com/questions/52661772/sorting-and-ranking-by-dates-on-a-group-in-a-pandas-df