Question
From the following sort of dataframe I would like to be able to both sort and rank the id field on date:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3, 4, 5, 6, 6, 6, 7, 7],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
    'date': ['10/01/2017 15:45:00', '05/01/2017 15:56:00',
             '11/01/2017 15:22:00', '06/01/2017 11:02:00', '05/01/2017 09:37:00',
             '05/01/2017 09:55:00', '05/01/2017 10:08:00', '03/02/2017 08:55:00',
             '03/02/2017 09:15:00', '03/02/2017 09:31:00', '09/01/2017 15:42:00',
             '19/01/2017 16:34:00']})
to effectively rank or index, per id, based on date.
I've used
df.groupby('id')['date'].min()
which lets me extract the first date per id (although I don't know how to use this to filter out the rows), but I won't always need the first date - sometimes it will be the second or third - so I need to generate a new column with an index for each date. The result would look like:
Any ideas on this sorting/ranking/labelling?
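As an aside on the groupby('id')['date'].min() step above, one way to use those per-id minimums to filter rows (a minimal sketch on a made-up subset, assuming the dates have been parsed to datetimes) is transform, which broadcasts each group's minimum back onto that group's rows:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'date': pd.to_datetime(['10/01/2017 15:45:00', '05/01/2017 15:56:00',
                            '11/01/2017 15:22:00'], dayfirst=True)})

# transform('min') returns a Series aligned to df's index, holding each
# row's group minimum, so it can be compared element-wise to the date column
earliest = df[df['date'] == df.groupby('id')['date'].transform('min')]
```

This keeps exactly the rows carrying each id's earliest date, without a merge back from the aggregated frame.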
EDIT
My original model ignored a very prevalent issue. Some ids feasibly have multiple tests performed on them in parallel, so they appear in multiple rows of the database with matching dates (date corresponds to when they were logged). These should be counted as the same date and should not increment date_rank. I've generated a model with an updated date_rank to demonstrate how this would look:
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 6, 7, 7],
    'value': [.01, .4, .5, .7, .77, .1, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9, .1],
    'date': ['10/01/2017 15:45:00', '10/01/2017 15:45:00', '05/01/2017 15:56:00',
             '11/01/2017 15:22:00', '11/01/2017 15:22:00', '06/01/2017 11:02:00',
             '05/01/2017 09:37:00', '05/01/2017 09:37:00', '05/01/2017 09:55:00',
             '05/01/2017 09:55:00', '05/01/2017 10:08:00', '05/01/2017 10:09:00',
             '03/02/2017 08:55:00', '03/02/2017 09:15:00', '03/02/2017 09:31:00',
             '09/01/2017 15:42:00', '19/01/2017 16:34:00']})
And the counter would produce this:
Answer 1:
You can try sorting the date values in descending order and grouping on 'id'.
@praveen's logic is simpler; extending it, you can use astype('category') to convert the values to categories and retrieve the codes (keys) of those categories, but the result will be slightly different from your expected output:
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby('id').apply(
    lambda x: x['date'].astype('category').cat.codes + 1).values
Out:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 2
1 10/01/2017 15:45:00 1 0.40 2
2 05/01/2017 15:56:00 1 0.50 1
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 2
6 05/01/2017 09:37:00 3 0.20 1
7 05/01/2017 09:37:00 3 0.30 1
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 2
10 05/01/2017 10:08:00 5 0.40 1
14 03/02/2017 09:31:00 6 0.80 3
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 1
16 19/01/2017 16:34:00 7 0.10 2
15 09/01/2017 15:42:00 7 0.90 1
But to get your exact output, here I have used a dictionary, mapping each unique date to its position within the group and looking up each row's date in it:
df1 = df.sort_values(['id', 'date'], ascending=[True, False])
df1['date_rank'] = df1.groupby('id')['date'].transform(
    lambda x: [{d: i for i, d in enumerate(x.unique())}[y] + 1 for y in x])
Out:
date id value date_rank
0 10/01/2017 15:45:00 1 0.01 1
1 10/01/2017 15:45:00 1 0.40 1
2 05/01/2017 15:56:00 1 0.50 2
3 11/01/2017 15:22:00 2 0.70 1
4 11/01/2017 15:22:00 2 0.77 1
5 06/01/2017 11:02:00 3 0.10 1
6 05/01/2017 09:37:00 3 0.20 2
7 05/01/2017 09:37:00 3 0.30 2
8 05/01/2017 09:55:00 4 0.11 1
9 05/01/2017 09:55:00 4 0.21 1
11 05/01/2017 10:09:00 5 0.01 1
10 05/01/2017 10:08:00 5 0.40 2
14 03/02/2017 09:31:00 6 0.80 1
13 03/02/2017 09:15:00 6 0.50 2
12 03/02/2017 08:55:00 6 3.00 3
16 19/01/2017 16:34:00 7 0.10 1
15 09/01/2017 15:42:00 7 0.90 2
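One caveat with both snippets above: date holds strings in day-first format, so sort_values compares them lexicographically, which misorders dates across months (for instance '03/02/2017' sorts before '05/01/2017' as a string, even though 3 February is later than 5 January). A minimal sketch of parsing first with pandas' to_datetime:

```python
import pandas as pd

dates = pd.Series(['03/02/2017 08:55:00', '05/01/2017 15:56:00'])

# dayfirst=True tells pandas to read these as DD/MM/YYYY
parsed = pd.to_datetime(dates, dayfirst=True)
```

After this conversion, sorting and ranking operate on real timestamps rather than string order.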
Answer 2:
You can do this with sort_values, groupby and cumcount:
df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
demo
In [1]: df = pd.DataFrame({
...: 'id':[1, 1, 2, 3, 3, 4, 5, 6,6,6,7,7],
...: 'value':[.01, .4, .2, .3, .11, .21, .4, .01, 3, .5, .8, .9],
...: 'date':['10/01/2017 15:45:00','05/01/2017 15:56:00',
...: '11/01/2017 15:22:00','06/01/2017 11:02:00','05/01/2017 09:37:00',
...: '05/01/2017 09:55:00','05/01/2017 10:08:00','03/02/2017 08:55:00',
...: '03/02/2017 09:15:00','03/02/2017 09:31:00','09/01/2017 15:42:00',
...: '19/01/2017 16:34:00']})
...:
In [2]: df['date_rank'] = df.sort_values(['id', 'date'], ascending=[True, False]).groupby(['id']).cumcount() + 1
...:
In [3]: df
Out[3]:
id value date date_rank
0 1 0.01 10/01/2017 15:45:00 1
1 1 0.40 05/01/2017 15:56:00 2
2 2 0.20 11/01/2017 15:22:00 1
3 3 0.30 06/01/2017 11:02:00 1
4 3 0.11 05/01/2017 09:37:00 2
5 4 0.21 05/01/2017 09:55:00 1
6 5 0.40 05/01/2017 10:08:00 1
7 6 0.01 03/02/2017 08:55:00 3
8 6 3.00 03/02/2017 09:15:00 2
9 6 0.50 03/02/2017 09:31:00 1
10 7 0.80 09/01/2017 15:42:00 2
11 7 0.90 19/01/2017 16:34:00 1
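Note that cumcount numbers every row within its group, so on the edited data (where one id can have several rows with an identical timestamp) duplicates would get distinct ranks rather than sharing one. A minimal sketch of that behaviour on a two-row example:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1],
                   'date': ['10/01/2017 15:45:00', '10/01/2017 15:45:00']})

# cumcount assigns a running number to each row within its group,
# even when the dates are identical
df['date_rank'] = df.sort_values(['id', 'date'],
                                 ascending=[True, False]).groupby('id').cumcount() + 1
```

That is why the edited requirement, where tied dates must share a rank, needs a different method.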
Edit
You can do this with the rank method:
df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
demo
In [1]: df['rank'] = df.groupby(['id'])['date'].rank(ascending=False, method='dense').astype(int)
In [2]: df
Out[2]:
id value date rank
0 1 0.01 2017-10-01 15:45:00 1
1 1 0.40 2017-10-01 15:45:00 1
2 1 0.50 2017-05-01 15:56:00 2
3 2 0.70 2017-11-01 15:22:00 1
4 2 0.77 2017-11-01 15:22:00 1
5 3 0.10 2017-06-01 11:02:00 1
6 3 0.20 2017-05-01 09:37:00 2
7 3 0.30 2017-05-01 09:37:00 2
8 4 0.11 2017-05-01 09:55:00 1
9 4 0.21 2017-05-01 09:55:00 1
10 5 0.40 2017-05-01 10:08:00 2
11 5 0.01 2017-05-01 10:09:00 1
12 6 3.00 2017-03-02 08:55:00 3
13 6 0.50 2017-03-02 09:15:00 2
14 6 0.80 2017-03-02 09:31:00 1
15 7 0.90 2017-09-01 15:42:00 1
16 7 0.10 2017-01-19 16:34:00 2
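The reason method='dense' fits the edited requirement: tied values share a rank and no ranks are skipped, whereas for example method='first' would number ties in order of appearance. A small comparison sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(['10/01/2017', '10/01/2017', '05/01/2017'])

dense = s.rank(ascending=False, method='dense')  # ties share a rank
first = s.rank(ascending=False, method='first')  # ties ranked by position
```

Here dense gives the two identical dates the same rank, while first splits them, which would wrongly increment date_rank for parallel tests.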
Source: https://stackoverflow.com/questions/52661772/sorting-and-ranking-by-dates-on-a-group-in-a-pandas-df