Python pandas idxmax for multiple indexes in a dataframe

Question

I have a series that looks like this:

            delivery
2007-04-26  706           23
2007-04-27  705           10
            706         1089
            708           83
            710           13
            712           51
            802            4
            806            1
            812            3
2007-04-29  706           39
            708            4
            712            1
2007-04-30  705            3
            706         1016
            707            2
...
2014-11-04  1412          53
            1501           1
            1502           1
            1512           1
2014-11-05  1411          47
            1412        1334
            1501          40
            1502         433
            1504         126
            1506         100
            1508           7
            1510           6
            1512          51
            1604           1
            1612           5
Length: 26255, dtype: int64

where the query is: df.groupby([df.index.date, 'delivery']).size()

For each day, I need to pull out the delivery number which has the most volume. I feel like it would be something like:

df.groupby([df.index.date, 'delivery']).size().idxmax(axis=1)

However, this just returns the idxmax for the entire Series; instead, I need the second-level idxmax (the delivery number, not the date) for each day, i.e. the result should be a vector with one entry per day.

Any ideas on how to accomplish this?

Answer 1:

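Your example code doesn't work because idxmax is called on the result of the groupby operation, which is a single Series, so it returns the one label holding the maximum of the whole Series rather than one label per day.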
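To make this concrete, here is a minimal sketch with a handful of made-up counts shaped like the output of size() (the index names and values are assumptions for illustration only). Calling idxmax on such a Series gives back a single (date, delivery) tuple:

```python
import pandas as pd

# Hypothetical counts indexed by (date, delivery), shaped like the size() result
idx = pd.MultiIndex.from_tuples(
    [("2007-04-26", 706), ("2007-04-27", 705), ("2007-04-27", 706),
     ("2007-04-29", 706), ("2007-04-29", 708)],
    names=["date", "delivery"],
)
counts = pd.Series([23, 10, 1089, 39, 4], index=idx)

# idxmax looks at the whole Series, so only one label comes back
overall = counts.idxmax()
print(overall)
```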

I'm not sure how to use idxmax on multilevel indexes, so here's a simple workaround.

Setting up data:

import pandas as pd
d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11]}

df = pd.DataFrame.from_dict(d, orient='columns').set_index('Date')
print(df)

Output:

            DeliveryCount  DeliveryNb
Date                                 
2007-04-26             23         706
2007-04-27             10         705
2007-04-27           1089         708
2007-04-27             82         450
2007-04-27             34         283
2007-04-28            100          45
2007-04-28             11          89

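Creating a custom function:

The trick is to use the reset_index() method, so that idxmax returns the integer position of the maximum within the group, which iloc can then use.

def func(df):
    # After reset_index() the group has a fresh 0..n-1 index, so idxmax is a position
    idx = df.reset_index()['DeliveryCount'].idxmax()
    # Use that position to look up the matching delivery number
    return df['DeliveryNb'].iloc[idx]

Applying it: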

g = df.groupby(df.index)
g.apply(func)

Result:

Date
2007-04-26    706
2007-04-27    708
2007-04-28     45
dtype: int64
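The same result can likely be reached without a custom function. The sketch below reuses the sample data from this answer, moves Date back into a column so every row has a unique integer label, and then selects the rows that groupby(...).idxmax() points at. The names flat, best_rows and result are only illustrative.

```python
import pandas as pd

d = {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
              '2007-04-27', '2007-04-28', '2007-04-28'],
     'DeliveryNb': [706, 705, 708, 450, 283, 45, 89],
     'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11]}
df = pd.DataFrame(d).set_index('Date')

# Give every row a unique integer label so idxmax is unambiguous
flat = df.reset_index()

# Row label of the largest DeliveryCount within each date
best_rows = flat.loc[flat.groupby('Date')['DeliveryCount'].idxmax()]

# Keep only the delivery number, indexed by date
result = best_rows.set_index('Date')['DeliveryNb']
print(result)
```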

Answer 2:

Suppose you have this series:

            delivery
2001-01-02  0           2
            1           3
            6           2
            7           2
            9           3
2001-01-03  3           2
            6           1
            7           1
            8           3
            9           1
dtype: int64

If you want one delivery per date with the maximum value, you could use idxmax:

dates = series.index.get_level_values(0)
series.loc[series.groupby(dates).idxmax()]

yields

            delivery
2001-01-02  1           3
2001-01-03  8           3
dtype: int64
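Since the question asks only for the delivery number on each day, one possible follow-up is to take the second element of each (date, delivery) tuple that the grouped idxmax returns. This is a sketch on a small hand-built Series (the values and the index names date and delivery are assumptions); on ties, idxmax keeps the first occurrence.

```python
import pandas as pd

# Small hand-built stand-in for the series above
idx = pd.MultiIndex.from_tuples(
    [("2001-01-02", 0), ("2001-01-02", 1), ("2001-01-02", 9),
     ("2001-01-03", 3), ("2001-01-03", 8)],
    names=["date", "delivery"],
)
series = pd.Series([2, 3, 3, 2, 3], index=idx)

# One (date, delivery) tuple per date: the label of that date's maximum
best = series.groupby(level="date").idxmax()

# Keep only the delivery part of each tuple
best_delivery = best.map(lambda label: label[1])
print(best_delivery)
```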

If you want all deliveries per date with the maximum value, use transform to generate a boolean mask:

mask = series.groupby(dates).transform(lambda x: x==x.max()).astype('bool')
series.loc[mask]

yields

            delivery
2001-01-02  1           3
            9           3
2001-01-03  8           3
dtype: int64
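A variation on the same transform idea, sketched here on a small hand-built Series (values and index names are assumptions), is to broadcast each date's maximum back onto every row with transform("max") and compare against it. This avoids the lambda and keeps all tied rows.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("2001-01-02", 0), ("2001-01-02", 1), ("2001-01-02", 9),
     ("2001-01-03", 3), ("2001-01-03", 8)],
    names=["date", "delivery"],
)
series = pd.Series([2, 3, 3, 2, 3], index=idx)

# Each row receives the maximum of its own date
group_max = series.groupby(level="date").transform("max")

# Rows equal to their date's maximum, ties included
ties = series[series == group_max]
print(ties)
```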

This is the code I used to generate series:

import pandas as pd
import numpy as np

np.random.seed(1)
N = 20
rng = pd.date_range('2001-01-02', periods=N//2, freq='4H')
rng = np.random.choice(rng, N, replace=True)
rng.sort()
df = pd.DataFrame(np.random.randint(10, size=(N,)), columns=['delivery'], index=rng)
series = df.groupby([df.index.date, 'delivery']).size()

Answer 3:

If you have the following dataframe (you can always reset the index if needed with df = df.reset_index()):

  Date  Del_Count  Del_Nb
0  1/1      14      19   <
1           11      17
2  2/2      25      29   <
3           21      27
4           22      28
5  3/3      34      36
6           37      37
7           31      39   <

To find the row with the maximum Del_Nb for each Date (marked with < above) and extract the corresponding Del_Count, you can use:

df = df.loc[df.groupby(['Date'], sort=False)['Del_Nb'].idxmax()][['Date','Del_Count','Del_Nb']]

Which would yield:

  Date  Del_Count  Del_Nb
0  1/1         14      19
2  2/2         25      29
7  3/3         31      39
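For anyone who wants to run this, here is a self-contained sketch. It assumes that the blank Date cells in the table above repeat the date of the preceding row, and it uses .loc for the label-based selection; the name best is only illustrative.

```python
import pandas as pd

# Assumption: blank Date cells in the table repeat the date above them
df = pd.DataFrame({
    "Date": ["1/1", "1/1", "2/2", "2/2", "2/2", "3/3", "3/3", "3/3"],
    "Del_Count": [14, 11, 25, 21, 22, 34, 37, 31],
    "Del_Nb": [19, 17, 29, 27, 28, 36, 37, 39],
})

# Row label of the largest Del_Nb within each date, then select those rows
best = df.loc[df.groupby("Date", sort=False)["Del_Nb"].idxmax()]
print(best)
```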

Source: https://stackoverflow.com/questions/27914360/python-pandas-idxmax-for-multiple-indexes-in-a-dataframe

Tags

python

pandas

multi-index