How to get the rows of a pandas DataFrame with the maximal value in a column, while keeping the original index?

Submitted by 女生的网名这么多〃 on 2019-12-04 12:53:40
unutbu

When calling df.groupby(...).apply(foo), the type of object returned by foo affects the way the results are melded together.

If you return a Series, the index of the Series becomes the columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).

If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).

So, you can arrange for the type of output you desire by converting your Series into a DataFrame.
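To make the difference concrete, here is a minimal sketch. The sample data is the small frame used in the benchmark further down; the helper names max_as_series and max_as_frame are made up for illustration, and the explicit [['c2', 'c3']] column selection is my own assumption to keep the grouping column out of apply in a way that behaves the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
                   'c3': [1, 3, 2, 10, 12, 7]})

def max_as_series(x):
    # Hypothetical helper: returns the max row as a Series
    return x.loc[x['c3'].idxmax()]

def max_as_frame(x):
    # Hypothetical helper: same row, converted to a one-row DataFrame
    return x.loc[x['c3'].idxmax()].to_frame().T

grouped = df.groupby('c1')[['c2', 'c3']]

# Series returned: the group key becomes the index, the original labels are lost
by_series = grouped.apply(max_as_series)
print(by_series.index.tolist())

# DataFrame returned: the original labels survive as the inner index level
by_frame = grouped.apply(max_as_frame)
print(by_frame.index.tolist())

# Dropping the outer (group key) level leaves only the original labels
print(by_frame.reset_index(level=0, drop=True).index.tolist())
```

The Series version is indexed by 'a' and 'b', while the DataFrame version keeps the original row labels 1 and 4 underneath the group key.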

With Pandas 0.13 or newer, you can convert the Series with to_frame().T:

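def maxrow(x, col):
    # idxmax returns the index label of the maximum, which is what .loc expects
    # (in current pandas, argmax returns an integer position instead)
    return x.loc[x[col].idxmax()].to_frame().T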

result = df.groupby('c1').apply(maxrow, 'c3')
result = result.reset_index(level=0, drop=True)
print(result)

yields

  c1 c2  c3
1  a  c   3
4  b  c  12

In Pandas 0.12 or older, the equivalent would be:

def maxrow(x, col):
    ser = x.loc[x[col].idxmax()]
    df = pd.DataFrame({ser.name: ser}).T
    return df

By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. However, the sort raises the time complexity from O(n) to O(n log n), so it becomes slower than the to_frame solution shown above on larger DataFrames.

Here is how I benchmarked it:

import pandas as pd
import numpy as np
import timeit


def reset_df_first(df):
    df2 = df.reset_index()
    result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
    result.set_index(['index'], inplace=True)
    return result

def maxrow(x, col):
    result = x.loc[x[col].idxmax()].to_frame().T
    return result

def using_to_frame(df):
    result = df.groupby('c1').apply(maxrow, 'c3')
    result.reset_index(level=0, drop=True, inplace=True)
    return result

def using_sort(df):
    # df.sort was removed in pandas 0.20; sort_values is its replacement
    return df.sort_values('c3').groupby('c1', as_index=False).tail(1)


for N in (100, 1000, 2000):
    df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                       'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                       'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})

    df = pd.concat([df]*N)
    df.reset_index(inplace=True, drop=True)

    timing = dict()
    for func in (reset_df_first, using_to_frame, using_sort):
        timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                              'import __main__ as m ',
                              number=10)

    print('For N = {}'.format(N))
    for func in sorted(timing, key=timing.get):
        print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
    print()

yields

For N = 100
using_sort          : 0.018
using_to_frame      : 0.0265
reset_df_first      : 0.0303

For N = 1000
using_to_frame      : 0.0358    \
using_sort          : 0.036     / this is roughly where the two methods cross over in terms of performance
reset_df_first      : 0.0432

For N = 2000
using_to_frame      : 0.0457
reset_df_first      : 0.0523
using_sort          : 0.0569

(reset_df_first was another possibility I tried.)

Try this:

df.sort_values('c3').groupby('c1', as_index=False).tail(1)

(In pandas older than 0.17, use df.sort('c3') instead of df.sort_values('c3').)
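As a quick sanity check, the sort-based one-liner can be run on the small sample frame from the benchmark above. This is only a sketch: it uses sort_values (the current name for the old df.sort), and the idxmax lookup used for comparison is my own cross-check, not something taken from either answer.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'c2': ['a', 'c', 'b', 'b', 'c', 'a'],
                   'c3': [1, 3, 2, 10, 12, 7]})

# Sort by c3, then keep the last (largest) row of each c1 group.
# tail() returns rows with their original index labels.
by_sort = df.sort_values('c3').groupby('c1', as_index=False).tail(1)
print(by_sort)

# Cross-check (an assumption of this sketch, not from the answers):
# look up the label of each group's maximum and select those rows.
by_idxmax = df.loc[df.groupby('c1')['c3'].idxmax()]
print(by_idxmax.index.tolist())
```

On this data both approaches select the rows labelled 1 and 4. If a group has tied maxima, the two may pick different rows, since idxmax reports the first occurrence while the order after sorting depends on the sort algorithm.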