Custom sorting in pandas dataframe

前端 未结 4 2084
情歌与酒
情歌与酒 2020-11-22 16:34

I have python pandas dataframe, in which a column contains month name.

How can I do a custom sort using a dictionary, for example:

custom_dict = {\'         


        
相关标签:
4条回答
  • 2020-11-22 16:54

    Pandas 0.15 introduced Categorical Series, which allows a much clearer way to do this:

    First make the month column a categorical and specify the ordering to use.

    In [21]: df['m'] = pd.Categorical(df['m'], ["March", "April", "Dec"])
    
    In [22]: df  # looks the same!
    Out[22]:
       a  b      m
    0  1  2  March
    1  5  6    Dec
    2  3  4  April
    

    Now, when you sort the month column it will sort with respect to that list:

    In [23]: df.sort_values("m")
    Out[23]:
       a  b      m
    0  1  2  March
    2  3  4  April
    1  5  6    Dec
    

    Note: if a value is not in the list it will be converted to NaN.


    An older answer for those interested...

    You could create an intermediary series, and set_index on that:

    df = pd.DataFrame([[1, 2, 'March'],[5, 6, 'Dec'],[3, 4, 'April']], columns=['a','b','m'])
    s = df['m'].apply(lambda x: {'March':0, 'April':1, 'Dec':3}[x])
    s.sort_values()
    
    In [4]: df.set_index(s.index).sort()
    Out[4]: 
       a  b      m
    0  1  2  March
    1  3  4  April
    2  5  6    Dec
    

    As commented, in newer pandas, Series has a replace method to do this more elegantly:

    s = df['m'].replace({'March':0, 'April':1, 'Dec':3})
    

    The slight difference is that this won't raise if there is a value outside of the dictionary (it'll just stay the same).

    0 讨论(0)
  • 2020-11-22 16:59

    A bit late to the game, but here's a way to create a function that sorts pandas Series, DataFrame, and multiindex DataFrame objects using arbitrary functions.

    I make use of the df.iloc[index] method, which references a row in a Series/DataFrame by position (compared to df.loc, which references by value). Using this, we just have to have a function that returns a series of positional arguments:

    def sort_pd(key=None,reverse=False,cmp=None):
        def sorter(series):
            series_list = list(series)
            return [series_list.index(i) 
               for i in sorted(series_list,key=key,reverse=reverse,cmp=cmp)]
        return sorter
    

    You can use this to create custom sorting functions. This works on the dataframe used in Andy Hayden's answer:

    df = pd.DataFrame([
        [1, 2, 'March'],
        [5, 6, 'Dec'],
        [3, 4, 'April']], 
      columns=['a','b','m'])
    
    custom_dict = {'March':0, 'April':1, 'Dec':3}
    sort_by_custom_dict = sort_pd(key=custom_dict.get)
    
    In [6]: df.iloc[sort_by_custom_dict(df['m'])]
    Out[6]:
       a  b  m
    0  1  2  March
    2  3  4  April
    1  5  6  Dec
    

    This also works on multiindex DataFrames and Series objects:

    months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    
    df = pd.DataFrame([
        ['New York','Mar',12714],
        ['New York','Apr',89238],
        ['Atlanta','Jan',8161],
        ['Atlanta','Sep',5885],
      ],columns=['location','month','sales']).set_index(['location','month'])
    
    sort_by_month = sort_pd(key=months.index)
    
    In [10]: df.iloc[sort_by_month(df.index.get_level_values('month'))]
    Out[10]:
                     sales
    location  month  
    Atlanta   Jan    8161
    New York  Mar    12714
              Apr    89238
    Atlanta   Sep    5885
    
    sort_by_last_digit = sort_pd(key=lambda x: x%10)
    
    In [12]: pd.Series(list(df['sales'])).iloc[sort_by_last_digit(df['sales'])]
    Out[12]:
    2    8161
    0   12714
    3    5885
    1   89238
    

    To me this feels clean, but it uses python operations heavily rather than relying on optimized pandas operations. I haven't done any stress testing but I'd imagine this could get slow on very large DataFrames. Not sure how the performance compares to adding, sorting, then deleting a column. Any tips on speeding up the code would be appreciated!

    0 讨论(0)
  • 2020-11-22 17:02

    pandas >= 1.1

    You will soon be able to use sort_values with key argument:

    pd.__version__
    # '1.1.0.dev0+2004.g8d10bfb6f'
    
    custom_dict = {'March': 0, 'April': 1, 'Dec': 3} 
    df
    
       a  b      m
    0  1  2  March
    1  5  6    Dec
    2  3  4  April
    
    df.sort_values(by=['m'], key=lambda x: x.map(custom_dict))
    
       a  b      m
    0  1  2  March
    2  3  4  April
    1  5  6    Dec

    The key argument takes as input a Series and returns a Series. This series is internally argsorted and the sorted indices are used to reorder the input DataFrame. If there are multiple columns to sort on, the key function will be applied to each one in turn. See Sorting with keys.


    pandas <= 1.0.X

    One simple method is using the output Series.map and Series.argsort to index into df using DataFrame.iloc (since argsort produces sorted integer positions); since you have a dictionary; this becomes easy.

    df.iloc[df['m'].map(custom_dict).argsort()]
    
       a  b      m
    0  1  2  March
    2  3  4  April
    1  5  6    Dec
    

    If you need to sort in descending order, invert the mapping.

    df.iloc[(-df['m'].map(custom_dict)).argsort()]
    
       a  b      m
    1  5  6    Dec
    2  3  4  April
    0  1  2  March
    

    Note that this only works on numeric items. Otherwise, you will need to workaround this using sort_values, and accessing the index:

    df.loc[df['m'].map(custom_dict).sort_values(ascending=False).index]
    
       a  b      m
    1  5  6    Dec
    2  3  4  April
    0  1  2  March
    

    More options are available with astype (this is deprecated now), or pd.Categorical, but you need to specify ordered=True for it to work correctly.

    # Older version,
    # df['m'].astype('category', 
    #                categories=sorted(custom_dict, key=custom_dict.get), 
    #                ordered=True)
    df['m'] = pd.Categorical(df['m'], 
                             categories=sorted(custom_dict, key=custom_dict.get), 
                             ordered=True)
    

    Now, a simple sort_values call will do the trick:

    df.sort_values('m')
     
       a  b      m
    0  1  2  March
    2  3  4  April
    1  5  6    Dec
    

    The categorical ordering will also be honoured when groupby sorts the output.

    0 讨论(0)
  • 2020-11-22 17:10
    import pandas as pd
    custom_dict = {'March':0,'April':1,'Dec':3}
    
    df = pd.DataFrame(...) # with columns April, March, Dec (probably alphabetically)
    
    df = pd.DataFrame(df, columns=sorted(custom_dict, key=custom_dict.get))
    

    returns a DataFrame with columns March, April, Dec

    0 讨论(0)
提交回复
热议问题