SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe

我寻月下人不归 2020-11-27 03:49

I come from a SQL background and I use the following data processing step frequently:

  1. Partition the table of data by one or more fields
  2. For each parti
5 Answers
  • 2020-11-27 04:07

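    pandas.lib.fast_zip() can create an array of tuples from a list of arrays. You can use this function to build a Series of tuples and then rank it. Note that pandas.lib was a private module in older pandas releases, so this code may not run on current versions: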

    import pandas as pd

    values = {'key1' : ['a','a','a','b','a','b'],
              'data1' : [1,2,2,3,3,3],
              'data2' : [1,10,2,3,30,20]}
    
    df = pd.DataFrame(values, index=list("abcdef"))
    
    def rank_multi_columns(df, cols, **kw):
        data = []
        for col in cols:
            if col.startswith("-"):
                flag = -1
                col = col[1:]
            else:
                flag = 1
            data.append(flag*df[col])
        values = pd.lib.fast_zip(data)
        s = pd.Series(values, index=df.index)
        return s.rank(**kw)
    
    rank = df.groupby("key1").apply(lambda df:rank_multi_columns(df, ["data1", "-data2"]))
    
    print(rank)
    

    The result:

    a    1
    b    2
    c    3
    d    2
    e    4
    f    1
    dtype: float64
    
  • 2020-11-27 04:15

    You can use transform and rank together. Here is an example:

    df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
                       'C2' : [1,2,3,4,5]})
    df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
    df
    

    See the documentation for the pandas rank method for more information.
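
For readers mapping this back to SQL, the `method` argument of `rank` decides how ties are handled. The sketch below uses made-up data with a tie (it is not from the answer above); roughly, `method='first'` behaves like ROW_NUMBER(), `'min'` like RANK(), and `'dense'` like DENSE_RANK().

```python
import pandas as pd

# Illustrative data (hypothetical): group 'a' has a tie on C2
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [1, 1, 3, 4, 5]})

grouped = df.groupby('C1')['C2']

df['row_number'] = grouped.rank(method='first')  # ties broken by order of appearance
df['rank'] = grouped.rank(method='min')          # ties share the lowest rank, gaps follow
df['dense_rank'] = grouped.rank(method='dense')  # ties share a rank, no gaps
df['average'] = grouped.rank()                   # default method='average'

print(df)
```

All four columns come back as floats, because the default `'average'` method can produce fractional ranks.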

  • 2020-11-27 04:21

    Use the groupby.rank function. Here is a working example:

    df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
    df
    
    C1 C2
    a  1
    a  2
    a  3
    b  4
    b  5
    
    df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
    df
    
    C1 C2 RANK
    a  1  1
    a  2  2
    a  3  3
    b  4  1
    b  5  2
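
A small variation, offered as a sketch rather than part of the answer: `rank` returns floats, so if integer row numbers are wanted and the column has no missing values, the result can be cast with `astype(int)`. The `RANK_DESC` column name is made up here to show the descending (ORDER BY ... DESC) case.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# rank() yields floats; cast to int when the ranked column has no NaN values
df['RANK'] = df.groupby('C1')['C2'].rank(method='first').astype(int)

# Same partitioning, but ordered by C2 descending within each group
df['RANK_DESC'] = (df.groupby('C1')['C2']
                     .rank(method='first', ascending=False)
                     .astype(int))

print(df)
```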
    
    
  • 2020-11-27 04:24

    You can also use sort_values(), groupby() and finally cumcount() + 1:

    df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
                 .groupby(['key1']) \
                 .cumcount() + 1
    print(df)
    

    yields:

       data1  data2 key1  RN
    0      1      1    a   1
    1      2     10    a   2
    2      2      2    a   3
    3      3      3    b   1
    4      3     30    a   4
    

    PS: tested with pandas 0.18.
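
The same idea can be wrapped in a small function. This is only a sketch: `row_number` is a hypothetical helper name, not a pandas API, and the DataFrame is rebuilt here from the five rows shown in the output above so that the block runs on its own.

```python
import pandas as pd

def row_number(df, partition_by, order_by, ascending=True):
    """Hypothetical helper: 1-based row number within each partition."""
    ordered = df.sort_values(order_by, ascending=ascending)
    # cumcount() keeps the original index labels, so the result
    # aligns back to df when it is assigned as a new column
    return ordered.groupby(partition_by).cumcount() + 1

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})

# PARTITION BY key1 ORDER BY data1 ASC, data2 DESC
df['RN'] = row_number(df, ['key1'], ['data1', 'data2'], ascending=[True, False])
print(df)
```

Rows that tie on every `order_by` column get distinct numbers in whatever order the sort leaves them, which matches how ROW_NUMBER() is usually left unspecified for ties.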

  • 2020-11-27 04:27

    You can do this by using groupby twice along with the rank method:

    In [11]: g = df.groupby('key1')
    

    Pass method='min' so that rows sharing the same data1 value get the same RN:

    In [12]: g['data1'].rank(method='min')
    Out[12]:
    0    1
    1    2
    2    2
    3    1
    4    4
    dtype: float64
    
    In [13]: df['RN'] = g['data1'].rank(method='min')
    

    Then group by these results and add the rank with respect to data2:

    In [14]: g1 = df.groupby(['key1', 'RN'])
    
    In [15]: g1['data2'].rank(ascending=False) - 1
    Out[15]:
    0    0
    1    0
    2    1
    3    0
    4    0
    dtype: float64
    
    In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
    
    In [17]: df
    Out[17]:
       data1  data2 key1  RN
    0      1      1    a   1
    1      2     10    a   2
    2      2      2    a   3
    3      3      3    b   1
    4      3     30    a   4
    

    It feels like there ought to be a native way to do this (there may well be!...).
