Pandas: top N rows, top N rows per group, equivalent for ROW_NUMBER OVER(PARTITION BY … ORDER BY …)

后端 未结 1 1916
情书的邮戳
情书的邮戳 2020-12-18 02:12

What is the python equivalent to the TOP function in T-SQL? I\'m looking to filter my dataframe to the top 50K rows. I\'ve looked online and I can\'t find a simple example.

相关标签:
1条回答
  • 2020-12-18 02:43

    UPDATE: - shows different pandas approaches, including:

    top N rows per group

    top N rows with offset

    equivalent for SQL aggregate functions:

    ROW_NUMBER() / RANK() OVER(PARTITION BY ... ORDER BY ...)

    sample DF:

    df = pd.DataFrame({
      'dep': np.random.choice(list('ABC'), 20),
      'manager_id': np.random.randint(0, 10, 20),
      'salary': np.random.randint(5000, 5006, 20)
    })
    

    ----------------------- Original DF ------------------------

    In [2]: df
    Out[2]:
       dep  manager_id  salary
    0    B           5    5005
    1    A           6    5001
    2    C           8    5000
    3    A           7    5000
    4    B           0    5002
    5    A           3    5003
    6    A           2    5004
    7    A           2    5004
    8    C           3    5002
    9    C           4    5001
    10   A           9    5002
    11   C           9    5000
    12   B           8    5004
    13   A           1    5003
    14   C           7    5005
    15   B           0    5002
    16   B           2    5003
    17   A           4    5000
    18   B           2    5003
    19   B           7    5003
    

    ------------------ top 5 rows (sorted by original index) -------------------

    In [3]: df.head(5)
    Out[3]:
      dep  manager_id  salary
    0   B           5    5005
    1   A           6    5001
    2   C           8    5000
    3   A           7    5000
    4   B           0    5002
    

    --- top 5 rows (sorted by manager_id DESC, dep ASC) ----

    In [4]: df.sort_values(by=['manager_id', 'dep'], ascending=[False,True]).head(5)
    Out[4]:
       dep  manager_id  salary
    10   A           9    5002
    11   C           9    5000
    12   B           8    5004
    2    C           8    5000
    3    A           7    5000
    

    --- equivalent for SELECT * FROM tab ORDER BY salary DESC LIMIT 5 OFFSET 3 ---

    In [19]: df.nlargest(5+3, columns=['salary']).tail(5)
    Out[19]:
       dep  manager_id  salary
    7    A           2    5004
    12   B           8    5004
    5    A           3    5003
    13   A           1    5003
    16   B           2    5003
    

    ---- top 2 salaries in each department (no duplicates) -----

    --- equivalent for SQL: row_number() over(partition by DEP order by SALARY desc) ---

    In [7]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
       ...:                 .groupby(['dep'])
       ...:                 .cumcount() + 1)
       ...:    .query('rn < 3')
       ...:    .sort_values(['dep','rn'])
       ...: )
    Out[7]:
       dep  manager_id  salary  rn
    6    A           2    5004   1
    7    A           2    5004   2
    0    B           5    5005   1
    12   B           8    5004   2
    14   C           7    5005   1
    8    C           3    5002   2
    

    --- top 2 salaries in each department (using "nlargest") ----

    In [15]: df.loc[df.groupby('dep')['salary'].nlargest(2).reset_index()['level_1']]
    Out[15]:
       dep  manager_id  salary
    6    A           2    5004
    7    A           2    5004
    0    B           5    5005
    12   B           8    5004
    14   C           7    5005
    8    C           3    5002
    

    --- second and third highest salaries in each department ---

    In [16]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
       ....:                 .groupby(['dep'])
       ....:                 .cumcount() + 1)
       ....:    .query('rn >= 2 and rn <= 3')
       ....:    .sort_values(['dep','rn'])
       ....: )
    Out[16]:
       dep  manager_id  salary  rn
    7    A           2    5004   2
    13   A           1    5003   3
    12   B           8    5004   2
    18   B           2    5003   3
    8    C           3    5002   2
    9    C           4    5001   3
    

    --- top 2 salaries in each department (with duplicates) ----

    --- equivalent for SQL: rank() over(partition by DEP order by SALARY desc) ---

    In [18]: (df.assign(rnk=df.groupby(['dep'])['salary']
       ....:                  .rank(method='min', ascending=False))
       ....:    .query('rnk < 3')
       ....:    .sort_values(['dep','rnk'])
       ....: )
    Out[18]:
       dep  manager_id  salary  rnk
    6    A           2    5004  1.0
    7    A           2    5004  1.0
    0    B           5    5005  1.0
    12   B           8    5004  2.0
    14   C           7    5005  1.0
    8    C           3    5002  2.0
    
    0 讨论(0)
提交回复
热议问题