What is the python equivalent to the TOP function in T-SQL? I\'m looking to filter my dataframe to the top 50K rows. I\'ve looked online and I can\'t find a simple example.
UPDATE: - shows different pandas approaches, including:
top N rows per group
top N rows with offset
equivalent for SQL aggregate functions:
ROW_NUMBER() / RANK() OVER(PARTITION BY ... ORDER BY ...)
sample DF:
df = pd.DataFrame({
'dep': np.random.choice(list('ABC'), 20),
'manager_id': np.random.randint(0, 10, 20),
'salary': np.random.randint(5000, 5006, 20)
})
----------------------- Original DF ------------------------
In [2]: df
Out[2]:
dep manager_id salary
0 B 5 5005
1 A 6 5001
2 C 8 5000
3 A 7 5000
4 B 0 5002
5 A 3 5003
6 A 2 5004
7 A 2 5004
8 C 3 5002
9 C 4 5001
10 A 9 5002
11 C 9 5000
12 B 8 5004
13 A 1 5003
14 C 7 5005
15 B 0 5002
16 B 2 5003
17 A 4 5000
18 B 2 5003
19 B 7 5003
------------------ top 5 rows (sorted by original index) -------------------
In [3]: df.head(5)
Out[3]:
dep manager_id salary
0 B 5 5005
1 A 6 5001
2 C 8 5000
3 A 7 5000
4 B 0 5002
--- top 5 rows (sorted by manager_id
DESC, dep
ASC) ----
In [4]: df.sort_values(by=['manager_id', 'dep'], ascending=[False,True]).head(5)
Out[4]:
dep manager_id salary
10 A 9 5002
11 C 9 5000
12 B 8 5004
2 C 8 5000
3 A 7 5000
--- equivalent for SELECT * FROM tab ORDER BY salary DESC LIMIT 5 OFFSET 3
---
In [19]: df.nlargest(5+3, columns=['salary']).tail(5)
Out[19]:
dep manager_id salary
7 A 2 5004
12 B 8 5004
5 A 3 5003
13 A 1 5003
16 B 2 5003
---- top 2 salaries in each department (no duplicates) -----
--- equivalent for SQL: row_number() over(partition by DEP order by SALARY desc)
---
In [7]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
...: .groupby(['dep'])
...: .cumcount() + 1)
...: .query('rn < 3')
...: .sort_values(['dep','rn'])
...: )
Out[7]:
dep manager_id salary rn
6 A 2 5004 1
7 A 2 5004 2
0 B 5 5005 1
12 B 8 5004 2
14 C 7 5005 1
8 C 3 5002 2
--- top 2 salaries in each department (using "nlargest") ----
In [15]: df.loc[df.groupby('dep')['salary'].nlargest(2).reset_index()['level_1']]
Out[15]:
dep manager_id salary
6 A 2 5004
7 A 2 5004
0 B 5 5005
12 B 8 5004
14 C 7 5005
8 C 3 5002
--- second and third highest salaries in each department ---
In [16]: (df.assign(rn=df.sort_values(['salary'], ascending=False)
....: .groupby(['dep'])
....: .cumcount() + 1)
....: .query('rn >= 2 and rn <= 3')
....: .sort_values(['dep','rn'])
....: )
Out[16]:
dep manager_id salary rn
7 A 2 5004 2
13 A 1 5003 3
12 B 8 5004 2
18 B 2 5003 3
8 C 3 5002 2
9 C 4 5001 3
--- top 2 salaries in each department (with duplicates) ----
--- equivalent for SQL: rank() over(partition by DEP order by SALARY desc)
---
In [18]: (df.assign(rnk=df.groupby(['dep'])['salary']
....: .rank(method='min', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['dep','rnk'])
....: )
Out[18]:
dep manager_id salary rnk
6 A 2 5004 1.0
7 A 2 5004 1.0
0 B 5 5005 1.0
12 B 8 5004 2.0
14 C 7 5005 1.0
8 C 3 5002 2.0