pandas: filter rows of DataFrame with operator chaining

悲哀的现实 2020-11-22 16:46

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc.), but the only way I've found to filter rows is via normal bracket indexing, which requires assigning the DataFrame to a variable before filtering on its values. Is there a chainable way to filter rows?

14 answers
  • 2020-11-22 17:16

    I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

    I'll add other edits to make this post more useful.

    pandas.DataFrame.query
    query was made for exactly this purpose. Consider the dataframe df

    import pandas as pd
    import numpy as np
    
    np.random.seed([3,1415])
    df = pd.DataFrame(
        np.random.randint(10, size=(10, 5)),
        columns=list('ABCDE')
    )
    
    df
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    6  8  7  6  4  7
    7  6  2  6  6  5
    8  2  8  7  5  8
    9  4  7  6  1  5
    

    Let's use query to filter all rows where D > B

    df.query('D > B')
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    7  6  2  6  6  5
    

    Which we can chain:

    df.query('D > B').query('C > B')
    # equivalent to
    # df.query('D > B and C > B')
    # but defeats the purpose of demonstrating chaining
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    4  3  6  7  7  4
    5  5  3  7  5  9
    7  6  2  6  6  5
    
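    query can also reference Python variables in the calling scope with @, which keeps a chain parameterized without intermediate assignments. A small sketch using the same df (threshold is a name introduced here for illustration):

    threshold = 5
    df.query('D > B').query('E > @threshold')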
  • 2020-11-22 17:17

    Just want to add a demonstration using loc to filter not only by rows but also by columns, along with some merits of the chained operation.

    The code below can filter the rows by value.

    df_filtered = df.loc[df['column'] == value]
    

    By modifying it a bit you can filter the columns as well.

    df_filtered = df.loc[df['column'] == value, ['year', 'column']]
    

    So why do we want a chained method? The answer is that it is simpler to read when you have many operations. For example,

    # note: np.nanmean needs `import numpy as np`
    res = (df
           .loc[df['station'] == 'USA', ['year', 'TEMP', 'RF']]  # include 'year' so groupby can find it
           .groupby('year')
           .agg(np.nanmean))
    
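    Since the snippet above assumes an existing df, here is a self-contained sketch with toy data (the station/year/TEMP/RF columns come from the example; the values are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'station': ['USA', 'USA', 'EU', 'USA'],
        'year':    [2000, 2000, 2001, 2001],
        'TEMP':    [15.2, np.nan, 9.8, 16.1],
        'RF':      [1.1, 2.3, 0.7, np.nan],
    })

    res = (df
           .loc[df['station'] == 'USA', ['year', 'TEMP', 'RF']]
           .groupby('year')
           .agg('mean'))   # pandas' mean skips NaN, like np.nanmean
    print(res)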
  • 2020-11-22 17:17

    If you set your columns to search as indexes, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.

    import pandas as pd
    import numpy as np
    
    np.random.seed([3,1415])
    df = pd.DataFrame(
        np.random.randint(3, size=(10, 5)),
        columns=list('ABCDE')
    )
    
    df
    # Out[55]: 
    #    A  B  C  D  E
    # 0  0  2  2  2  2
    # 1  1  1  2  0  2
    # 2  0  2  0  0  2
    # 3  0  2  2  0  1
    # 4  0  1  1  2  0
    # 5  0  0  0  1  2
    # 6  1  0  1  1  1
    # 7  0  0  2  0  2
    # 8  2  2  2  2  2
    # 9  1  2  0  2  1
    
    df.set_index(['A', 'D']).xs((0, 2), drop_level=False).reset_index()
    # Out[57]: 
    #    A  D  B  C  E
    # 0  0  2  2  2  2
    # 1  0  2  1  1  0
    
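    xs can also cross-section on a single named level if you only want to fix one of the index keys; a small sketch with the same df:

    # all rows where the 'A' level equals 0; the matched level is dropped
    df.set_index(['A', 'D']).xs(0, level='A')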
  • 2020-11-22 17:22

    I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

    In [96]: df
    Out[96]:
       A  B  C  D
    a  1  4  9  1
    b  4  5  0  2
    c  5  5  1  0
    d  1  3  9  6
    
    In [99]: df[(df.A == 1) & (df.D == 6)]
    Out[99]:
       A  B  C  D
    d  1  3  9  6
    

    But I found that, if you join the criteria with a pipe |, they are combined in an OR condition, satisfied whenever either of them is true (the (... == True) wrapping shown below is redundant; the pipe alone does the work):

    df[((df.A == 1) == True) | ((df.D == 6) == True)]
    # equivalent, without the redundant == True:
    df[(df.A == 1) | (df.D == 6)]
    
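    If you want the OR filter to stay inside a chain, query accepts or directly; a minimal self-contained sketch built from the frame above:

    import pandas as pd

    df = pd.DataFrame({'A': [1, 4, 5, 1],
                       'B': [4, 5, 5, 3],
                       'C': [9, 0, 1, 9],
                       'D': [1, 2, 0, 6]},
                      index=list('abcd'))

    print(df.query('A == 1 or D == 6'))   # same rows as df[(df.A == 1) | (df.D == 6)]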
  • 2020-11-22 17:22

    This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

    https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

    You don't need to download the entire repo: saving the file and doing

    from where import where as W
    

    should suffice. Then you use it like this:

    df = pd.DataFrame([[1, 2, True],
                       [3, 4, False], 
                       [5, 7, True]],
                      index=range(3), columns=['a', 'b', 'c'])
    # On specific column:
    print(df.loc[W['a'] > 2])
    print(df.loc[-W['a'] == W['b']])
    print(df.loc[~W['c']])
    # On entire - or subset of a - DataFrame:
    print(df.loc[W.sum(axis=1) > 3])
    print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])
    

    A slightly less stupid usage example:

    data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]
    

    By the way, even in the case where you are just using boolean columns,

    df.loc[W['cond1']].loc[W['cond2']]
    

    can be much more efficient than

    df.loc[W['cond1'] & W['cond2']]
    

    because it evaluates cond2 only where cond1 is True.
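    Related: plain pandas gets much of the same effect, because .loc itself accepts a callable (since pandas 0.18.1), so each filter is evaluated on the intermediate result; a minimal sketch with the same df:

    df.loc[lambda d: d['a'] > 2].loc[lambda d: d['b'] > 4]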

    DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.

  • 2020-11-22 17:25

    The answer from @lodagro is great. I would extend it by generalizing the mask function as:

    def mask(df, f):
        return df[f(df)]
    # attach it so it can be chained (note: this shadows pandas' built-in DataFrame.mask)
    pd.DataFrame.mask = mask
    

    Then you can do stuff like:

    df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)
    
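    If you would rather not shadow pandas' built-in DataFrame.mask, the same helper chains cleanly through pipe; a minimal runnable sketch:

    import pandas as pd

    def mask(df, f):
        return df[f(df)]

    df = pd.DataFrame({0: [-1, -2, 3], 1: [5, -4, 6]})

    # chain via pipe instead of monkey-patching
    print(df.pipe(mask, lambda x: x[0] < 0).pipe(mask, lambda x: x[1] > 0))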