I have a dataframe that I subset like this:
   a  b  x  y
0  1  2  3 -1
1  2  4  6 -2
2  3  6  6 -3
3  4  8  3 -4
df = df[(df.a >= 2) & (df.b <=
As long as you can categorize a step as something that takes a DataFrame (possibly with more arguments) and returns a DataFrame, you can use pipe. Whether there's an advantage to doing so is another question.
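To make that contract concrete, here is a minimal sketch using the frame from the question. The helper name filter_range and its parameters lo and hi are made up for illustration; the only point is that df.pipe(f, *args) calls f(df, *args).

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})

def filter_range(frame, lo, hi):
    # Hypothetical helper: keep rows where a >= lo and b <= hi.
    return frame[(frame.a >= lo) & (frame.b <= hi)]

# pipe passes the DataFrame as the first argument, then the extra ones.
piped = df.pipe(filter_range, 2, 8)
direct = filter_range(df, 2, 8)

print(piped.equals(direct))
```

Both spellings produce the same frame, so the choice between them is about readability of the chain, not behavior.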
Here, e.g., you can use
df\
.pipe(lambda df_, x, y: df_[(df_.a >= x) & (df_.b <= y)], 2, 8)\
.pipe(lambda df_: df_.groupby(df_.x))\
.mean()
Notice how the first stage is a lambda that takes 3 arguments, with the 2 and 8 passed as parameters. That's not the only way to do it; the stage is equivalent to
.pipe(lambda df_: df_[(df_.a >= 2) & (df_.b <= 8)])\
Also note that you can use
df\
.pipe(lambda df_, x, y: df[(df.a >= x) & (df.b <= y)], 2, 8)\
.groupby('x')\
.mean()
Here the lambda takes df_, but operates on df, and the second pipe has been replaced with a groupby.
The first change works here, but is fragile. It happens to work because this is the first pipe stage. In a later stage, the lambda might receive a DataFrame with one set of rows and attempt to filter it with a mask built from a DataFrame with a different set of rows, for example.
The second change is fine. In fact, I think it is more readable. Basically, anything that takes a DataFrame and returns one can either be called directly or through pipe.
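One way the fragility of closing over the outer df can show up is sketched below, assuming an earlier stage has already dropped rows. The thresholds (a >= 3, b <= 8) are chosen only for this illustration. A later lambda that ignores its df_ argument silently throws away the earlier filtering.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})

# Second stage ignores df_ and filters the original df instead.
fragile = (
    df
    .pipe(lambda df_: df_[df_.a >= 3])
    .pipe(lambda df_: df[df.b <= 8])
)

# Second stage works on what the first stage handed over.
robust = (
    df
    .pipe(lambda df_: df_[df_.a >= 3])
    .pipe(lambda df_: df_[df_.b <= 8])
)

print(len(fragile), len(robust))  # 4 2
```

The fragile chain returns all four rows because the result of the first stage never reaches the second filter.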
You can try, but I think it is more complicated:
print(df[(df.a >= 2) & (df.b <= 8)].groupby(df.x).mean())
     a  b  x    y
x
3  4.0  8  3 -4.0
6  2.5  5  6 -2.5
def masker(df, mask):
    return df[mask]

mask1 = (df.a >= 2)
mask2 = (df.b <= 8)

print(df.pipe(masker, mask1).pipe(masker, mask2).groupby(df.x).mean())
     a  b  x    y
x
3  4.0  8  3 -4.0
6  2.5  5  6 -2.5
I believe this method is clear with regard to your filtering steps and subsequent operations. Using loc[(mask1) & (mask2)] is probably more performant, however.
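As a rough check that the two approaches select the same rows, here is a small sketch that rebuilds the question's frame and compares the chained masker calls with a single combined mask. It makes no claim about timing; measuring that would need a larger frame.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})

def masker(frame, mask):
    return frame[mask]

mask1 = df.a >= 2
mask2 = df.b <= 8

# Two filtering passes. mask2 was built from the full frame, so pandas
# reindexes it to the already-filtered rows (and may emit a UserWarning).
chained = df.pipe(masker, mask1).pipe(masker, mask2)

# One pass with both conditions combined.
combined = df.loc[mask1 & mask2]

print(chained.equals(combined))
```

The combined form builds one boolean mask and indexes once, which is why it is likely the cheaper of the two.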
>>> (df
...  .pipe(lambda x: x.loc[x.a >= 2])
...  .pipe(lambda x: x.loc[x.b <= 8])
...  .pipe(pd.DataFrame.groupby, 'x')
...  .mean()
... )
     a  b    y
x
3  4.0  8 -4.0
6  2.5  5 -2.5
Alternatively:
(df
 .pipe(lambda x: x.loc[x.a >= 2])
 .pipe(lambda x: x.loc[x.b <= 8])
 .groupby('x')
 .mean()
)
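A quick sanity check that the two spellings agree, again on the question's frame. This is only a sketch; pd.testing.assert_frame_equal raises if the results differ.

```python
import pandas as pd

# The frame from the question.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "x": [3, 6, 6, 3],
    "y": [-1, -2, -3, -4],
})

filtered = (
    df
    .pipe(lambda x: x.loc[x.a >= 2])
    .pipe(lambda x: x.loc[x.b <= 8])
)

# groupby routed through pipe as an unbound method ...
via_pipe = filtered.pipe(pd.DataFrame.groupby, "x").mean()
# ... versus groupby called directly on the frame.
direct = filtered.groupby("x").mean()

pd.testing.assert_frame_equal(via_pipe, direct)
print(direct)
```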