Given a Pandas Data Frame that looks like this
GROUP VALUE MASK
1 5 false
2 10 false
2 20 false
1 7 tr
use groupby
, diff
, 'MASK'
pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
Actually the bottleneck is groupby
. You don't actually need to use groupby
for this specific problem. To sort
the dataframe by GROUP
, perform diff
on the sorted dataframe and filter
by MASK
should be okay. We must use kind='mergesort'
to keep the order unchanged within the group before and after sorting,
Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this
pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
Performance Tests:
MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})
%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop
#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop