Given a data frame that looks like this
GROUP VALUE
1 5
2 2
1 10
2 20
1 7
I would like to compute the differe
Using @unutbu 's df
per timing
unutbu's solution is best over large data sets
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1 5
2 18
Name: VALUE, dtype: int64
np.ptp docs returns the range of an array
timing
small df
large df
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))
large df
many groups
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))
groupby/agg
generally performs best when you take advantage of the built-in aggregators such as 'max'
and 'min'
. So to obtain the difference, first compute the max
and min
and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max']-result['min']
print(result[['diff']])
yields
diff
GROUP
1 5
2 18
You can use groupby()
, min()
, and max()
:
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())