问题
I've got a following data frame in Python:
df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
np.repeat([1, 2, 3], [2, 2, 2])]),
'obj': list('AB' * 6),
'var': [1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1]})
First, within each group defined by object
, I'd like to assign id to unique run of measurement_id
and var
columns. If any value of those columns changes, it starts new run that should be assigned with new id. So the
df['rleid_output'] = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 3]
Then, for each group defined by rleid_output
I'd like to check how many minutes (min
column) the run lasted giving me expected_output
column:
df['expected_output'] = [2, 2, 2, 2, 1, 1, 2, 3, 2, 3, 1, 3]
If it was R, I'd proceed as follows:
df <- data.frame(measurement_id = rep(1:2, each = 6),
min = rep(rep(1:3, each = 2), 2),
object = rep(LETTERS[1:2], 6),
var = c(1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1))
df %>%
group_by(object) %>%
mutate(rleid = data.table::rleid(measurement_id, var)) %>%
group_by(object, rleid) %>%
mutate(expected_output = last(min) - first(min) + 1)
So the main thing I need is R data.table::rleid
equivalent that would work with Python pd.DataFrame.groupby
clause. Any ideas how to solve this?
@Edit: new, updated example of data frame:
df = pd.DataFrame.from_dict({'measurement_id': np.repeat([1, 2], [6, 6]),
'min': np.concatenate([np.repeat([1, 2, 3], [2, 2, 2]),
np.repeat([1, 2, 3], [2, 2, 2])]),
'obj': list('AB' * 6),
'var': [1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 1, 1]})
df['rleid_output'] = [1, 1, 2, 1, 3, 2, 4, 3, 4, 3, 5, 3]
df['expected_output'] = [1, 2, 1, 2, 1, 1, 2, 3, 2, 3, 1, 3]
回答1:
Updated answer
The problem is that the min
column in each group of measurement_id, obj, var
should be maintained order. We can check this by group by on measurement_id, obj, var
and then checking if the difference in min
column is greater than 1
. If so, we mark it as a unique duration in expected_output
:
df['grouper'] = (df.groupby(['measurement_id', 'obj', 'var'])['min']
.apply(lambda x: x.diff().fillna(1).eq(1))
)
df['expected_output'] = (
df.groupby(['measurement_id', 'obj', 'var'])['grouper'].transform('sum').astype(int)
)
df = df.drop(columns='grouper')
measurement_id min obj var expected_output
0 1 1 A 1 1
1 1 1 B 2 2
2 1 2 A 2 1
3 1 2 B 2 2
4 1 3 A 1 1
5 1 3 B 1 1
6 2 1 A 2 2
7 2 1 B 1 3
8 2 2 A 2 2
9 2 2 B 1 3
10 2 3 A 1 1
11 2 3 B 1 3
Old answer, following OP's logic
We can achieve this by using GroupBy.diff
to get your rleid_output
, basically a unique identifier each time var
changes for each measurement_id
& obj
After that using GroupBy.nunique
to measure the amount of minutes
:
rleid_output = df.groupby(['measurement_id', 'obj'])['var'].diff().abs().bfill()
df['expected_output'] = (df.groupby(['measurement_id', 'obj', rleid_output])['min']
.transform('nunique'))
measurement_id min obj var expected_output
0 1 1 A 1 2
1 1 1 B 2 2
2 1 2 A 1 2
3 1 2 B 2 2
4 1 3 A 2 1
5 1 3 B 1 1
6 2 1 A 2 2
7 2 1 B 1 3
8 2 2 A 2 2
9 2 2 B 1 3
10 2 3 A 1 1
11 2 3 B 1 3
回答2:
To mimic the behaviour of R rleid
function one can firstly create an artificial column that checks if current value has changed in comparison to previous one or not. In this case, we should do this on grouped var
Series:
var_grpd = df.groupby(['measurement_id', 'obj'])['var']
df['tmp'] = (var_grpd.shift(0) != var_grpd.shift(1))
Then, we can use this artificial tmp
column to obtain rleid_output2
. After that, tmp
column is no longer needed.
df['rleid_output2'] = df.groupby('obj')['tmp'].cumsum().astype(int)
df.drop('tmp', axis = 1, inplace = True)
Finally, to check how many minutes var
value lasts, we can calculate the difference between last and first minute within a group.
df['expected_output2'] = df.groupby(['obj', 'rleid_output2'])['min'] \
.transform(lambda x: x.iat[-1] - x.iat[0] + 1)
.iat
is similar to .iloc
but allows us to access single value in DataFrame
or Series
.
来源:https://stackoverflow.com/questions/59501437/r-group-by-rleid-equivalent-in-python