Question
I have a data frame with several columns. Based on the column 'activity', I want to remove entire contiguous runs of a specific value, 'pt', but only when they occur immediately before or after a run of 'outside'.
In the simplified data below, there is one run where 'activity' is 'outside', with a chunk of 'pt' both before and after it. These two 'pt' chunks should be removed.
activity dist
1 home 1
2 pt 2 # <- run of 'pt' before run of 'outside': remove
3 pt 3 # <-
4 pt 4 # <-
5 outside 5
6 outside 6
7 pt 7 # <- run of 'pt' after run of 'outside': remove
8 pt 8 # <-
9 work 9
10 pt 10
11 pt 11
12 home 12
Thus, the desired output is:
activity dist
1 home 1
2 outside 5
3 outside 6
4 work 9
5 pt 10
6 pt 11
7 home 12
How can this be achieved?
dput of data:
d <- structure(list(activity = c("home", "pt", "pt", "pt", "outside", "outside", "pt", "pt", "work", "pt", "pt", "home"),
dist = 1:12),
class = "data.frame", row.names = c(NA, -12L))
Answer 1:
You may use some convenience functions from the data.table package: rleid to "[g]enerate run-length type group id", and shift to get the values before and after the focal index in a vector.
library(data.table)
setDT(d)
d[ , r := rleid(activity)]
d[!(r %in% r[activity == "pt" & shift(activity, type = "lead") == "outside" |
shift(activity) == "outside" & activity == "pt"])]
# activity dist r
# 1: home 1 1
# 2: outside 5 3
# 3: outside 6 3
# 4: work 9 5
# 5: pt 10 6
# 6: pt 11 6
# 7: home 12 7
Explanation:
Coerce your data.frame to a data.table (setDT(d)). Create a run-length index of 'activity' (rleid). Check whether the current value is 'pt' and the next value is 'outside' (activity == "pt" & shift(activity, type = "lead") == "outside"), or (|) whether the current value is 'pt' and the previous value is 'outside' (activity == "pt" & shift(activity) == "outside").
Where this condition is TRUE, grab the run groups to be removed (r[<condition>]). Check whether each row's run is among the groups to be removed (r %in% <run groups to be removed>). If so, do not (!) keep these rows when indexing the data (d[<condition>]).
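The steps above can also be spelled out with intermediate variables, which may make the logic easier to check. This is only a sketch on the example data: the names is_pt, next_is_outside, prev_is_outside, drop_runs, keep and res are made up for illustration and are not part of data.table. Unlike the code above, it passes fill = "" to shift (assuming "" is never a real activity) so the first and last rows compare as FALSE rather than NA, and it does not add a column r to the data.

```r
library(data.table)

d <- data.table(
  activity = c("home", "pt", "pt", "pt", "outside", "outside",
               "pt", "pt", "work", "pt", "pt", "home"),
  dist = 1:12
)

# Run id: increases by one each time 'activity' changes
r <- rleid(d$activity)

# Row-wise flags; fill = "" is an assumption that "" is not a real activity
is_pt           <- d$activity == "pt"
next_is_outside <- shift(d$activity, type = "lead", fill = "") == "outside"
prev_is_outside <- shift(d$activity, type = "lag",  fill = "") == "outside"

# Ids of 'pt' runs that touch an 'outside' run on either side
drop_runs <- unique(r[is_pt & (next_is_outside | prev_is_outside)])

# Keep only rows whose run id is not marked for removal
keep <- !(r %in% drop_runs)
res <- d[keep]
res
```

Only one row per run needs to satisfy the condition (the last 'pt' row before 'outside', or the first one after it); matching on the run id is what extends the removal to the whole run.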
base alternative using rle.
The values of runs of 'pt' before or after 'outside' are replaced with NA. The rle is converted back to a vector (inverse.rle) and rows with NA are removed (na.omit).
Obviously, if there are rows with NA in the original data set which you want to keep, you need to use another value for replacement.
r <- rle(d$activity)
v <- r$values
# First which(): 'pt' runs followed by 'outside'; second which(): 'pt' runs preceded by 'outside'
r$values[c(which(head(v, -1) == "pt" & tail(v, -1) == "outside"),
           which(head(v, -1) == "outside" & tail(v, -1) == "pt") + 1)] <- NA
d$activity <- inverse.rle(r)
na.omit(d)
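If the data may already contain NA values that should be kept, one way to avoid a replacement value altogether is to build a logical keep mask from the run lengths. The following is a minimal sketch of that idea, not part of the original answer; the names prev_v, next_v, drop_run, keep and res are illustrative, and "" is assumed not to occur as a real activity because it is used to pad the ends.

```r
d <- data.frame(
  activity = c("home", "pt", "pt", "pt", "outside", "outside",
               "pt", "pt", "work", "pt", "pt", "home"),
  dist = 1:12,
  stringsAsFactors = FALSE
)

r <- rle(d$activity)
v <- r$values
n <- length(v)

# Value of the neighbouring run on each side; "" pads the two ends
prev_v <- c("", v[-n])
next_v <- c(v[-1], "")

# %in% never returns NA, so NA activities are simply treated as "no match"
drop_run <- v %in% "pt" & (prev_v %in% "outside" | next_v %in% "outside")

# Expand the per-run decision back to one flag per row
keep <- rep(!drop_run, times = r$lengths)

res <- d[keep, ]
rownames(res) <- NULL
res
```

Because the decision is made per run and then expanded with rep, the original 'activity' column is never modified, so existing NA values pass through untouched.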
Source: https://stackoverflow.com/questions/62454188/delete-runs-of-certain-value-before-and-after-specific-value