I have a data set that looks like this
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manage
Here is the succinct dplyr
solution for the same problem.
NOTE: Make sure that stringsAsFactors = FALSE
while reading in the data.
library(dplyr)
dat %>%
group_by(name, job) %>%
filter(job != "Boss" | year == min(year)) %>%
mutate(cumu_job2 = cumsum(job2))
Output:
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
8 2 Bob 1985 Worker 0 0
9 2 Bob 1986 Worker 0 0
10 2 Bob 1987 Manager 1 1
11 2 Bob 1988 Boss 0 0
Explanation
cumu_job2
column.Here is a base solution using within
and ave
. We assume that the input is DF
and that the data is sorted as in the question.
DF2 <- within(DF, {
seq = ave(id, id, job, FUN = seq_along)
job2 = (job == "Manager") + 0
cumu_job2 = ave(job2, id, job, FUN = cumsum)
})
subset(DF2, job != 'Boss' | seq == 1, select = - seq)
REVISION: Now uses within
.
I think this does what you want, although the data must be sorted as you have presented it.
my.df <- read.table(text = '
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
', header = TRUE, stringsAsFactors = FALSE)
my.seq <- data.frame(rle(my.df$job)$lengths)
my.df$cumu_job2 <- as.vector(unlist(apply(my.seq, 1, function(x) seq(1,x))))
my.df2 <- my.df[!(my.df$job=='Boss' & my.df$cumu_job2 != 1),]
my.df2$cumu_job2[my.df2$job != 'Manager'] <- 0
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
9 2 Bob 1985 Worker 0 0
10 2 Bob 1986 Worker 0 0
11 2 Bob 1987 Manager 1 1
12 2 Bob 1988 Boss 0 0
@BrodieG's is way better:
The Data
dat <- read.table(text="id name year job job2
1 Jane 1980 Manager 1
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Manager 1
2 Bob 1986 Manager 1
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0", header=TRUE)
#The code:
inds1 <- rle(dat$job2)
inds2 <- cumsum(inds1[[1]])[inds1[[2]] == 1] + 1
ends <- cumsum(inds1[[1]])
starts <- c(1, head(ends + 1, -1))
inds3 <- mapply(":", starts, ends)
dat$id <- rep(1:length(inds3), sapply(inds3, length))
dat <- do.call(rbind, lapply(split(dat[, 1:5], dat$id ), function(x) {
if(x$job2[1] == 0){
x$cumu_job2 <- rep(0, nrow(x))
} else {
x$cumu_job2 <- 1:nrow(x)
}
x
}))
keeps <- dat$job2 > 0
keeps[inds2] <- TRUE
dat2 <- data.frame(dat[keeps, ], row.names = NULL)
dat2
## id name year job job2 cumu_job2
## 1 1 Jane 1980 Manager 1 1
## 2 1 Jane 1981 Manager 1 2
## 3 1 Jane 1982 Manager 1 3
## 4 1 Jane 1983 Manager 1 4
## 5 1 Jane 1984 Manager 1 5
## 6 1 Jane 1985 Manager 1 6
## 7 2 Jane 1986 Boss 0 0
## 8 3 Bob 1985 Manager 1 1
## 9 3 Bob 1986 Manager 1 2
## 10 3 Bob 1987 Manager 1 3
## 11 4 Bob 1988 Boss 0 0
Contributed by Matthew Dowle:
dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by = list(name, job)]
Explanation
.SD
)Older versions:
You have two different split apply combines here. One to get the cumulative jobs, and the other to get the first row of boss status. Here is an implementation in data.table
where we basically do each analysis separately (well, kind of), and then collect everything in one place with rbind
. The main thing to note is the by=id
piece, which basically means the other expressions are evaluated for each id
grouping in the data, which was what you correctly noted was missing from your attempt.
library(data.table)
dt <- as.data.table(df)
dt[, cumujob:=0L] # add column, set to zero
dt[job2==1, cumujob:=cumsum(job2), by=id] # cumsum for manager time by person
rbind(
dt[job2==1], # this is just the manager portion of the data
dt[job2==0, head(.SD, 1), by=id] # get first bossdom row
)[order(id, year)] # order by id, year
# id name year job job2 cumujob
# 1: 1 Jane 1980 Manager 1 1
# 2: 1 Jane 1981 Manager 1 2
# 3: 1 Jane 1982 Manager 1 3
# 4: 1 Jane 1983 Manager 1 4
# 5: 1 Jane 1984 Manager 1 5
# 6: 1 Jane 1985 Manager 1 6
# 7: 1 Jane 1986 Boss 0 0
# 8: 2 Bob 1985 Manager 1 1
# 9: 2 Bob 1986 Manager 1 2
# 10: 2 Bob 1987 Manager 1 3
# 11: 2 Bob 1988 Boss 0 0
Note this assumes table is sorted by year within each id
, but if it isn't that's easy enough to fix.
Alternatively you could also achieve the same with:
ans <- dt[, .I[job != "Boss" | year == min(year)], by=list(name, job)]
ans <- dt[ans$V1]
ans[, cumujob := cumsum(job2), by=list(name,job)]
The idea is to basically get the row numbers where the condition matches (with .I
- internal variable) and then subset dt
on those row numbers (the $v1
part), then just perform the cumulative sum.