Question
We have time series data in which repeated observations were measured for several subjects. I would like to calculate the number of occasions on which positive == 1 occurs for each subject (variable id).
A second aim is to identify the maximum length of these runs of consecutive observations in which positive == 1. For each subject there are likely to be multiple runs within the study period. Rather than calculating the maximum number of consecutive positive observations per subject, I would like to calculate the maximum run length within each individual run.
Here is a toy data set that illustrates the problem:
set.seed(1234)
test <- data.frame(id = rep(1:3, each = 10), positive = round(runif(30,0,1)))
test$run <- sequence(rle(test$positive)$lengths)
test$run_positive <- ifelse(test$positive == '0', '0', test$run)
test$episode <- ifelse(test$run_positive == '1', '1', '0')
library(plyr)  # assumption: count() here is plyr::count, which prints x/freq columns
count(test$episode)
x freq
1 0 25
2 1 5
The code above gets close to answering my first question, in which I am attempting to count the number of positive episodes; however, it is not conditioned by subject. This has the unfortunate effect of counting the last observation of Subject #1 and the first observation of Subject #2 in the same run. Can anyone help me develop code to condition this run length encoding by subject?
Secondly, how can one extract only the maximum run length for each run in which positive == 1? I would like to add an additional column that records the run length only at the observation where each run reaches its maximum. For Subject #1, this would look like:
id positive run run_positive episode max_run
1 1 0 1 0 0 0
2 1 1 1 1 1 0
3 1 1 2 2 0 0
4 1 1 3 3 0 0
5 1 1 4 4 0 0
6 1 1 5 5 0 5
7 1 0 1 0 0 0
8 1 0 2 0 0 0
9 1 1 1 1 1 0
10 1 1 2 2 0 2
If anyone can come up with a method to do this I would be extremely grateful.
Answer 1:
I think this answers your first question:
aggregate(positive ~ id, data = test, FUN = sum)
id positive
1 1 7
2 2 4
3 3 4
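Note that the sum above counts positive observations, not separate episodes. If the number of episodes per subject is what you are after, one possible approach is to run rle() within each id so that a run cannot span two subjects. This is only a minimal sketch: the positive values are hard-coded from the toy data printed in the question, and count_episodes is a helper name I made up for illustration.

```r
# Toy data: positive values copied from the example generated with set.seed(1234)
test <- data.frame(
  id = rep(1:3, each = 10),
  positive = c(0, 1, 1, 1, 1, 1, 0, 0, 1, 1,
               1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
)

# Hypothetical helper: number of runs of 1s in a 0/1 vector
count_episodes <- function(x) {
  r <- rle(x)
  sum(r$values == 1)
}

# Apply the helper within each subject so runs are split at subject boundaries
episodes <- tapply(test$positive, test$id, count_episodes)
episodes
```

For this data the result should be 2, 3 and 1 episodes for subjects 1, 2 and 3. That is 6 in total rather than the 5 counted in the question, because the run that crossed from Subject #1 into Subject #2 is now split.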
This might answer your second question, but I would need to see the desired result for each id to check:
set.seed(1234)
test <- data.frame(id = rep(1:3, each = 10), positive = round(runif(30,0,1)))
test$run <- sequence(rle(test$positive)$lengths)
test$run_positive <- ifelse(test$positive == '0', '0', test$run)
test$episode <- ifelse(test$run_positive == '1', '1', '0')
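# Label each row by subject and state (e.g. "101" = id 1, positive 1) so that runs break at subject boundaries
test$group <- paste(test$id*10, test$positive, sep='')
my.seq <- data.frame(rle(test$group)$lengths)
# Position within each group run, counted from the start (first) and from the end (last)
test$first <- unlist(apply(my.seq, 1, function(x) seq(1,x)))
test$last <- unlist(apply(my.seq, 1, function(x) seq(x,1,-1)))
# On the last row of each positive run, record test$run; 0 elsewhere
test$max <- ifelse(test$last == 1 & test$positive==1, test$run, 0)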
test
id positive run run_positive episode group first last max
1 1 0 1 0 0 100 1 1 0
2 1 1 1 1 1 101 1 5 0
3 1 1 2 2 0 101 2 4 0
4 1 1 3 3 0 101 3 3 0
5 1 1 4 4 0 101 4 2 0
6 1 1 5 5 0 101 5 1 5
7 1 0 1 0 0 100 1 2 0
8 1 0 2 0 0 100 2 1 0
9 1 1 1 1 1 101 1 2 0
10 1 1 2 2 0 101 2 1 2
11 2 1 3 3 0 201 1 2 0
12 2 1 4 4 0 201 2 1 4
13 2 0 1 0 0 200 1 1 0
14 2 1 1 1 1 201 1 1 1
15 2 0 1 0 0 200 1 1 0
16 2 1 1 1 1 201 1 1 1
17 2 0 1 0 0 200 1 4 0
18 2 0 2 0 0 200 2 3 0
19 2 0 3 0 0 200 3 2 0
20 2 0 4 0 0 200 4 1 0
21 3 0 5 0 0 300 1 5 0
22 3 0 6 0 0 300 2 4 0
23 3 0 7 0 0 300 3 3 0
24 3 0 8 0 0 300 4 2 0
25 3 0 9 0 0 300 5 1 0
26 3 1 1 1 1 301 1 4 0
27 3 1 2 2 0 301 2 3 0
28 3 1 3 3 0 301 3 2 0
29 3 1 4 4 0 301 4 1 4
30 3 0 1 0 0 300 1 1 0
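One caveat with the output above: run is still computed over the whole data frame, so rows 11 and 12 show run values of 3 and 4 and row 12 gets a max of 4, even though Subject #2 starts with only two positive observations. A possible way around this is to build the run counter from a key that changes whenever either id or positive changes. The following is a rough sketch under that assumption, again with the positive values hard-coded from the toy data; key, run_len and max_run are illustrative names.

```r
# Toy data: positive values copied from the example generated with set.seed(1234)
test <- data.frame(
  id = rep(1:3, each = 10),
  positive = c(0, 1, 1, 1, 1, 1, 0, 0, 1, 1,
               1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
)

# Key that changes when either the subject or the state changes,
# so a run can never continue from one subject into the next
key <- paste(test$id, test$positive)
r <- rle(key)

# Position of each row within its run, restarting at subject boundaries
test$run <- sequence(r$lengths)

# Total length of the run each row belongs to
run_len <- rep(r$lengths, r$lengths)

# Record the run length only on the last row of each positive run
test$max_run <- ifelse(test$positive == 1 & test$run == run_len, run_len, 0)

test[test$max_run > 0, ]
```

With this version the non-zero max_run values should fall on rows 6, 10, 12, 14, 16 and 29, with row 12 now showing 2 instead of 4.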
Source: https://stackoverflow.com/questions/18669123/calculate-run-length-sequence-and-maximum-by-subject-id