R cumulative sum function, conditions

Question

I have a fairly large data frame in R, with 600 rows (observations).

One column is patientId, which is NOT numeric (e.g. ju89, ju87), so it is a factor column.

One column is remission, coded 1/0, where 1 means in remission and 0 means not in remission.

One column is timefromdiagnosis.

As time from diagnosis increases, a patient's remission value can go from 1 to 0, 0 to 0, 0 to 1, or 1 to 1.

I want to add a column to the data frame whose value is:

1 when a patient has 0 in remission
2 when a patient has 1 in remission and fewer than two immediately preceding 1s: the previous observation was 0, OR the previous observation was 1 (but not the previous two), OR it is the patient's first observation
3 when a patient has 1 in remission and also had 1 in remission at the previous 2 or more observations

I looked into doing this with cumsum and plyr, but it doesn't fit what I want to do, or at least it's not clear how to adapt it.

The data frame is already sorted so that each patient's rows are adjacent and, for each patient, time from diagnosis increases as you read down the data frame.

I am unable to supply the data frame due to confidentiality, but here is what it looks like, to clarify things:

remission timefromdiag patientid ...(other variables)

This is the data with which I am starting:

patientId  timefromdiagnosis  remission
ju67       1.2                1
ju67       1.6                0
ju67       3                  0
ju88       1.5                1
ju88       2                  1
ju23       1.9                1
ju23       5                  0

And here is what I want to get; disease stage is the column I want (patient ju38 is an extra example that does not appear in the starting data above):

patientId  timefromdiagnosis  remission  disease stage
ju67       1.2                1          2
ju67       1.6                0          1
ju67       3                  0          1
ju88       1.5                1          2
ju88       2                  1          2
ju23       1.9                1          2
ju23       5                  0          1
ju38       1.7                1          2
ju38       1.9                1          2 
ju38       3                  1          3
ju38       4                  1          3
ju38       5                  0          1

Note how patient ju38 reaches 3 because he has had 3 consecutive remissions including the current one (remission at the last two observations and now). He then stays at 3 because he simply has another remission, and then goes to disease stage 1 because he has a 0 in remission.

Patient ju88 has remission at t=2 and had remission at the previous time, t=1.5, but this is only two consecutive remissions including t=2, so he is at disease stage 2.

Patient ju23 at t=1.9 has a 1 in remission and it is his first observation, so he satisfies the criteria for disease stage 2; if he had a 0 in remission, he would be at disease stage 1.
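
For anyone who wants to reproduce the example, the table above can be rebuilt as a small data frame. This is only a sketch of the illustrative rows, not the real (confidential) data; the object name dat is an assumption chosen for illustration, and patientId is stored as a factor as described.

```r
# Illustrative rows copied from the example table (not the real data).
# The name `dat` is an assumption made for this sketch.
dat <- data.frame(
  patientId = c("ju67", "ju67", "ju67", "ju88", "ju88", "ju23", "ju23",
                "ju38", "ju38", "ju38", "ju38", "ju38"),
  timefromdiagnosis = c(1.2, 1.6, 3, 1.5, 2, 1.9, 5, 1.7, 1.9, 3, 4, 5),
  remission = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0),
  stringsAsFactors = TRUE  # keep patientId as a factor column
)
```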

Answer 1:

You're using the number of consecutive periods for which a patient has been in remission, resetting that counter whenever a patient comes out of remission. I think the run-length encoding of the remission variable is therefore of interest. You can compute it with the rle function:

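dat$diseaseStage <- ave(dat$remission, dat$patientId, FUN=function(x) {
  # For each run of identical values: 2 for its first two positions, 3 afterwards
  ret <- unlist(lapply(rle(x)$lengths, function(y) c(rep(2, min(2, y)), rep(3, max(0, y-2)))))
  # Observations not in remission are always stage 1
  ret[x == 0] <- 1
  ret
})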
dat
#    patientId timefromdiagnosis remission diseaseStage
# 1       ju67               1.2         1            2
# 2       ju67               1.6         0            1
# 3       ju67               3.0         0            1
# 4       ju88               1.5         1            2
# 5       ju88               2.0         1            2
# 6       ju23               1.9         1            2
# 7       ju23               5.0         0            1
# 8       ju38               1.7         1            2
# 9       ju38               1.9         1            2
# 10      ju38               3.0         1            3
# 11      ju38               4.0         1            3
# 12      ju38               5.0         0            1
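
To make the role of the run-length encoding more concrete, here is a minimal sketch on the remission values of patient ju38 alone. It uses sequence() to number the observations within each run, which is a slightly different formulation from the lapply/rep call above but should give the same stages; the variable names are made up for illustration.

```r
# Remission values for patient ju38, in time order (from the example)
x <- c(1, 1, 1, 1, 0)

r <- rle(x)
r$lengths  # length of each run: 4 1
r$values   # value repeated in each run: 1 0

# Position of each observation within its own run: 1 2 3 4 1
pos <- sequence(r$lengths)

# Stage 1 when not in remission; otherwise 2 for the first two
# observations of a remission run and 3 from the third one onward
stage <- ifelse(x == 0, 1, ifelse(pos <= 2, 2, 3))
stage  # 2 2 3 3 1
```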

Note that this works in the more complicated case where a patient comes into and out of remission multiple times:

dat2 <- data.frame(patientId=rep("x", 12), remission=rep(c(1, 0, 1, 0), each=3))

Using the same function, we get:

#    patientId remission diseaseStage
# 1          x         1            2
# 2          x         1            2
# 3          x         1            3
# 4          x         0            1
# 5          x         0            1
# 6          x         0            1
# 7          x         1            2
# 8          x         1            2
# 9          x         1            3
# 10         x         0            1
# 11         x         0            1
# 12         x         0            1

Note that cumsum on its own is insufficient in this case: a plain running total never resets, so it won't pick up on the fact that the patient came out of remission in rows 4-6.
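
As a rough illustration of that point, the sketch below compares a plain cumsum with a running count that restarts after every 0, using the same single-patient remission pattern as dat2. The restarting version is an alternative formulation, not part of the code above, and the variable names are assumptions; with several patients it would have to be applied within each patient.

```r
# Same remission pattern as dat2: three 1s, three 0s, three 1s, three 0s
x <- rep(c(1, 0, 1, 0), each = 3)

# Plain running total keeps growing across the gap in rows 4-6
total <- cumsum(x)
total   # 1 2 3 3 3 3 4 5 6 6 6 6

# Start a new group at every 0, then take cumsum inside each group,
# so the count of consecutive remissions restarts after each 0
streak <- ave(x, cumsum(x == 0), FUN = cumsum)
streak  # 1 2 3 0 0 0 1 2 3 0 0 0

# Map the streak length to the disease stage
stage <- ifelse(streak == 0, 1, ifelse(streak <= 2, 2, 3))
stage   # 2 2 3 1 1 1 2 2 3 1 1 1
```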

Source: https://stackoverflow.com/questions/29796545/r-cumulative-sum-function-conditions

Tags

cumsum