Question
I have a fairly large data frame in R (600 rows/observations) with the following columns:
- patientId: not numeric (e.g. ju89, ju87), so it is a factor column
- remission: 1/0, where 1 means remission and 0 means not in remission
- timefromdiagnosis
Over the time from diagnosis, a patient can go from 1 to 0, 0 to 0, 0 to 1 or 1 to 1.
I want to add a column to the data frame whose value is:
- 1 when a patient has 0 in remission
- 2 when a patient has 1 in remission and either had 0 at the previous observation, or had 1 at the previous observation only (exactly two consecutive 1s including the current one), or this is the patient's first observation
- 3 when a patient has 1 in remission and also had 1 in remission at the previous two or more observations
I looked into doing this with cumsum in plyr, but it doesn't fit what I want to do, or it's not clear to me how to adapt it.
The data frame is already sorted so that rows for the same patient ID are adjacent and, within each patient, time from diagnosis increases as you read down the data frame.
I am unable to supply the data frame due to confidentiality, but here is what it looks like, to clarify things:
remission timefromdiag patientid ...(other variables)
This is the data with which I am starting:
patientId  timefromdiagnosis  remission
ju67       1.2                1
ju67       1.6                0
ju67       3                  0
ju88       1.5                1
ju88       2                  1
ju23       1.9                1
ju23       5                  0
ju38       1.7                1
ju38       1.9                1
ju38       3                  1
ju38       4                  1
ju38       5                  0
And here is what I want to get; diseasestage is the column I want:
patientId  timefromdiagnosis  remission  diseasestage
ju67       1.2                1          2
ju67       1.6                0          1
ju67       3                  0          1
ju88       1.5                1          2
ju88       2                  1          2
ju23       1.9                1          2
ju23       5                  0          1
ju38       1.7                1          2
ju38       1.9                1          2
ju38       3                  1          3
ju38       4                  1          3
ju38       5                  0          1
Note how patient ju38 reaches 3 because he has had three consecutive remissions including the current one (remission at the last two observations and now). He then stays at 3 because he simply has another remission, and then goes to disease stage 1 because he has a 0 in remission.
Patient ju88 has remission at t=2 and had remission at the previous time, t=1.5, but this is only two consecutive remissions including t=2, so he is at disease stage 2.
Patient ju23, at t=1.9, has a 1 in remission and it is his first observation, so he satisfies the criteria for disease stage 2; if he had a 0 in remission he would be at disease stage 1.
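Since the real data cannot be shared, the example rows can be typed in as a small data frame for testing. This is only a minimal sketch: the names `dat` and `expected` are assumptions chosen for illustration, the column names are taken from the tables in the question, and the ju38 rows are read off the desired-output table.

```r
# Sample rows copied from the tables in the question (12 observations).
dat <- data.frame(
  patientId = c("ju67", "ju67", "ju67", "ju88", "ju88", "ju23", "ju23",
                "ju38", "ju38", "ju38", "ju38", "ju38"),
  timefromdiagnosis = c(1.2, 1.6, 3, 1.5, 2, 1.9, 5, 1.7, 1.9, 3, 4, 5),
  remission = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0),
  stringsAsFactors = TRUE  # patientId is described as a factor column
)

# Desired disease stage for each row, as listed in the question.
# Any candidate solution can be compared against this vector.
expected <- c(2, 1, 1, 2, 2, 2, 1, 2, 2, 3, 3, 1)
```

A candidate solution that writes its result to a column can then be checked with something like `all(dat$diseaseStage == expected)`.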
Answer 1:
You're using the number of consecutive periods for which a patient has been in remission, resetting that counter whenever a patient comes out of remission. I think the run-length encoding of the remission variable is therefore of interest. You can compute it with the rle function:
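# Compute the stage separately for each patient
dat$diseaseStage <- ave(dat$remission, dat$patientId, FUN=function(x) {
  # Within each run of equal values, the first two rows get 2 and any later rows get 3
  ret <- unlist(lapply(rle(x)$lengths, function(y) c(rep(2, min(2, y)), rep(3, max(0, y-2)))))
  # Rows that are not in remission are always stage 1
  ret[x == 0] <- 1
  ret
})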
dat
#    patientId timefromdiagnosis remission diseaseStage
# 1       ju67               1.2         1            2
# 2       ju67               1.6         0            1
# 3       ju67               3.0         0            1
# 4       ju88               1.5         1            2
# 5       ju88               2.0         1            2
# 6       ju23               1.9         1            2
# 7       ju23               5.0         0            1
# 8       ju38               1.7         1            2
# 9       ju38               1.9         1            2
# 10      ju38               3.0         1            3
# 11      ju38               4.0         1            3
# 12      ju38               5.0         0            1
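To see what the function does inside each patient group, it may help to run the same steps on a single remission vector. The sketch below uses ju38's values from the question; the variable names `x`, `r` and `stage` are only illustrative.

```r
# ju38's remission values, in time order
x <- c(1, 1, 1, 1, 0)

# Run-length encoding: one entry per run of identical values
r <- rle(x)
r$lengths  # 4 1  (a run of four 1s, then a run of one 0)
r$values   # 1 0

# Expand each run: the first two positions get 2, later positions get 3
stage <- unlist(lapply(r$lengths, function(n) {
  c(rep(2, min(2, n)), rep(3, max(0, n - 2)))
}))

# Runs of 0s were also filled with 2s and 3s, so overwrite them with 1
stage[x == 0] <- 1
stage  # 2 2 3 3 1
```

The final overwrite is what makes it safe to treat runs of 0s and runs of 1s identically in the expansion step.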
Note that this works in the more complicated case where a patient comes into and out of remission multiple times:
dat2 <- data.frame(patientId=rep("x", 12), remission=rep(c(1, 0, 1, 0), each=3))
Using the same function, we get:
#    patientId remission diseaseStage
# 1          x         1            2
# 2          x         1            2
# 3          x         1            3
# 4          x         0            1
# 5          x         0            1
# 6          x         0            1
# 7          x         1            2
# 8          x         1            2
# 9          x         1            3
# 10         x         0            1
# 11         x         0            1
# 12         x         0            1
Note that it's insufficient to use cumsum in this case because it won't pick up on the fact that we came out of remission in rows 4-6.
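The point about cumsum can be checked directly on the same remission pattern. The sketch below also shows one possible alternative, not part of the answer above, that builds a counter restarting at every run with `sequence()`; the names `x`, `run_pos` and `stage` are illustrative only.

```r
# Same remission pattern as dat2: three 1s, three 0s, three 1s, three 0s
x <- rep(c(1, 0, 1, 0), each = 3)

# A plain cumulative sum never resets, so the second remission
# episode continues counting from where the first one stopped.
cumsum(x)
# 1 2 3 3 3 3 4 5 6 6 6 6

# sequence() applied to the run lengths gives a position-within-run counter
# that restarts at 1 each time the value changes.
run_pos <- sequence(rle(x)$lengths)
# 1 2 3 1 2 3 1 2 3 1 2 3

# Stage 1 for 0s; for 1s, stage 3 from the third consecutive 1 onwards, else 2
stage <- ifelse(x == 0, 1, ifelse(run_pos >= 3, 3, 2))
# 2 2 3 1 1 1 2 2 3 1 1 1
```

For a data frame with several patients, the last two steps would still need to be applied per patient, for example inside the function passed to `ave()`.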
Source: https://stackoverflow.com/questions/29796545/r-cumulative-sum-function-conditions