select last observation from longitudinal data


I have a data set with several time assessments for each participant. I want to select the last assessment for each participant. My dataset looks like this:


                      
6 Answers

难免孤独
2020-12-28 10:02

I've been trying to use split and tapply more to get better acquainted with them. I know this question has already been answered, but I thought I'd add another solution using split (pardon the ugliness; I'm more than open to feedback for improvement, and I suspect tapply could shorten the code):

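# Split the data frame into one piece per ID
sdf <- with(df, split(df, ID))
# Within each piece, find the row position of the largest week
max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
# Pull that row out of each piece and combine the results
data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))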


With 7 answers here, I also figured this was ripe for a benchmark. The results may surprise you (using rbenchmark with R 2.14.1 on a Windows 7 machine):

# library(rbenchmark)
# benchmark(
#     DATA.TABLE= {dt <- data.table(df, key="ID")
#         dt[, .SD[which.max(outcome),], by=ID]},
#     DO.CALL={do.call("rbind", 
#         by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))},
#     PLYR=ddply(df, .(ID), function(X) X[which.max(X$week), ]),
#     SPLIT={sdf <-with(df, split(df, ID))
#         max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
#         data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))},
#     MATCH.INDEX=df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ],
#     AGGREGATE=df[cumsum(aggregate(week ~ ID, df, which.max)$week), ],
#     #WHICH.MAX.INDEX=df[sapply(unique(df$ID), function(x) which.max(x==df$ID)), ],
#     BRYANS.INDEX = df[cumsum(as.numeric(lapply(split(df$week, df$ID), 
#         which.max))), ],
#     SPLIT2={sdf <-with(df, split(df, ID))
#         df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))),
#         ]},
#     TAPPLY=df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),],
# columns = c( "test", "replications", "elapsed", "relative", "user.self","sys.self"), 
# order = "test", replications = 1000, environment = parent.frame())

          test replications elapsed  relative user.self sys.self
6    AGGREGATE         1000    4.49  7.610169      2.84     0.05
7 BRYANS.INDEX         1000    0.59  1.000000      0.20     0.00
1   DATA.TABLE         1000   20.28 34.372881     11.98     0.00
2      DO.CALL         1000    4.67  7.915254      2.95     0.03
5  MATCH.INDEX         1000    1.07  1.813559      0.51     0.00
3         PLYR         1000   10.61 17.983051      5.07     0.00
4        SPLIT         1000    3.12  5.288136      1.81     0.00
8       SPLIT2         1000    1.56  2.644068      1.28     0.00
9       TAPPLY         1000    1.08  1.830508      0.88     0.00


Edit 1: I omitted the WHICH.MAX.INDEX solution because it does not return the correct results. I also added an AGGREGATE solution that I wanted to use (courtesy of Bryan Goodrich) and an updated version of split, SPLIT2, that uses cumsum (I liked that move).

Edit 2:  Dason also chimed in with a tapply solution I threw into the test that fared pretty well too.
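
Before trusting timings like these, it is worth checking that the competing one-liners select the same rows. The sketch below is only an illustration: it rebuilds the example data as a plain data frame from the values given in the data.table answer in this thread, then runs three of the benchmarked expressions and compares the row names they return. The object names (tapply_rows, match_rows, bryans_rows, picked) are made up for this example.

```r
# Data rebuilt from the values given in the data.table answer in this thread
df <- data.frame(
  ID      = c(rep(1, 3), rep(4, 5), rep(9, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

# Three of the benchmarked expressions, copied from the benchmark call
tapply_rows <- df[tapply(seq_along(df$ID), df$ID, function(x) tail(x, 1)), ]
match_rows  <- df[rev(rownames(df)), ][match(unique(df$ID), rev(df$ID)), ]
bryans_rows <- df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ]

# Compare which original rows each approach selected
picked <- list(TAPPLY       = rownames(tapply_rows),
               MATCH.INDEX  = rownames(match_rows),
               BRYANS.INDEX = rownames(bryans_rows))
picked
```

On this data all three entries should list the same row names, which is what makes the timing comparison meaningful.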
                                                                        
                                                        
            
            
              
                
半阙折子戏
2020-12-28 10:03

I can play this game. I ran some benchmarks on differences between lapply, sapply, and by, among other things. It appears to me that the more you're in control of data types and the more basic the operation, the faster it is (e.g., lapply is generally faster than sapply, and as.numeric(lapply(...)) is going to be faster, also). With that in mind, this produced the same results as above and may be faster than the rest.

df[cumsum(as.numeric(lapply(split(df$week, df$id), which.max))), ]


Explanation: we only want which.max on the week per each id. That handles the contents of lapply. We only need the vector of these relative points, so make it numeric. The result is the vector (3, 5, 5). We need to add the positions of the prior maxes. This is accomplished with cumsum. 

It should be noted that this solution is not general, because of the cumsum step: it only works when the row with the maximum week is the last row of each id block. That may require sorting the frame on id and week before execution. I hope you understand why (and know how to use with(df, order(id, week)) in the row index to achieve that). Even then, it may still fail if we don't have a unique max, because which.max only returns the first one. Therefore, my solution is a bit question begging, but that goes without saying. We're trying to extract very specific information for a very specific example. Our solutions can't be general (even though the methods are important to understand generally).

I'll leave it to trinker to update his comparisons!  
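
To make the cumsum step concrete, here is a small walkthrough of the intermediate values. It is a sketch under assumptions: the data frame is rebuilt from the values given in the data.table answer in this thread, the column is named id as in the one-liner above, and within_pos and last_rows are names invented for the illustration.

```r
# Data rebuilt from the values given in the data.table answer in this thread
df <- data.frame(
  id      = c(rep(1, 3), rep(4, 5), rep(9, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

# Sort first so each id forms one contiguous block with ascending weeks
df <- df[with(df, order(id, week)), ]

# Position of the max week *within* each id block
within_pos <- as.numeric(lapply(split(df$week, df$id), which.max))
within_pos   # 3 5 5

# The max is the last row of every block here, so each within-block
# position equals the block size and cumsum yields absolute row numbers
last_rows <- cumsum(within_pos)
last_rows    # 3 8 13

df[last_rows, ]
```

If the maximum were not the last row of some block, within_pos would no longer equal the block sizes and the running sum would point at the wrong rows, which is the reason for sorting first.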
                                                                        
                                                        
            
            
              
                
栀梦
2020-12-28 10:07

Here is one base-R approach:

do.call("rbind", 
        by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
  ID week outcome
1  1    6      42
4  4   12      85
9  9   12      84




Alternatively, the data.table package offers a succinct and expressive language for performing data frame manipulations of this type:

library(data.table)
dt <- data.table(df, key="ID")

dt[, .SD[which.max(outcome), ], by=ID] 
#      ID week outcome
# [1,]  1    6      42
# [2,]  4   12      85
# [3,]  9   12      84

# Same but much faster. 
# (Actually, only the same as long as there are no ties for max(outcome)..)
dt[ dt[,outcome==max(outcome),by=ID][[2]] ]   # same, but much faster.

# If there are ties for max(outcome), the following will still produce
# the same results as the method using .SD, but will be faster
i1 <- dt[,which.max(outcome), by=ID][[2]]
i2 <- dt[,.N, by=ID][[2]]
dt[i1 + cumsum(i2) - i2,]
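
The index arithmetic in that last snippet (i1 + cumsum(i2) - i2) can be traced without data.table. The following base-R sketch uses a hypothetical toy data frame, not the data from the question, in which the maximum outcome is deliberately not the last row of each group; tapply and table stand in for the grouped data.table calls.

```r
# Hypothetical toy data: the max outcome is NOT the last row of each group.
# Rows are already sorted by ID, so each group is one contiguous block.
toy <- data.frame(
  ID      = c(1, 1, 1, 4, 4),
  outcome = c(10, 30, 20, 7, 5)
)

# i1: position of the max within each group; i2: group sizes
i1 <- as.integer(tapply(toy$outcome, toy$ID, which.max))
i2 <- as.integer(table(toy$ID))

# cumsum(i2) - i2 counts the rows that precede each group, so adding i1
# turns within-group positions into absolute row numbers
offset <- cumsum(i2) - i2
rows   <- i1 + offset
rows         # 2 4

toy[rows, ]
```

Because the offset is built from group sizes rather than from the positions themselves, this version does not rely on the maximum being the last row of its group.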




Finally, here is a plyr-based solution:

library(plyr)

ddply(df, .(ID), function(X) X[which.max(X$week), ])
#   ID week outcome
# 1  1    6      42
# 2  4   12      85
# 3  9   12      84

                                                                        
                                                        
            
            
              
                
醉梦人生
2020-12-28 10:11

Another option in base:

df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ]
                                                                        
                                                        
            
            
              
                
梦如初夏
2020-12-28 10:14

This answer uses the data.table package. It should be very fast, even with larger data sets.

setkey(DT, ID, week)              # Ensure it's sorted.
DT[DT[, .I[.N], by = ID][, V1]]


Explanation: .I is an integer vector holding the row locations for the group (in this case the group is ID). .N is a length-one integer vector containing the number of rows in the group. So what we're doing here is to extract the location of the last row for each group, using the "inner" DT[.], using the fact that the data is sorted according to ID and week. Afterwards we use that to subset the "outer" DT[.].

For comparison (because it's not posted elsewhere), here's how you can generate the original data so that you can run the code:

library(data.table)

DT <- 
  data.table(
    ID = c(rep(1, 3), rep(4, 5), rep(9, 5)),
    week = c(2,4,6, 2,6,9,9,12, 2,4,6,9,12), 
    outcome = c(14,28,42, 14,46,64,71,85, 14,28,51,66,84))
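
For readers without data.table installed, the idea behind .I[.N] can be mirrored in base R. This is only a rough analogue, not what data.table does internally: seq_len(nrow(df)) plays the role of .I, length(i) plays the role of .N, and df and last_idx are names chosen for this sketch.

```r
# Same values as the DT object above, stored as a plain data frame
df <- data.frame(
  ID      = c(rep(1, 3), rep(4, 5), rep(9, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

# Counterpart of setkey(DT, ID, week): sort by ID, then week
df <- df[order(df$ID, df$week), ]

# For each ID, take the row locations (like .I) and keep the one at
# position length(i) (like .N), i.e. the last row of the group
last_idx <- tapply(seq_len(nrow(df)), df$ID, function(i) i[length(i)])

df[last_idx, ]
```

As in the data.table version, the result is only the last assessment per ID because the rows were sorted by ID and week first.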

                                                                        
                                                        
            
            
              
                
心在旅途
2020-12-28 10:15

If you're just looking for the last observation per person ID, then two simple lines of code should do it. I always favor a simple base solution when possible, although it is always great to have more than one way to solve a problem.

dat <- dat[order(dat$ID, dat$Week), ]       # Sort by ID and week (assign the result back)
dat[!duplicated(dat$ID, fromLast = TRUE), ] # Keep last observation per ID

   ID Week Outcome
3   1    6      42
8   4   12      85
13  9   12      84
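
A self-contained run of the two lines above might look like the sketch below. The data frame is rebuilt from the values given in the data.table answer in this thread, using the capitalized column names of this answer, and the rows are reversed first (an assumption made purely for illustration) to show that the sort has to be stored before duplicated is applied.

```r
# Data rebuilt from the values given in the data.table answer in this thread
dat <- data.frame(
  ID      = c(rep(1, 3), rep(4, 5), rep(9, 5)),
  Week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  Outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

# Reverse the rows so the data no longer arrive in time order
dat <- dat[nrow(dat):1, ]

# Sort by ID and week, and keep the sorted copy
dat <- dat[order(dat$ID, dat$Week), ]

# duplicated(..., fromLast = TRUE) flags every row except the last one
# per ID, so negating it keeps exactly the last observation
last_obs <- dat[!duplicated(dat$ID, fromLast = TRUE), ]
last_obs
```

Without the assignment in the sorting step, duplicated would run on the unsorted rows and return whatever row happens to come last for each ID.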

                                                                        
                                                        
            
            
              
                