Select row with most recent date by group

南方客 2020-12-01 14:42

I have a data frame in R where the rows represent events, and one column is the date of the event. The thing the event is happening to is described by an ID column. So for e

5 answers
  • 2020-12-01 15:15

    It's probably a character flaw, but I sometimes resist picking up new packages. The "base R" functions can often do the job. In this case, though, I think the value of the dplyr package shows through: I stumbled while building a good solution because the ave function returned a character value for a logical test, which I still don't understand. So I think dplyr is a real gem, and if I could, I'd insist that any upvote here be preceded by an upvote to akrun's answer. (It's hard to believe this hasn't already been asked and answered on SO.)

    Anyway:

    > df[ as.logical(
            ave(df$date, df$ID, FUN=function(d) as.Date(d , '%m/%d/%Y') == 
                                                 max(as.Date(d, '%m/%d/%Y'))))
          , ]
      ID       date
    2  1 03/14/2001
    6  2 02/01/2008
    7  3 08/22/2011
    

    I thought this should work, but it fails:

    > df[ ave(df$date, df$ID, FUN=function(d) as.Date(d , '%m/%d/%Y') ==max(as.Date(d, '%m/%d/%Y'))) , ]
         ID date
    NA   NA <NA>
    NA.1 NA <NA>
    NA.2 NA <NA>
    NA.3 NA <NA>
    NA.4 NA <NA>
    NA.5 NA <NA>
    NA.6 NA <NA>
    NA.7 NA <NA>
    NA.8 NA <NA>
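
    A likely explanation for the surprise, based on how base R's ave is written: it stores the results of FUN back into a copy of its first argument. When that argument is the character column df$date, the logical results are coerced to the strings "TRUE" and "FALSE", and indexing rows by those strings looks up row names that do not exist, hence the NA rows. A minimal sketch, assuming the example df posted in this thread (num_date, flag_chr, flag_num and latest are illustrative names); feeding numeric dates to ave avoids the character coercion:

    ```r
    df <- data.frame(ID = rep(1:3, each = 3),
                     date = c('02/20/1989', '03/14/2001', '02/25/1990',
                              '04/20/2002', '02/04/2005', '02/01/2008',
                              '08/22/2011', '08/20/2009', '08/25/2010'),
                     stringsAsFactors = FALSE)

    # ave() fills a copy of its first argument: character in, character out
    flag_chr <- ave(df$date, df$ID,
                    FUN = function(d) as.Date(d, '%m/%d/%Y') ==
                                      max(as.Date(d, '%m/%d/%Y')))
    class(flag_chr)   # "character"

    # With numeric dates the result stays numeric (1 = group maximum, 0 = not)
    num_date <- as.numeric(as.Date(df$date, '%m/%d/%Y'))
    flag_num <- ave(num_date, df$ID, FUN = function(d) d == max(d))
    latest <- df[flag_num == 1, ]
    ```

    Note that this keeps every row tied for the latest date within an ID, whereas the which.max variants keep only the first.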
    

    Here's another base R solution that worked the first time with no surprises:

    > do.call( rbind, by(df, df$ID, function(d) d[ which.max(as.Date(d$date, '%m/%d/%Y')), ] ) )
      ID       date
    1  1 03/14/2001
    2  2 02/01/2008
    3  3 08/22/2011
    

    Here's one inspired by @rawr's notion of taking the last one from an ordered subset:

    > do.call( rbind, by(df, df$ID, function(d) tail( d[ order(as.Date(d$date, '%m/%d/%Y')), ] ,1)) )
      ID       date
    1  1 03/14/2001
    2  2 02/01/2008
    3  3 08/22/2011
    
  • 2020-12-01 15:25

    You can try

    library(dplyr)
    df %>% 
      group_by(ID) %>%
      slice(which.max(as.Date(date, '%m/%d/%Y')))
    

    data

    df <- data.frame(ID= rep(1:3, each=3), date=c('02/20/1989',
    '03/14/2001', '02/25/1990',  '04/20/2002', '02/04/2005', '02/01/2008',
    '08/22/2011','08/20/2009', '08/25/2010' ), stringsAsFactors=FALSE)
    
  • 2020-12-01 15:33

    I've never processed any data in R without plyr!

    library(plyr)
    ddply(df, .(ID), summarize, most_recent = max(as.Date(date, '%m/%d/%Y')))
    
       ID most_recent
    1  1  2001-03-14
    2  2  2008-02-01
    3  3  2011-08-22
    
  • 2020-12-01 15:34

    Or you can order the dates and then either take the last row of each ID or drop the duplicated IDs:

    df <- data.frame(ID = rep(1:3, each = 3),
                     date = c('02/20/1989', '03/14/2001', '02/25/1990',
                              '04/20/2002', '02/04/2005', '02/01/2008',
                              '08/22/2011', '08/20/2009', '08/25/2010'),
                     stringsAsFactors = FALSE)
    
    df$date <- as.Date(df$date, '%m/%d/%Y')
    
    ## make sure to order by both `ID` and `date` as Frank mentions in comments
    ## since the dates may be overlapping among IDs
    
    df <- df[with(df, order(ID, date)), ]
    

    1) select the last

    df[cumsum(table(df$ID)), ]
    
    #   ID       date
    # 2  1 2001-03-14
    # 6  2 2008-02-01
    # 7  3 2011-08-22
    

    2) or remove the duplicates

    df[!duplicated(df$ID, fromLast = TRUE), ]
    
    #   ID       date
    # 2  1 2001-03-14
    # 6  2 2008-02-01
    # 7  3 2011-08-22
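
    Both options depend on df already being sorted by ID and then date. A small sketch of why option 1 works, using the same example data (counts, last_pos, by_position and by_duplicate are illustrative names): table() counts the rows per ID, and cumsum() turns those counts into the position of the last row of each group.

    ```r
    df <- data.frame(ID = rep(1:3, each = 3),
                     date = c('02/20/1989', '03/14/2001', '02/25/1990',
                              '04/20/2002', '02/04/2005', '02/01/2008',
                              '08/22/2011', '08/20/2009', '08/25/2010'),
                     stringsAsFactors = FALSE)
    df$date <- as.Date(df$date, '%m/%d/%Y')
    df <- df[order(df$ID, df$date), ]

    counts   <- table(df$ID)     # rows per ID: 3, 3, 3
    last_pos <- cumsum(counts)   # position of each group's last row: 3, 6, 9

    # Both selections pick original rows 2, 6 and 7
    by_position  <- df[last_pos, ]
    by_duplicate <- df[!duplicated(df$ID, fromLast = TRUE), ]
    ```

    If the data were sorted by date alone, the rows at positions 3, 6 and 9 would no longer be group boundaries, which is why the ordering step uses both columns.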
    

    The example data are courtesy of @akrun.

  • 2020-12-01 15:35

    For any solution, you might as well correct your date variable first, as shown by @akrun:

    df$date <- as.Date(df$date, '%m/%d/%Y')
    

    Base R

    df[
      tapply(seq_len(nrow(df)), df$ID, function(ii) ii[which.max(df$date[ii])])
    ,]
    

    This uses a selection of row numbers to subset the data. You can see the selection by running the middle line (between the []s) on its own.
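
    To make that concrete, here is the index step run on its own with the example data from this thread (a sketch; idx is just an illustrative name):

    ```r
    df <- data.frame(ID = rep(1:3, each = 3),
                     date = c('02/20/1989', '03/14/2001', '02/25/1990',
                              '04/20/2002', '02/04/2005', '02/01/2008',
                              '08/22/2011', '08/20/2009', '08/25/2010'),
                     stringsAsFactors = FALSE)
    df$date <- as.Date(df$date, '%m/%d/%Y')

    # For each ID, tapply() passes that group's row numbers (ii) to the function;
    # which.max() then picks the position of the latest date within the group
    idx <- tapply(seq_len(nrow(df)), df$ID,
                  function(ii) ii[which.max(df$date[ii])])
    idx        # one row number per ID: 2, 6, 7

    df[idx, ]  # the full rows for those positions
    ```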

    Data.table

    Similar to @rawr's:

    library(data.table)
    DT <- data.table(df)
    
    unique(DT[order(date)], by="ID", fromLast=TRUE)
    # or
    unique(DT[order(-date)], by="ID")
    