I have a data frame in R where the rows represent events, and one column is the date of the event. The thing the event is happening to is described by an ID column. So for e
It's probably a character flaw but I sometimes resist picking up new packages. The "base R" functions can often do the job. In this case I think the alue of the dplyr
package shows through since I stumbled in creating a good solution since the ave
function returned a character value for a logical test, which I still don't understand. So I think dplyr is a real gem. And if I could I'd like to insist that any upvotes be preceded by an upvote to akrun's answer. (It's hard to believe this hasn't already been asked and answered on SO.)
Anyway:
> df[ as.logical(
ave(df$date, df$ID, FUN=function(d) as.Date(d , '%m/%d/%Y') ==
max(as.Date(d, '%m/%d/%Y'))))
, ]
ID date
2 1 03/14/2001
6 2 02/01/2008
7 3 08/22/2011
I thought this should work (fail) :
> df[ ave(df$date, df$ID, FUN=function(d) as.Date(d , '%m/%d/%Y') ==max(as.Date(d, '%m/%d/%Y'))) , ]
ID date
NA NA <NA>
NA.1 NA <NA>
NA.2 NA <NA>
NA.3 NA <NA>
NA.4 NA <NA>
NA.5 NA <NA>
NA.6 NA <NA>
NA.7 NA <NA>
NA.8 NA <NA>
Here's another base R solution that worked the first time with no surprises:
> do.call( rbind, by(df, df$ID, function(d) d[ which.max(as.Date(d$date, '%m/%d/%Y')), ] ) )
ID date
1 1 03/14/2001
2 2 02/01/2008
3 3 08/22/2011
Here's one inspired by @rawr's notion of taking the last one from an ordered subset:
> do.call( rbind, by(df, df$ID, function(d) tail( d[ order(as.Date(d$date, '%m/%d/%Y')), ] ,1)) )
ID date
1 1 03/14/2001
2 2 02/01/2008
3 3 08/22/2011
You can try
library(dplyr)
df %>%
group_by(ID) %>%
slice(which.max(as.Date(date, '%m/%d/%Y')))
df <- data.frame(ID= rep(1:3, each=3), date=c('02/20/1989',
'03/14/2001', '02/25/1990', '04/20/2002', '02/04/2005', '02/01/2008',
'08/22/2011','08/20/2009', '08/25/2010' ), stringsAsFactors=FALSE)
I've never processed any data in R without plyr
!
library(plyr)
ddply(df, .(ID), summarize, most_recent = max(as.Date(date, '%m/%d/%Y')))
ID most_recent
1 1 2001-03-14
2 2 2008-02-01
3 3 2011-08-22
Or you can order the dates and
df <- data.frame(ID= rep(1:3, each=3), date=c('02/20/1989',
'03/14/2001', '02/25/1990', '04/20/2002', '02/04/2005', '02/01/2008',
'08/22/2011','08/20/2009', '08/25/2010' ), stringsAsFactors=FALSE)
df$date <- as.Date(df$date, '%m/%d/%Y')
## make sure to order by both `ID` and `date` as Frank mentions in comments
## since the dates may be overlapping among IDs
df <- df[with(df, order(ID, date)), ]
1) select the last
df[cumsum(table(df$ID)), ]
# ID date
# 2 1 2001-03-14
# 6 2 2008-02-01
# 7 3 2011-08-22
2) or remove the duplicates
df[!duplicated(df$ID, fromLast = TRUE), ]
# ID date
# 2 1 2001-03-14
# 6 2 2008-02-01
# 7 3 2011-08-22
this data brought to you by @akrun
For any solution, you might as well correct your date variable first, as shown by @akrun:
df$date <- as.Date(df$date, '%m/%d/%Y')
Base R
df[
tapply(1:nrow(df),df$ID,function(ii) ii[which.max(df$date[ii])])
,]
This uses a selection of row numbers to subset the data. You can see the selection by running the middle line (between the []
s) on its own.
Data.table
Similar to @rawr's:
require(data.table)
DT <- data.table(df)
unique(DT[order(date)], by="ID", fromLast=TRUE)
# or
unique(DT[order(-date)], by="ID")