Matching multiple date values in R

Question

I have the following data frame DF describing people who have worked on projects starting on certain dates:

ID    ProjectName    StartDate 
1       Health        3/1/06 18:20
2       Education     2/1/07 15:30
1       Education     5/3/09 9:00
3       Wellness      4/1/10 12:00
2       Health        6/1/11 14:20

The goal is to find the first project corresponding to each ID. For example, the expected output would be:

ID    ProjectName    StartDate 
1       Health        3/1/06 18:20
2       Education     2/1/07 15:30
3       Wellness      4/1/10 12:00

So far I have done the following to get the first StartDate for each ID:

sub <- ddply(DF, .(ID), summarise, st = min(as.POSIXct(StartDate)));

After this, I need to match each row in sub with the original DF and extract the projects corresponding to that ID and StartDate. This can be done in a loop for each row in sub. However, my dataset is very large and I would like to know if there is an efficient way to do this matching and extract this subset from DF.

Answer 1:

Here's a data.table solution, which ought to be pretty efficient.

DF <- data.frame(ID=c(1,2,1,3,2,1), ProjectName=c('Health', 'Education', 'Education', 'Wellness', 'Health', 'Health'),
             StartDate=c('3/1/06 18:20', '2/1/07 15:30', '5/3/09 9:00', '4/1/10 12:00', '6/1/11 14:20', '1/1/06 11:10'))

Note that I've modified your data, adding another element at the end, so the dates are no longer sorted. Thus the output differs.

library(data.table)
d <- as.data.table(DF)

# Order by StartDate and take the first row for each ID.
# Assumes that your dates are month/day/year.

d[order(as.POSIXct(StartDate, format="%m/%d/%y %H:%M"))][,.SD[1,],by=ID]
##    ID ProjectName    StartDate
## 1:  1      Health 1/1/06 11:10
## 2:  2   Education 2/1/07 15:30
## 3:  3    Wellness 4/1/10 12:00

If your dates are already sorted (as in your example), this suffices:

d[,.SD[1,],by=ID]
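
The month/day/year assumption in the comment above matters, because a string such as 3/1/06 parses without error under either convention. Here is a minimal base R sketch of the difference; the tz = "UTC" argument is my own addition, used only so the result does not depend on the local time zone:

```r
x <- "3/1/06 18:20"

# Read as month/day/year: 1 March 2006
mdy <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")

# Read as day/month/year: 3 January 2006
dmy <- as.POSIXct(x, format = "%d/%m/%y %H:%M", tz = "UTC")

format(mdy, "%Y-%m-%d")  # "2006-03-01"
format(dmy, "%Y-%m-%d")  # "2006-01-03"
```

Whichever convention your data uses, passing the format explicitly keeps the ordering from depending on a guess.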

Answer 2:

This is fairly straightforward using match because match returns:

a vector of the positions of first matches of its first argument in its second

So all you need to do is sort by date, then use unique to get one instance of each ID and match to find the first position. Thanks to @MatthewLunberg for providing a reproducible example of your data:

DF <- DF[ order(as.POSIXct(DF$StartDate, format="%m/%d/%y %H:%M")) , ]
DF[ match( unique( DF$ID ) , DF$ID ) , ]
#  ID ProjectName    StartDate
#6  1      Health 1/1/06 11:10
#2  2   Education 2/1/07 15:30
#4  3    Wellness 4/1/10 12:00

One advantage is that it retains the row names that the original data frame had before sorting. I do not know whether that is useful to you.
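
To see why the match/unique step picks the first row per ID, here is a small sketch on a bare vector of IDs (the vector ids is made up for illustration and mirrors the ID column of the question's data):

```r
ids <- c(1, 2, 1, 3, 2)

# match() returns, for each unique ID, the position of its first occurrence
first_pos <- match(unique(ids), ids)
first_pos  # 1 2 4

# !duplicated() flags the same first occurrences as a logical vector
which(!duplicated(ids))  # 1 2 4
```

So once the data frame is sorted by date, DF[!duplicated(DF$ID), ] should select the same rows as the match call.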

Answer 3:

Here is a base R solution:

dat <- data.frame(
    ID=c(1,2,1,3,2), 
    PRJ=c("H","E","E", "W", "H"), 
    START=strptime(
      c(
        "3/1/06 18:20", "2/1/07 15:30", "5/3/09 9:00",
        "4/1/10 12:00","6/1/11 14:20"), 
      "%d/%m/%y %H:%M")
    )
min_date <- function(x) {x[which.min(x$START), ]}
s <- split(dat, dat$ID) # split
a <- lapply(s, FUN=min_date) # apply
do.call("rbind", a) # combine

which results in

  ID PRJ               START
1  1   H 2006-01-03 18:20:00
2  2   E 2007-01-02 15:30:00
3  3   W 2010-01-04 12:00:00

However, the order-match solution from @SimonO101 is much faster than this.
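
If you want to check the speed difference on your own machine, here is a rough sketch. The synthetic data frame big and the helper names split_apply and order_match are made up for illustration, and the timings will vary with the data size and hardware:

```r
set.seed(1)
n <- 10000
# Hypothetical synthetic data: random IDs and distinct random timestamps
big <- data.frame(
  ID = sample(1000, n, replace = TRUE),
  START = as.POSIXct("2006-01-01", tz = "UTC") + sample(1e7, n)
)

# split / lapply / rbind approach
split_apply <- function(d) {
  parts <- lapply(split(d, d$ID), function(x) x[which.min(x$START), ])
  do.call("rbind", parts)
}

# order / match approach
order_match <- function(d) {
  d <- d[order(d$START), ]
  d[match(unique(d$ID), d$ID), ]
}

t_split <- system.time(res_split <- split_apply(big))
t_match <- system.time(res_match <- order_match(big))

# Both approaches should select the same rows, only in a different order
res_match <- res_match[order(res_match$ID), ]
same <- all(res_split$ID == res_match$ID) &&
  all(res_split$START == res_match$START)

print(rbind(split_apply = t_split, order_match = t_match))
```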

Just for the fun of it, here is yet another solution using sqldf:

sqldf("select * from dat group by ID having START=min(START)")

Answer 4:

And to round it off, here is a solution based on the plyr package. I have split the date and time into separate columns so that read.table can parse the text passed through textConnection.

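library(plyr)

dfProjects <- read.table(textConnection("ID    ProjectName    Date Time
  1       Health        3/1/06 18:20
  2       Education     2/1/07 15:30
  1       Education     5/3/09 9:00
  3       Wellness      4/1/10 12:00
  2       Health        6/1/11 14:20"), header = TRUE)

# Sort by the combined date-time. with() returns the sorted data frame,
# whereas within() would discard the sorted result and return the input unchanged.
sorted <- with(dfProjects, dfProjects[order(
  as.POSIXct(paste(Date, Time), format = "%m/%d/%y %H:%M")), ])

# Keep the first (earliest) row for each ID.
ddply(sorted, .(ID), function(dataFrame) dataFrame[1, ])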

Source: https://stackoverflow.com/questions/16256342/matching-multiple-date-values-in-r

Tags

matching

plyr