Question
I have the following dataframe DF describing people that have worked on a project on certain dates:
ID ProjectName StartDate
1 Health 3/1/06 18:20
2 Education 2/1/07 15:30
1 Education 5/3/09 9:00
3 Wellness 4/1/10 12:00
2 Health 6/1/11 14:20
The goal is to find the first project for each ID. For this example, the expected output would be:
ID ProjectName StartDate
1 Health 3/1/06 18:20
2 Education 2/1/07 15:30
3 Wellness 4/1/10 12:00
So far I have done the following to get the first StartDate for each ID:
sub <- ddply(DF, .(ID), summarise, st = min(as.POSIXct(StartDate)));
After this, I need to match each row in sub with the original DF and extract the projects corresponding to that ID and StartDate. This can be done in a loop for each row in sub. However, my dataset is very large and I would like to know if there is an efficient way to do this matching and extract this subset from DF.
Answer 1:
Here's a data.table solution, which ought to be pretty efficient.
DF <- data.frame(ID=c(1,2,1,3,2,1), ProjectName=c('Health', 'Education', 'Education', 'Wellness', 'Health', 'Health'),
StartDate=c('3/1/06 18:20', '2/1/07 15:30', '5/3/09 9:00', '4/1/10 12:00', '6/1/11 14:20', '1/1/06 11:10'))
Note that I've modified your data, adding another element at the end, so the dates are no longer sorted. Thus the output differs.
library(data.table)
d <- as.data.table(DF)
# Order by StartDate and take the first row for each ID.
# Assumes that your dates are month/day/year.
d[order(as.POSIXct(StartDate, format="%m/%d/%y %H:%M"))][,.SD[1,],by=ID]
## ID ProjectName StartDate
## 1: 1 Health 1/1/06 11:10
## 2: 2 Education 2/1/07 15:30
## 3: 3 Wellness 4/1/10 12:00
If your dates are already sorted (as in your example), this suffices:
d[,.SD[1,],by=ID]
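The month/day/year assumption matters because StartDate holds character strings, and sorting those as text does not give chronological order. A minimal base R sketch using the sample values above (the names `raw` and `parsed`, and the choice of `tz = "UTC"`, are assumptions made here for illustration):

```r
# StartDate strings from the modified example (assumed month/day/year).
raw <- c("3/1/06 18:20", "2/1/07 15:30", "5/3/09 9:00",
         "4/1/10 12:00", "6/1/11 14:20", "1/1/06 11:10")

# Parse with an explicit format. tz = "UTC" is an assumption, used only
# so the result does not depend on the local time zone.
parsed <- as.POSIXct(raw, format = "%m/%d/%y %H:%M", tz = "UTC")

order(raw)     # text order of the strings
# [1] 6 2 1 4 3 5
order(parsed)  # chronological order
# [1] 6 1 2 3 4 5
```

The two orderings differ, which is why the dates are parsed before ordering.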
Answer 2:
This is fairly straightforward using match, because match returns:
"a vector of the positions of first matches of its first argument in its second"
So all you need to do is sort by date, then use unique to get one instance of each ID and match to find the position of its first occurrence. Thanks to @MatthewLunberg for providing a reproducible example of your data:
DF <- DF[ order(as.POSIXct(DF$StartDate, format="%m/%d/%y %H:%M")) , ]
DF[ match( unique( DF$ID ) , DF$ID ) , ]
# ID ProjectName StartDate
#6 1 Health 1/1/06 11:10
#2 2 Education 2/1/07 15:30
#4 3 Wellness 4/1/10 12:00
One advantage is that it retains the row names that each row had in the original data frame, before it was re-sorted. I do not know whether this is useful to you.
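To see why this works, it may help to try match on a small vector. A minimal sketch (the vector `ids` is made up for illustration, and `duplicated` is shown only as an equivalent base R idiom, not as part of the answer above):

```r
# Hypothetical ID column, assumed to be already sorted by date.
ids <- c(1, 1, 2, 3, 2)

# Position of the first occurrence of each distinct ID.
first_pos <- match(unique(ids), ids)
first_pos
# [1] 1 3 4

# The same positions via duplicated(), which flags repeats of earlier values.
which(!duplicated(ids))
# [1] 1 3 4
```

Indexing the sorted data frame with these positions keeps exactly one row per ID, the earliest one.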
Answer 3:
Here is a base R solution:
dat <- data.frame(
ID=c(1,2,1,3,2),
PRJ=c("H","E","E", "W", "H"),
START=strptime(
c(
"3/1/06 18:20", "2/1/07 15:30", "5/3/09 9:00",
"4/1/10 12:00","6/1/11 14:20"),
"%d/%m/%y %H:%M")
)
min_date <- function(x) {x[which.min(x$START), ]}
s <- split(dat, dat$ID) # split
a <- lapply(s, FUN=min_date) # apply
do.call("rbind", a) # combine
which results in
ID PRJ START
1 1 H 2006-01-03 18:20:00
2 2 E 2007-01-02 15:30:00
3 3 W 2010-01-04 12:00:00
However, the order-match solution from @SimonO101 is much faster than this.
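The speed difference can be checked on your own data. A rough sketch on synthetic data (the size, the seed, the column names, and the helper names `by_split` and `by_match` are all assumptions made here); it confirms that both approaches pick the same earliest START per ID and leaves the timing itself to `system.time`:

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 2000
big <- data.frame(
  ID    = sample(1:200, n, replace = TRUE),
  # sample.int() without replacement gives distinct start times
  START = as.POSIXct("2006-01-01", tz = "UTC") + sample.int(1e6, n)
)

# split / lapply / rbind, as in this answer
by_split <- function(d) {
  do.call("rbind", lapply(split(d, d$ID), function(x) x[which.min(x$START), ]))
}

# order + match, as in the previous answer
by_match <- function(d) {
  d <- d[order(d$START), ]
  d[match(unique(d$ID), d$ID), ]
}

system.time(res_split <- by_split(big))
system.time(res_match <- by_match(big))

# Compare the results after sorting the order-match result by ID.
res_match <- res_match[order(res_match$ID), ]
identical(as.numeric(res_split$START), as.numeric(res_match$START))
```

Timings depend on the machine and on the number of distinct IDs, so no figures are quoted here.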
Just for the fun of it, here is yet another solution using sqldf:
sqldf("select * from dat group by ID having START=min(START)")
Answer 4:
And to round it off, here is a solution based on the plyr package. I have added an extra column (date and time are read as separate columns) to make it easier to read the data through textConnection.
dfProjects = as.data.frame(read.table(textConnection("ID ProjectName Date Time
1 Health 3/1/06 18:20
2 Education 2/1/07 15:30
1 Education 5/3/09 9:00
3 Wellness 4/1/10 12:00
2 Health 6/1/11 14:20"), header = TRUE))
library(plyr)
# Sort by the parsed date-time, then keep the first row of each ID.
sorted <- dfProjects[with(dfProjects, order(
as.POSIXct(paste(Date, Time), format = "%m/%d/%y %H:%M"))), ]
ddply(sorted, .(ID), function(dataFrame) dataFrame[1, ])
Source: https://stackoverflow.com/questions/16256342/matching-multiple-date-values-in-r