I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).
Now I want all rows with the same ID hav
I did it slightly different to avoid deleting empty rows in the end:
smoothingEpisodes <- function (theData) {
curId <- theData[1, "ID"]
curStart <- theData[1, "START"]
curEnd <- theData[1, "END"]
theLength <- nrow(theData)
out.1 <- integer(length = theLength)
out.2 <- out.3 <- numeric(length = theLength)
j <- 1
for(i in 2:nrow(theData)) {
nextId <- theData[i, "ID"]
nextStart <- theData[i, "START"]
nextEnd <- theData[i, "END"]
if (curId != nextId | (curEnd + 1) < nextStart) {
out.1[j] <- curId
out.2[j] <- curStart
out.3[j] <- curEnd
j <- j + 1
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out.1[j] <- curId
out.2[j] <- curStart
out.3[j] <- curEnd
theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
theOutput
}
quite a big improvement to my original version!
The first [without really thinking to hard about what you are trying to do] optimisation I would suggest is to allocate storage for theOutput
. At the moment, you are growing theOutput
at each iteration of the loop. In R that is an absolute no no!! That is something you never do, unless you like woefully slow code. R has to copy the object and expand it during each iteration and that is slow.
Looking at the code, we know that theOutput
needs to have nrow(theData) - 1
rows, and 3 columns. So create that before the loop starts:
theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))
then fill in this object during the loop:
theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
for example.
It isn't clear what START
and END
are? if these are numerics, then working with a matrix and not a data frame could also improve speed efficiency.
Also, creating a data frame each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could just fill in the bits you want directly, without incurring the data.frame()
call during each iteration:
theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
The best tip I can give you however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data; the size of which is sufficient to give you a bit of run-time to gather useful profiling data without having to wait for ages to get the profiling run completed. To profile in R, use Rprof()
:
Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)
The you can look at the output using
summaryRprof("my_fun_profile.Rprof")
Hadley Wickham (@hadley) has a package to make this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.
Edit: as the OP provided some test data I knocked up something quick to show the speed-up achieved by just following good loop practice:
smoothingEpisodes2 <- function (theData) {
curId <- theData[1, "ID"]
curStart <- theData[1, "START"]
curEnd <- theData[1, "END"]
nr <- nrow(theData)
out1 <- integer(length = nr)
out2 <- out3 <- numeric(length = nr)
for(i in 2:nrow(theData)) {
nextId <- theData[i, "ID"]
nextStart <- theData[i, "START"]
nextEnd <- theData[i, "END"]
if (curId != nextId | (curEnd + 1) < nextStart) {
out1[i-1] <- curId
out2[i-1] <- curStart
out3[i-1] <- curEnd
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out1[i] <- curId
out2[i] <- curStart
out3[i] <- curEnd
theOutput <- data.frame(ID = out1,
START = as.Date(out2, origin = "1970-01-01"),
END = as.Date(out3, origin = "1970-01-01"))
## drop empty
theOutput <- theOutput[-which(theOutput$ID == 0), ]
theOutput
}
Using the test dataset provide in object testData
, I get:
> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
user system elapsed
1.091 0.000 1.131
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
user system elapsed
0.506 0.004 0.517
a 50% speed up. Not dramatic but simple to achieve just by not growing an object at each iteration.
Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.
As a side note, I try to use [[
where applicable, and try to keep integers as integers by writing j <- j + 1L
etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.
smoothingEpisodes3 <- function (theData) {
theLength <- nrow(theData)
if (theLength < 2L) return(theData)
id <- as.integer(theData[["ID"]])
start <- as.numeric(theData[["START"]])
end <- as.numeric(theData[["END"]])
curId <- id[[1L]]
curStart <- start[[1L]]
curEnd <- end[[1L]]
out.1 <- integer(length = theLength)
out.2 <- out.3 <- numeric(length = theLength)
j <- 1L
for(i in 2:nrow(theData)) {
nextId <- id[[i]]
nextStart <- start[[i]]
nextEnd <- end[[i]]
if (curId != nextId | (curEnd + 1) < nextStart) {
out.1[[j]] <- curId
out.2[[j]] <- curStart
out.3[[j]] <- curEnd
j <- j + 1L
curId <- nextId
curStart <- nextStart
curEnd <- nextEnd
} else {
curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
}
}
out.1[[j]] <- curId
out.2[[j]] <- curStart
out.3[[j]] <- curEnd
theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
theOutput
}
Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...
x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957,
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"),
END = structure(c(11047, 11108, 11169, 11261, 11047, 11031,
11062, 11123, 11153), class = "Date")), .Names = c("ID",
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))
system.time( a1 <- smoothingEpisodes(y) ) # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) ) # 0.10 seconds
all.equal( a1, a2 )