“smoothing” time data - can it be done more efficient?

后端未结

关注

 3  752

I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).

Now I want all rows with the same ID hav

相关标签:

3条回答

南方客

2020-12-22 11:22

I did it slightly different to avoid deleting empty rows in the end:

smoothingEpisodes <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    theLength <- nrow(theData)

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[j] <- curId
            out.2[j] <- curStart
            out.3[j] <- curEnd

            j <- j + 1

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[j] <- curId
    out.2[j] <- curStart
    out.3[j] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

quite a big improvement to my original version!

0 讨论(0)

不要未来只要你来

2020-12-22 11:25
The first [without really thinking to hard about what you are trying to do] optimisation I would suggest is to allocate storage for theOutput. At the moment, you are growing theOutput at each iteration of the loop. In R that is an absolute no no!! That is something you never do, unless you like woefully slow code. R has to copy the object and expand it during each iteration and that is slow.

Looking at the code, we know that theOutput needs to have nrow(theData) - 1 rows, and 3 columns. So create that before the loop starts:
```
theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))
```
then fill in this object during the loop:
```
theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
```
for example.

It isn't clear what START and END are? if these are numerics, then working with a matrix and not a data frame could also improve speed efficiency.

Also, creating a data frame each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could just fill in the bits you want directly, without incurring the data.frame() call during each iteration:
```
theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
```
The best tip I can give you however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data; the size of which is sufficient to give you a bit of run-time to gather useful profiling data without having to wait for ages to get the profiling run completed. To profile in R, use Rprof():
```
Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)
```
The you can look at the output using
```
summaryRprof("my_fun_profile.Rprof")
```
Hadley Wickham (@hadley) has a package to make this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.

Edit: as the OP provided some test data I knocked up something quick to show the speed-up achieved by just following good loop practice:
```
smoothingEpisodes2 <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
    ## drop empty
    theOutput <- theOutput[-which(theOutput$ID == 0), ]
    theOutput
}
```
Using the test dataset provide in object testData, I get:
```
> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
   user  system elapsed 
  1.091   0.000   1.131 
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
   user  system elapsed 
  0.506   0.004   0.517
```
a 50% speed up. Not dramatic but simple to achieve just by not growing an object at each iteration.
0 讨论(0)
发布评论:

提交评论
- 加载中...

情话喂你

2020-12-22 11:36

Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.

As a side note, I try to use [[ where applicable, and try to keep integers as integers by writing j <- j + 1L etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.

smoothingEpisodes3 <- function (theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    id <- as.integer(theData[["ID"]])
    start <- as.numeric(theData[["START"]])
    end <- as.numeric(theData[["END"]])

    curId <- id[[1L]]
    curStart <- start[[1L]]
    curEnd <- end[[1L]]

    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1L

    for(i in 2:nrow(theData)) {
        nextId <- id[[i]]
        nextStart <- start[[i]]
        nextEnd <- end[[i]]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[[j]] <- curId
            out.2[[j]] <- curStart
            out.3[[j]] <- curEnd

            j <- j + 1L

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }

    out.1[[j]] <- curId
    out.2[[j]] <- curStart
    out.3[[j]] <- curEnd

    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...

x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))

system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
all.equal( a1, a2 )

0 讨论(0)