“smoothing” time data - can it be done more efficiently?

Asked by 感情败类 on 2020-12-22 11:06

I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).

Now I want all rows with the same ID to be merged ("smoothed") into a single episode whenever their date ranges overlap or are adjacent, i.e. whenever a row starts at most one day after the previous row ends; the merged row should keep the earliest start and the latest end date.
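
For illustration (these example values are mine, not from the original question), input rows whose ranges touch should collapse into one episode:

    ## hypothetical input: rows 1 and 2 for ID 1 touch
    ## (row 2 starts one day after row 1 ends), row 3 stands alone
    input <- data.frame(
        ID    = c(1L, 1L, 1L),
        START = as.Date(c("2000-01-01", "2000-03-01", "2000-06-01")),
        END   = as.Date(c("2000-02-29", "2000-04-15", "2000-06-30"))
    )
    ## desired result:
    ##   ID      START        END
    ##    1 2000-01-01 2000-04-15
    ##    1 2000-06-01 2000-06-30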

3 Answers
  • 2020-12-22 11:22

I did it slightly differently, to avoid having to delete empty rows at the end:

    smoothingEpisodes <- function (theData) {
        ## seed the current episode with the first row
        curId <- theData[1, "ID"]
        curStart <- theData[1, "START"]
        curEnd <- theData[1, "END"]

        theLength <- nrow(theData)

        ## preallocate the output vectors: at most one episode per input row
        out.1 <- integer(length = theLength)
        out.2 <- out.3 <- numeric(length = theLength)

        j <- 1

        for(i in 2:theLength) {
            nextId <- theData[i, "ID"]
            nextStart <- theData[i, "START"]
            nextEnd <- theData[i, "END"]

            if (curId != nextId | (curEnd + 1) < nextStart) {
                ## new ID, or a gap of more than one day:
                ## close the current episode and start a new one
                out.1[j] <- curId
                out.2[j] <- curStart
                out.3[j] <- curEnd

                j <- j + 1

                curId <- nextId
                curStart <- nextStart
                curEnd <- nextEnd
            } else {
                ## overlapping or adjacent: extend the current episode
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }

        ## close the final episode
        out.1[j] <- curId
        out.2[j] <- curStart
        out.3[j] <- curEnd

        ## only the first j entries were used, so there are no empty rows to drop
        theOutput <- data.frame(ID = out.1[1:j],
                                START = as.Date(out.2[1:j], origin = "1970-01-01"),
                                END = as.Date(out.3[1:j], origin = "1970-01-01"))

        theOutput
    }
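
    Called on the hypothetical input sketched in the question illustration above (my example data, not part of this answer), it collapses the two touching episodes as intended:

    smoothingEpisodes(input)
    ##   ID      START        END
    ## 1  1 2000-01-01 2000-04-15
    ## 2  1 2000-06-01 2000-06-30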
    

    Quite a big improvement over my original version!

  • The first optimisation I would suggest [without really thinking too hard about what you are trying to do] is to allocate storage for theOutput. At the moment, you are growing theOutput at each iteration of the loop. In R that is an absolute no-no! It is something you never do, unless you like woefully slow code: R has to copy the object and expand it during each iteration, and that is slow.

    Looking at the code, we know that theOutput needs to have nrow(theData) - 1 rows, and 3 columns. So create that before the loop starts:

    theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData) - 1))
    

    then fill in this object during the loop:

    theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)
    

    for example.

    It isn't clear what START and END are. If they are numerics, then working with a matrix rather than a data frame could also improve speed.
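
    As a rough sketch of that idea (my illustration, not code from the question; it assumes ID, START and END can all be held as numerics, Dates being stored as days since 1970-01-01 underneath):

    ## preallocate a numeric matrix once, fill rows by index in the loop,
    ## and convert back to a data frame a single time at the end
    out <- matrix(NA_real_, nrow = nrow(theData), ncol = 3,
                  dimnames = list(NULL, c("ID", "START", "END")))
    ## inside the loop:  out[i, ] <- c(curId, curStart, curEnd)
    ## after the loop:   theOutput <- as.data.frame(out)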

    Also, creating a data frame each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could just fill in the bits you want directly, without incurring the data.frame() call during each iteration:

    theOutput[i, "ID"] <- curId
    theOutput[i, "START"] <- curStart
    theOutput[i, "END"] <- curEnd
    

    The best tip I can give you, however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data, one just large enough to give you some run-time to gather useful profiling data, without having to wait ages for the profiling run to complete. To profile in R, use Rprof():

    Rprof(filename = "my_fun_profile.Rprof")
    ## run your function call here on a subset of the data
    Rprof(NULL)
    

    Then you can look at the output using

    summaryRprof("my_fun_profile.Rprof")
    

    Hadley Wickham (@hadley) has a package, profr, that makes this a bit easier. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.
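
    For instance, a minimal sketch with profr (assuming the package is installed; check its documentation for the exact interface):

    library(profr)
    p <- profr(smoothingEpisodes(testData))  # wraps the Rprof() calls for you
    head(p)  # a data frame of timings
    plot(p)  # time on the x-axis, call depth on the y-axis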

    Edit: as the OP provided some test data I knocked up something quick to show the speed-up achieved by just following good loop practice:

    smoothingEpisodes2 <- function (theData) {
        curId <- theData[1, "ID"]
        curStart <- theData[1, "START"]
        curEnd <- theData[1, "END"]
        nr <- nrow(theData)
        ## preallocate the output vectors instead of growing them
        out1 <- integer(length = nr)
        out2 <- out3 <- numeric(length = nr)
        for(i in 2:nr) {
            nextId <- theData[i, "ID"]
            nextStart <- theData[i, "START"]
            nextEnd <- theData[i, "END"]
            if (curId != nextId | (curEnd + 1) < nextStart) {
                out1[i-1] <- curId
                out2[i-1] <- curStart
                out3[i-1] <- curEnd
                curId <- nextId
                curStart <- nextStart
                curEnd <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }
        ## the loop index i (here equal to nr) persists after the loop
        out1[i] <- curId
        out2[i] <- curStart
        out3[i] <- curEnd
        theOutput <- data.frame(ID = out1,
                                START = as.Date(out2, origin = "1970-01-01"),
                                END = as.Date(out3, origin = "1970-01-01"))
        ## drop the unused (ID == 0) rows; logical indexing is safe even when
        ## there are none, unlike -which(), which would then drop every row
        theOutput <- theOutput[theOutput$ID != 0, ]
        theOutput
    }
    

    Using the test dataset provided in object testData, I get:

    > res1 <- smoothingEpisodes(testData)
    > system.time(replicate(100, smoothingEpisodes(testData)))
       user  system elapsed 
      1.091   0.000   1.131 
    > res2 <- smoothingEpisodes2(testData)
    > system.time(replicate(100, smoothingEpisodes2(testData)))
       user  system elapsed 
      0.506   0.004   0.517
    

    A 50% speed-up: not dramatic, but simple to achieve just by not growing an object at each iteration.
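
    To see the cost of growing an object in isolation, here is a toy comparison of my own (not from the original answer):

    grow <- function(n) {
        x <- numeric(0)
        for (i in seq_len(n)) x[i] <- i  # extends x on every iteration
        x
    }
    prealloc <- function(n) {
        x <- numeric(n)                  # allocate once up front
        for (i in seq_len(n)) x[i] <- i
        x
    }
    system.time(grow(1e5))      # typically slower...
    system.time(prealloc(1e5))  # ...than writing into preallocated storage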

  • 2020-12-22 11:36

    Marcel, I thought I'd just try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds)... The trick is to first extract the three columns to integer and double vectors.

    As a side note, I try to use [[ where applicable, and try to keep integers as integers by writing j <- j + 1L etc. That does not make any difference here, but sometimes coercing between integers and doubles can take quite some time.
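
    A quick illustration of both points (my example, not from the answer):

    class(1)       # "numeric": a double literal
    class(1L)      # "integer"
    class(1L + 1)  # "numeric": mixing the two forces a coercion to double

    df <- data.frame(a = 1:3)
    df["a"]    # a one-column data frame
    df[["a"]]  # the underlying integer vector, no data-frame overhead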

    smoothingEpisodes3 <- function (theData) {
        theLength <- nrow(theData)
        if (theLength < 2L) return(theData)

        ## extract the columns once: plain vector indexing is far cheaper
        ## than repeated data-frame indexing inside the loop
        id <- as.integer(theData[["ID"]])
        start <- as.numeric(theData[["START"]])
        end <- as.numeric(theData[["END"]])

        curId <- id[[1L]]
        curStart <- start[[1L]]
        curEnd <- end[[1L]]

        out.1 <- integer(length = theLength)
        out.2 <- out.3 <- numeric(length = theLength)

        j <- 1L

        for(i in 2:theLength) {
            nextId <- id[[i]]
            nextStart <- start[[i]]
            nextEnd <- end[[i]]

            if (curId != nextId | (curEnd + 1) < nextStart) {
                out.1[[j]] <- curId
                out.2[[j]] <- curStart
                out.3[[j]] <- curEnd

                j <- j + 1L

                curId <- nextId
                curStart <- nextStart
                curEnd <- nextEnd
            } else {
                curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
            }
        }

        out.1[[j]] <- curId
        out.2[[j]] <- curStart
        out.3[[j]] <- curEnd

        theOutput <- data.frame(ID = out.1[1:j],
                                START = as.Date(out.2[1:j], origin = "1970-01-01"),
                                END = as.Date(out.3[1:j], origin = "1970-01-01"))

        theOutput
    }
    

    Then, the following code will show the speed difference. I just took your data and replicated it 1000 times...

    x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
                        START = structure(c(10957, 11048, 11062, 11201, 10971,
                                            10988, 11048, 11109, 11139),
                                          class = "Date"),
                        END = structure(c(11047, 11108, 11169, 11261, 11047,
                                          11031, 11062, 11123, 11153),
                                        class = "Date")),
                   .Names = c("ID", "START", "END"),
                   class = "data.frame", row.names = c(NA, 9L))

    r <- 1000
    y <- data.frame(ID = rep(x$ID, r) + rep(1:r, each = nrow(x)) - 1,
                    START = rep(x$START, r),
                    END = rep(x$END, r))
    
    system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
    system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
    all.equal( a1, a2 )
    