Speed up the loop operation in R

说谎 2020-11-22 00:04

I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates values: whenever a row's columns 3 and 6 match the previous row, the new column gets column 9 plus the previous row's accumulated value, and otherwise just column 9. My PC has been working on it for about 10 hours now and I have no idea about the runtime.

10 Answers
  • 2020-11-22 00:10

    The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10h now) and I have no idea about the runtime". I always put the following code into loops when developing, to get a feel for how changes affect the speed and to monitor how long the loop will take to complete.

    dayloop2 <- function(temp){
      for (i in 1:nrow(temp)){
        cat(round(i/nrow(temp)*100, 2), "%    \r") # prints the percentage complete in real time
        # do stuff
      }
      return(temp)
    }
    

    Works with lapply as well.

    dayloop2 <- function(temp){
      temp <- lapply(1:nrow(temp), function(i) {
        cat(round(i/nrow(temp)*100,2),"%    \r")
        #do stuff
      })
      return(temp)
    }
    

    If the function within the loop is quite fast but the number of iterations is large, then consider printing only every so often, as printing to the console itself has overhead, e.g.:

    dayloop2 <- function(temp){
      for (i in 1:nrow(temp)){
        if(i %% 100 == 0) cat(round(i/nrow(temp)*100,2),"%    \r") # prints every 100 times through the loop
        # do stuff
      }
      return(temp)
    }
    
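    Alternatively, base R's txtProgressBar() (from the utils package, available in any standard R session) gives the same feedback with less manual bookkeeping. A minimal sketch, assuming the same loop shape as above:

    dayloop2 <- function(temp){
      pb <- txtProgressBar(min = 0, max = nrow(temp), style = 3) # draw a console progress bar
      for (i in 1:nrow(temp)){
        setTxtProgressBar(pb, i) # advance the bar to iteration i
        # do stuff
      }
      close(pb)
      return(temp)
    }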
  • 2020-11-22 00:11

    This could be made much faster by skipping the loop and using indexing or nested ifelse() statements (a fully cumulative variant is sketched after the code):

    idx <- 1:nrow(temp)
    temp[, 10] <- idx
    # rows whose columns 6 and 3 both match the previous row
    idx1 <- c(FALSE, (temp[-nrow(temp), 6] == temp[-1, 6]) & (temp[-nrow(temp), 3] == temp[-1, 3]))
    temp[idx1, 10]  <- temp[idx1, 9] + temp[which(idx1) - 1, 10] # matching rows: column 9 plus the previous row's value
    temp[!idx1, 10] <- temp[!idx1, 9]                            # non-matching rows: just column 9
    temp[1, 10] <- temp[1, 9]
    names(temp)[names(temp) == "V10"] <- "Kumm."
    
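    Note that because each right-hand side is evaluated before its assignment, this version does not chain the sum across longer runs of matching rows. A fully cumulative variant (my own sketch, assuming the goal is a running sum of column 9 within each run of consecutive matching rows) can use run ids and a grouped cumulative sum:

    idx1 <- c(FALSE, (temp[-nrow(temp), 6] == temp[-1, 6]) & (temp[-nrow(temp), 3] == temp[-1, 3]))
    grp  <- cumsum(!idx1)                            # a new run id wherever the match breaks
    temp$Kumm. <- ave(temp[, 9], grp, FUN = cumsum)  # cumulative sum of column 9 within each run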
  • 2020-11-22 00:15

    If you are using for loops, you are most likely coding R as if it were C or Java or something else. R code that is properly vectorised is extremely fast.

    Take for example these two simple bits of code that generate a vector of 100,000 integers in sequence:

    The first code example shows how one would code a loop using a traditional programming paradigm. It takes 28 seconds to complete:

    system.time({
        a <- NULL
        for(i in 1:1e5) a[i] <- i
    })
    
       user  system elapsed 
      28.36    0.07   28.61 
    

    You can get an almost 100-fold improvement simply by pre-allocating memory:

    system.time({
        a <- rep(1, 1e5)
        for(i in 1:1e5) a[i] <- i
    })
    
       user  system elapsed 
       0.30    0.00    0.29 
    

    But using base R's vectorised colon operator :, this operation is virtually instantaneous:

    system.time(a <- 1:1e5)
    
       user  system elapsed 
          0       0       0 
    
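    The same principle extends beyond sequence generation. Here is a sketch of my own comparing a preallocated loop against the equivalent vectorised arithmetic (exact timings will vary by machine):

    x <- runif(1e5)
    system.time({
        out <- numeric(length(x))                # preallocate the result
        for (i in seq_along(x)) out[i] <- x[i]^2
    })
    system.time(out2 <- x^2)                     # one call into compiled code
    identical(out, out2)                         # TRUE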
  • 2020-11-22 00:17

    I dislike rewriting code. Of course, ifelse and lapply are better options, but sometimes they are difficult to fit in.

    Frequently I use data.frames as one would use lists, e.g. df$var[i].

    Here is a made-up example:

    nrow = function(x){ ## needed because base::nrow() returns NULL for a plain list
      if (inherits(x, 'list')) {
        length(x[[1]])   # use the length of the first column as the row count
      } else {
        base::nrow(x)
      }
    }
    
    system.time({
      d=data.frame(seq=1:10000,r=rnorm(10000))
      d$foo=d$r
      d$seq=1:5    # recycled to length 10000: every 5th row starts a new run
      mark=NA
      for(i in 1:nrow(d)){
        if(d$seq[i]==1) mark=d$r[i]   # remember r at the start of each run
        d$foo[i]=mark                 # carry it forward
      }
    })
    
    system.time({
      d=data.frame(seq=1:10000,r=rnorm(10000))
      d$foo=d$r
      d$seq=1:5
      d=as.list(d) #become a list
      mark=NA
      for(i in 1:nrow(d)){
        if(d$seq[i]==1) mark=d$r[i]
        d$foo[i]=mark
      }
      d=as.data.frame(d) #revert back to data.frame
    })
    

    data.frame version:

       user  system elapsed 
       0.53    0.00    0.53
    

    list version:

       user  system elapsed 
       0.04    0.00    0.03 
    

    It is about 17 times faster to use a list of vectors than a data.frame.

    Any comments on why internally data.frames are so slow in this regard? One would think they operate like lists...

    For even faster code, set class(d)='list' instead of d=as.list(d), and class(d)='data.frame' instead of d=as.data.frame(d):

    system.time({
      d=data.frame(seq=1:10000,r=rnorm(10000))
      d$foo=d$r
      d$seq=1:5
      class(d)='list'
      mark=NA
      for(i in 1:nrow(d)){
        if(d$seq[i]==1) mark=d$r[i]
        d$foo[i]=mark
      }
      class(d)='data.frame'
    })
    head(d)
    
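    As to the why: much of the cost comes from the data.frame replacement methods, which can duplicate the whole object on every d$foo[i]= assignment, while a plain list is modified by the fast internal default methods. A minimal sketch of my own using base R's tracemem() to watch the copies (exact duplication counts vary across R versions):

    d <- data.frame(x = 1:3)
    tracemem(d)     # report whenever d gets duplicated
    d$x[2] <- 99L   # dispatches to `$<-.data.frame`, typically copying the object
    untracemem(d)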
  • 2020-11-22 00:18

    Take a look at the accumulate2() function from {purrr}:

    library(dplyr)   # for %>%, mutate(), lag(), select(); as_tibble() is re-exported from tibble
    library(purrr)
    
    dayloop_accumulate <- function(temp) {
      temp %>%
        as_tibble() %>%
        mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
        mutate(V10 = V9 %>%
                 purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
                   if (.y) {
                     .i_1 + .i   # row matches the previous one: keep accumulating
                   } else {
                     .i          # the run breaks: restart the sum at V9
                   }
                 }) %>% unlist()) %>%
        select(-cond)
    }
    
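    Hypothetical usage on a tiny data.frame with the column names the function expects (V3, V6, V9); rows 1 and 2 match on both keys, so the sum accumulates there and restarts at row 3:

    temp <- data.frame(V3 = c(1, 1, 2), V6 = c(5, 5, 5), V9 = c(10, 20, 30))
    dayloop_accumulate(temp)
    # V10 comes out as 10, 30, 30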
  • 2020-11-22 00:23

    Processing with data.table is a viable option:

    n <- 1000000
    df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9)) # 1e6 rows, 9 columns of integers in 1:10
    colnames(df) <- paste("col", 1:9, sep = "")
    
    library(data.table)
    
    dayloop2.dt <- function(df) {
      dt <- data.table(df)
      dt[, Kumm. := {
        res <- .I                        # start from the row index
        ifelse (res > 1,
          ifelse ((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)),
            res <- col9 + shift(res)     # row matches the previous one
          , # else
            res <- col9
          )
         , # else
          res <- col9                    # first row
        )
      }
      ,]
      res <- data.frame(dt)
      return (res)
    }
    
    res <- dayloop2.dt(df)
    
    library(microbenchmark)
    m <- microbenchmark(dayloop2.dt(df), times = 10)
    #Unit: milliseconds
    #       expr      min        lq     mean   median       uq      max neval
    #dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042    10
    

    Even if you ignore the possible gains from filtering on the condition first, it is very fast. Obviously, if you can do the calculation on a subset of the data, it helps.

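    An alternative sketch of my own (not part of the answer above), assuming the goal is a running sum of col9 within runs of rows whose col3 and col6 match the previous row; it expresses the whole computation as a grouped cumulative sum:

    dayloop2.dt2 <- function(df) {
      dt <- as.data.table(df)
      # TRUE wherever a new run starts (col3/col6 no longer match the previous row)
      brk <- !(dt$col6 == shift(dt$col6, fill = dt$col6[1]) &
               dt$col3 == shift(dt$col3, fill = dt$col3[1]))
      brk[1] <- TRUE                           # the first row always opens a run
      dt[, grp := cumsum(brk)]                 # run id per group of matching rows
      dt[, Kumm. := cumsum(col9), by = grp][, grp := NULL]
      as.data.frame(dt)
    }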