I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates values in it.
The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10h now) and I have no idea about the runtime". When developing, I always put the following code into loops to get a feel for how changes affect the speed, and to monitor how long the loop will take to complete.
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100, 2), "% \r")  # prints the percentage complete in real time
    # do stuff
  }
  return(temp)
}
Works with lapply as well.
dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100, 2), "% \r")
    # do stuff
  })
  return(temp)
}
If the function within the loop is quite fast but the number of iterations is large, consider printing only every so often, since printing to the console itself has an overhead, e.g.
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100, 2), "% \r")  # prints every 100th time through the loop
    # do stuff
  }
  return(temp)
}
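A related alternative, not part of the original answer, is base R's utils::txtProgressBar(), which handles the carriage-return updating for you. A minimal sketch, assuming the same loop structure as above:

dayloop2 <- function(temp){
  pb <- txtProgressBar(min = 0, max = nrow(temp), style = 3)  # style 3 shows a percentage bar
  for (i in 1:nrow(temp)){
    setTxtProgressBar(pb, i)  # update the bar each iteration
    # do stuff
  }
  close(pb)
  return(temp)
}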
This could be made much faster by skipping the loop and using vectorised indexing or nested ifelse() statements:
idx <- 1:nrow(temp)
temp[,10] <- idx
# rows whose V6 and V3 match the previous row (the first row can never match)
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10]  # matching rows: V9 plus the previous row's column 10
temp[!idx1,10] <- temp[!idx1,9]                          # non-matching rows: just V9
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
If you are using for loops, you are most likely coding R as if it were C or Java or something else. R code that is properly vectorised is extremely fast.
Take for example these simple bits of code that generate a vector of 100,000 integers in sequence:
The first code example shows how one would write the loop using a traditional programming paradigm. It takes 28 seconds to complete:
system.time({
  a <- NULL
  for(i in 1:1e5) a[i] <- i
})

   user  system elapsed
  28.36    0.07   28.61
You can get an almost 100-times improvement by the simple action of pre-allocating memory:
system.time({
  a <- rep(1, 1e5)
  for(i in 1:1e5) a[i] <- i
})

   user  system elapsed
   0.30    0.00    0.29
But using base R's vectorised colon operator :, the operation is virtually instantaneous:
system.time(a <- 1:1e5)
user system elapsed
0 0 0
I dislike rewriting code... Of course ifelse and lapply are better options, but sometimes it is difficult to make them fit. Frequently I use data.frames as one would use lists, such as df$var[i]. Here is a made-up example:
nrow <- function(x){ ## required because I use nrow at times
  if(class(x) == 'list') {
    length(x[[names(x)[1]]])
  } else {
    base::nrow(x)
  }
}
system.time({
  d = data.frame(seq = 1:10000, r = rnorm(10000))
  d$foo = d$r
  d$seq = 1:5
  mark = NA
  for(i in 1:nrow(d)){
    if(d$seq[i] == 1) mark = d$r[i]
    d$foo[i] = mark
  }
})
system.time({
  d = data.frame(seq = 1:10000, r = rnorm(10000))
  d$foo = d$r
  d$seq = 1:5
  d = as.list(d)  # become a list
  mark = NA
  for(i in 1:nrow(d)){
    if(d$seq[i] == 1) mark = d$r[i]
    d$foo[i] = mark
  }
  d = as.data.frame(d)  # revert back to a data.frame
})
data.frame version:
user system elapsed
0.53 0.00 0.53
list version:
user system elapsed
0.04 0.00 0.03
It is roughly 17 times faster to use a list of vectors than a data.frame.
Any comments on why internally data.frames are so slow in this regard? One would think they operate like lists...
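Part of the answer, hedged, lies in how base R evaluates the assignment: inside the loop, d$foo[i] = mark is effectively expanded to d <- `$<-`(d, "foo", `[<-`(d$foo, i, mark)). For a plain list the `$<-` step is a fast internal primitive, but for a data.frame it dispatches to an R-level S3 replacement method that strips the class, rebuilds the object and restores the class on every single iteration, and that replacement can trigger copies of the whole data.frame. You can inspect the method that runs each time:

`$<-.data.frame`   # the R-level replacement method dispatched for data.frames
`$<-`              # the internal primitive used once d is a plain list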
For even faster code, do class(d) = 'list' instead of d = as.list(d), and class(d) = 'data.frame' instead of d = as.data.frame(d):
system.time({
  d = data.frame(seq = 1:10000, r = rnorm(10000))
  d$foo = d$r
  d$seq = 1:5
  class(d) = 'list'
  mark = NA
  for(i in 1:nrow(d)){
    if(d$seq[i] == 1) mark = d$r[i]
    d$foo[i] = mark
  }
  class(d) = 'data.frame'
})
head(d)  # check the result
Take a look at the accumulate() function from {purrr}:
library(dplyr)
library(purrr)

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
    mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>%
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if(.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}
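For reference, a minimal usage sketch, assuming temp is a data.frame with the V3, V6 and V9 columns from the question: accumulate2() threads the running value through the rows, so the cumulative dependency on the previous V10 value is preserved rather than approximated with a simple lag.

res <- dayloop_accumulate(temp)
head(res)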
Processing with data.table is a viable option:
library(data.table)
library(microbenchmark)

n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")

dayloop2.dt <- function(df) {
  dt <- data.table(df)
  dt[, Kumm. := {
    res <- .I
    ifelse(res > 1,
           ifelse((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)),
                  res <- col9 + shift(res),
                  # else
                  res <- col9
           ),
           # else
           res <- col9
    )
  }]
  res <- data.frame(dt)
  return(res)
}

res <- dayloop2.dt(df)
m <- microbenchmark(dayloop2.dt(df), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042 10
Even if you ignore the possible gains from filtering on the condition first, it is very fast. Obviously, if you can do the calculation on a subset of the data, it helps.
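For completeness, here is a minimal, hedged sketch of what "doing the calculation on a subset" could look like with data.table. The col3/col6/col9 names are taken from the benchmark data above, dayloop2.subset is a hypothetical helper, and the logic mirrors the simple lag-based ifelse() version: precompute the lagged values once, give every row the default value, then update only the rows where the condition holds, by reference.

dayloop2.subset <- function(df) {
  dt <- data.table(df)
  dt[, prev9 := shift(col9, fill = 0)]           # previous row's col9
  dt[, cond  := col6 == shift(col6, fill = 0) &
                col3 == shift(col3, fill = 0)]   # does the row match the row above?
  dt[, Kumm. := col9]                            # default branch for every row
  dt[cond == TRUE, Kumm. := col9 + prev9]        # recompute only the matching subset
  dt[, c("prev9", "cond") := NULL]               # drop the helper columns
  data.frame(dt)
}

res2 <- dayloop2.subset(df)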