I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to the data.frame and accumulates values in it.
The answers here are great. One minor aspect not covered is that the question states "My PC is still working (about 10h now) and I have no idea about the runtime". When developing, I always put the following code into loops to get a feel for how changes affect the speed, and to monitor how long the loop will take to complete.
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100, 2), "% \r")  # prints the percentage complete in real time
    # do stuff
  }
  return(temp)
}
Works with lapply as well.
dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100, 2), "% \r")
    # do stuff
  })
  return(temp)
}
If the function within the loop is quite fast but the number of iterations is large, consider printing only every so often, since printing to the console itself has an overhead, e.g.
dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100, 2), "% \r")  # prints every 100th time through the loop
    # do stuff
  }
  return(temp)
}
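A related alternative, not part of the original answer, is base R's utils::txtProgressBar(), which handles the carriage-return updating for you. A minimal sketch, assuming the same loop structure as above:

dayloop2 <- function(temp){
  pb <- txtProgressBar(min = 0, max = nrow(temp), style = 3)  # style 3 shows a percentage bar
  for (i in 1:nrow(temp)){
    setTxtProgressBar(pb, i)  # update the bar each iteration
    # do stuff
  }
  close(pb)
  return(temp)
}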
This could be made much faster by skipping the loop and using vectorised indexing or nested ifelse() statements:
idx <- 1:nrow(temp)
temp[,10] <- idx
# rows whose V6 and V3 match the previous row (the first row can never match)
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10]  # matching rows: V9 plus the previous row's column 10
temp[!idx1,10] <- temp[!idx1,9]                          # non-matching rows: just V9
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."
If you are using for loops, you are most likely coding R as if it were C or Java or something else. R code that is properly vectorised is extremely fast.
Take for example these simple bits of code that generate a vector of 100,000 integers in sequence:
The first code example shows how one would write the loop using a traditional programming paradigm. It takes 28 seconds to complete:
system.time({
  a <- NULL
  for(i in 1:1e5) a[i] <- i
})

   user  system elapsed
  28.36    0.07   28.61
You can get an almost 100-times improvement by the simple action of pre-allocating memory:
system.time({
  a <- rep(1, 1e5)
  for(i in 1:1e5) a[i] <- i
})

   user  system elapsed
   0.30    0.00    0.29
But using base R's vectorised colon operator :, the operation is virtually instantaneous:
system.time(a <- 1:1e5)
user system elapsed
0 0 0
I dislike rewriting code... Of course ifelse and lapply are better options, but sometimes it is difficult to make them fit. Frequently I use data.frames as one would use lists, such as df$var[i]. Here is a made-up example:
nrow <- function(x){ ## required because I use nrow at times
  if(class(x) == 'list') {
    length(x[[names(x)[1]]])
  } else {
    base::nrow(x)
  }
}
system.time({
  d = data.frame(seq = 1:10000, r = rnorm(10000))
  d$foo = d$r
  d$seq = 1:5
  mark = NA
  for(i in 1:nrow(d)){
    if(d$seq[i] == 1) mark = d$r[i]
    d$foo[i] = mark
  }
})
system.time({
  d = data.frame(seq = 1:10000, r = rnorm(10000))
  d$foo = d$r
  d$seq = 1:5
  d = as.list(d)  # become a list
  mark = NA
  for(i in 1:nrow(d)){
    if(d$seq[i] == 1) mark = d$r[i]
    d$foo[i] = mark
  }
  d = as.data.frame(d)  # revert back to a data.frame
})
data.frame version:
user system elapsed
0.53 0.00 0.53
list version:
user system elapsed
0.04 0.00 0.03
It is roughly 17 times faster to use a list of vectors than a data.frame.
Any comments on why internally data.frames are so slow in this regard? One would think they operate like lists...
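Part of the answer, hedged, lies in how base R evaluates the assignment: inside the loop, d$foo[i] = mark is effectively expanded to d <- `$<-`(d, "foo", `[<-`(d$foo, i, mark)). For a plain list the `$<-` step is a fast internal primitive, but for a data.frame it dispatches to an R-level S3 replacement method that strips the class, rebuilds the object and restores the class on every single iteration, and that replacement can trigger copies of the whole data.frame. You can inspect the method that runs each time:

`$<-.data.frame`   # the R-level replacement method dispatched for data.frames
`$<-`              # the internal primitive used once d is a plain list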
For even faster code, do class(d) = 'list' instead of d = as.list(d), and class(d) = 'data.frame' instead of d = as.data.frame(d):
system.time({
  d = data.frame(seq = 1:10000, r = rnorm(10000))
  d$foo = d$r
  d$seq = 1:5
  class(d) = 'list'
  mark = NA
  for(i in 1:nrow(d)){
    if(d$seq[i] == 1) mark = d$r[i]
    d$foo[i] = mark
  }
  class(d) = 'data.frame'
})
head(d)  # check the result
Take a look at the accumulate() function from {purrr}:
library(dplyr)
library(purrr)

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
    mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>%
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if(.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}
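For reference, a minimal usage sketch, assuming temp is a data.frame with the V3, V6 and V9 columns from the question: accumulate2() threads the running value through the rows, so the cumulative dependency on the previous V10 value is preserved rather than approximated with a simple lag.

res <- dayloop_accumulate(temp)
head(res)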
Processing with data.table is a viable option:
library(data.table)
library(microbenchmark)

n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")

dayloop2.dt <- function(df) {
  dt <- data.table(df)
  dt[, Kumm. := {
    res <- .I
    ifelse(res > 1,
           ifelse((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)),
                  res <- col9 + shift(res),
                  # else
                  res <- col9
           ),
           # else
           res <- col9
    )
  }]
  res <- data.frame(dt)
  return(res)
}

res <- dayloop2.dt(df)
m <- microbenchmark(dayloop2.dt(df), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
#dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042 10
Even if you ignore the possible gains from filtering on the condition first, it is very fast. Obviously, if you can do the calculation on a subset of the data, it helps.
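For completeness, here is a minimal, hedged sketch of what "doing the calculation on a subset" could look like with data.table. The col3/col6/col9 names are taken from the benchmark data above, dayloop2.subset is a hypothetical helper, and the logic mirrors the simple lag-based ifelse() version: precompute the lagged values once, give every row the default value, then update only the rows where the condition holds, by reference.

dayloop2.subset <- function(df) {
  dt <- data.table(df)
  dt[, prev9 := shift(col9, fill = 0)]           # previous row's col9
  dt[, cond  := col6 == shift(col6, fill = 0) &
                col3 == shift(col3, fill = 0)]   # does the row match the row above?
  dt[, Kumm. := col9]                            # default branch for every row
  dt[cond == TRUE, Kumm. := col9 + prev9]        # recompute only the matching subset
  dt[, c("prev9", "cond") := NULL]               # drop the helper columns
  data.frame(dt)
}

res2 <- dayloop2.subset(df)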