I have a big performance problem in R. I wrote a function that iterates over a data.frame
object. It simply adds a new column to a data.frame
and a
Processing with data.table is a viable option:
library(data.table)
library(microbenchmark)

# Test data: one million rows, nine integer columns named col1..col9
n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")

dayloop2.dt <- function(df) {
  dt <- data.table(df)
  # Add the column Kumm. by reference, vectorised over all rows at once
  dt[, Kumm. := {
    res <- .I  # row indices
    ifelse(res > 1,
           # from row 2 on: compare col6 and col3 with the previous row
           ifelse((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)),
                  res <- col9 + shift(res),
                  # else
                  res <- col9),
           # else (first row)
           res <- col9)
  }]
  res <- data.frame(dt)
  return(res)
}

res <- dayloop2.dt(df)
m <- microbenchmark(dayloop2.dt(df), times = 100)
# Unit: milliseconds
#            expr      min        lq     mean   median       uq      max neval
# dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042    10
Even if you ignore the possible gains from filtering on conditions, this is very fast. Obviously, if you can restrict the calculation to a subset of the data, that helps further.
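
As a rough sketch of the subsetting idea, data.table lets you put a filter in `i` so that the `j` expression is evaluated only on the matching rows. The condition `col1 > 5` and the column name `subset_sum` below are made up for illustration and are not part of the original problem; the point is only that rows outside the filter are left untouched (they end up as `NA` in the new column).

```r
library(data.table)

set.seed(1)  # fixed seed so the illustration is reproducible
n <- 1000    # small table, enough to show the idea
df <- as.data.frame(matrix(sample(1:10, n * 9, TRUE), n, 9))
colnames(df) <- paste0("col", 1:9)
dt <- as.data.table(df)

# Hypothetical condition: only rows with col1 > 5 need the new column.
# The filter in `i` restricts the work done in `j` to the matching rows;
# `subset_sum` is an assumed name, and all other rows receive NA.
dt[col1 > 5, subset_sum := col9 + col3]

# Quick look at both groups of rows
head(dt[col1 > 5, .(col1, col3, col9, subset_sum)])
head(dt[col1 <= 5, .(col1, col3, col9, subset_sum)])
```

Whether this pays off depends on how selective the condition is: the fewer rows match, the less work `j` has to do.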