I have a big performance problem in R. I wrote a function that iterates over a data.frame object. It simply adds a new column to a data.frame and a
General strategies for speeding up R code
First, figure out where the slow part really is. There's no need to optimize code that isn't running slowly. For small amounts of code, simply thinking through it can work. If that fails, Rprof and similar profiling tools can be helpful.
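As a rough sketch of what that profiling step can look like, the snippet below times a deliberately slow function with system.time and then samples it with Rprof. The function slow_fill is a hypothetical stand-in for your own code, not something from the question.

```r
# Hypothetical slow function: grows a data.frame one row at a time.
slow_fill <- function(n) {
  out <- data.frame(i = integer(0), sq = numeric(0))
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(i = i, sq = i^2))
  }
  out
}

# Coarse wall-clock timing of the whole call.
elapsed <- system.time(res <- slow_fill(2000))["elapsed"]

# Sampling profiler: write samples to a temporary file, then summarise them.
prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)
res <- slow_fill(2000)
Rprof(NULL)  # stop profiling
prof <- summaryRprof(prof_file)

# Functions ranked by the time spent in their own code.
head(prof$by.self)
```

The by.self table is usually the quickest way to see which calls dominate; here it should point at the rbind machinery rather than the arithmetic.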
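Once you have found the bottleneck, think about more efficient algorithms for doing what you want. Calculations should be run only once if possible, so store results instead of recomputing them and move work that does not depend on the loop out of the loop.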
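A minimal illustration of the "run it only once" idea, assuming a toy vector x that is not from the question: the first loop recomputes a value that never changes, the second computes it once up front.

```r
x <- runif(5000)

# Recomputes max(x) on every iteration even though it never changes.
scaled_slow <- numeric(length(x))
for (i in seq_along(x)) {
  scaled_slow[i] <- x[i] / max(x)
}

# Computes the loop-invariant value once and reuses it.
x_max <- max(x)
scaled_fast <- numeric(length(x))
for (i in seq_along(x)) {
  scaled_fast[i] <- x[i] / x_max
}
```

Both loops give the same result; only the amount of repeated work differs.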
Using more efficient functions can produce moderate or large speed gains. For instance, paste0 produces a small efficiency gain but .colSums() and its relatives produce somewhat more pronounced gains. mean is particularly slow.
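To make those substitutions concrete, here is a small sketch on made-up data. It only shows that the faster variants return the same values; how much time they save depends on your data and R version, so time them yourself.

```r
m <- matrix(runif(1e6), nrow = 1000, ncol = 1000)

# colSums() validates its input before calling the internal routine.
a <- colSums(m)

# .colSums() skips those checks; the caller must supply the dimensions.
b <- .colSums(m, 1000, 1000)

x <- runif(1e6)

# mean() versus the plain arithmetic it is built around.
m1 <- mean(x)
m2 <- sum(x) / length(x)

# paste0() is paste() with sep = "" already fixed.
p1 <- paste("col", 1:3, sep = "")
p2 <- paste0("col", 1:3)
```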
Then you can avoid some particularly common troubles:
Growing an object with cbind inside a loop will slow you down really quickly. Try for better vectorization, which can often but not always help. In this regard, inherently vectorized commands like ifelse, diff, and the like will provide more improvement than the apply family of commands (which provide little to no speed boost over a well-written loop).
You can also try to provide more information to R functions. For instance, use vapply rather than sapply, and specify colClasses when reading in text-based data. Speed gains will be variable depending on how much guessing you eliminate.
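For example, a short sketch with invented inputs: vapply is told the type and length of each result through FUN.VALUE, and read.csv is told the column types through colClasses so it does not have to guess them.

```r
# vapply() knows in advance that each result is a single integer.
pieces <- list(1:3, 1:5, 1:2)
lens <- vapply(pieces, length, FUN.VALUE = integer(1))

# read.csv() skips type guessing when colClasses is supplied.
# The inline text stands in for a real file path.
csv_text <- "id,name,score\n1,ann,3.5\n2,bob,4.0\n"
df <- read.csv(text = csv_text,
               colClasses = c("integer", "character", "numeric"))
str(df)
```

A side benefit of vapply is that it fails loudly if a result does not match FUN.VALUE, instead of silently returning a list the way sapply can.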
Next, consider optimized packages: The data.table package can produce massive speed gains where its use is possible, in data manipulation and in reading large amounts of data (fread).
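A small sketch of what that can look like, assuming the data.table package is installed. The column names g and x are invented for illustration, and the inline text stands in for a large file on disk.

```r
library(data.table)

# fread() parses delimited text quickly; normally you would pass a file path.
dt <- fread(text = "g,x\na,1\na,2\nb,3\nb,4\n")

# Add a column by reference, computed per group, without copying the table.
dt[, running := cumsum(x), by = g]

print(dt)
```

The := operator modifies dt in place, which is where much of the gain over repeatedly rebuilding a data.frame comes from.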
Next, try for speed gains through more efficient means of calling R:
Use the Ra and jit packages in concert for just-in-time compilation (Dirk has an example in this presentation).
And lastly, if all of the above still doesn't get you quite as fast as you need, you may need to move to a faster language for the slow code snippet. The combination of Rcpp and inline here makes replacing only the slowest part of the algorithm with C++ code particularly easy. Here, for instance, is my first attempt at doing so, and it blows away even highly optimized R solutions.
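As a hedged sketch of that last step, the snippet below moves one small loop into C++. It uses Rcpp::cppFunction, a convenience wrapper in Rcpp that I am substituting for the inline package named above, and it assumes Rcpp and a working C++ toolchain are installed. The function running_sum is a made-up example, not the linked attempt.

```r
library(Rcpp)

# Compile a small C++ function and expose it to R under the same name.
cppFunction('
NumericVector running_sum(NumericVector x) {
  int n = x.size();
  NumericVector out(n);
  double total = 0.0;
  for (int i = 0; i < n; ++i) {
    total += x[i];   // accumulate as we go
    out[i] = total;  // store the running total
  }
  return out;
}')

x <- c(1, 2, 3, 4)
running_sum(x)
```

Only the hot loop moves to C++; the rest of the analysis stays in ordinary R and calls running_sum like any other function.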
If you're still left with troubles after all this, you just need more computing power. Look into parallelization (http://cran.r-project.org/web/views/HighPerformanceComputing.html) or even GPU-based solutions (gpu-tools).
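One simple way to try parallelization is the parallel package that ships with R. The sketch below assumes two local worker processes are available; slow_square is a hypothetical stand-in for an expensive, independent computation.

```r
library(parallel)

# Stand-in for an expensive task that does not depend on other iterations.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# Start two local worker processes and spread the work across them.
cl <- makeCluster(2)
par_res <- parLapply(cl, 1:20, slow_square)
stopCluster(cl)

# Sequential version for comparison.
seq_res <- lapply(1:20, slow_square)
```

This only pays off when each task is expensive relative to the cost of shipping data to the workers, so it is worth timing both versions on realistic input.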
Links to other guidance