What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list].
I really appreciate some of the answers above. Following @hadley and @Dirk, who suggest closing R, issuing source, and working from the command line, I came up with a solution that worked very well for me. I had to deal with hundreds of mass spectra, each occupying around 20 MB of memory, so I used two R scripts, as follows:
First a wrapper:
```r
#!/usr/bin/Rscript --vanilla --default-packages=utils
# fdir and fds (the directories and files to process) are assumed to be defined above
for (l in seq_along(fdir)) {
  for (k in seq_along(fds)) {
    # each call starts a fresh R process, so its memory is released when it exits
    system(paste("Rscript runConsensus.r", l, k))
  }
}
```
With this script I basically control what my main script, runConsensus.r, does, and I write the results to the output. Because the wrapper calls the script in a new process each time, R is effectively reopened for every call and the memory is freed afterwards.
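For reference, a minimal sketch of what a worker script like runConsensus.r could look like (hypothetical; the original script is not shown here), reading the two indices passed by the wrapper:

```r
#!/usr/bin/Rscript --vanilla
# Hypothetical sketch: the wrapper passes two indices selecting the
# directory and the file (spectrum) to process in this run.
args <- commandArgs(trailingOnly = TRUE)
l <- as.integer(args[1])
k <- as.integer(args[2])
# ... read spectrum k from directory l, compute the consensus ...
# ... write the result to disk; when the process exits, its memory is freed ...
```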
Hope it helps.
I use the data.table package. With its `:=` operator you can add columns by reference, modify subsets of existing columns by reference (and by group by reference), and delete columns by reference. None of these operations copy the (potentially large) data.table at all, not even once, so data.table uses much less working memory.
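A minimal sketch of these by-reference operations (object and column names are made up for illustration):

```r
library(data.table)
dt <- data.table(id = 1:5, x = rnorm(5))

dt[, y := x * 2]      # add a column by reference (no copy of dt)
dt[id > 3, x := 0]    # modify a subset of an existing column by reference
dt[, y := NULL]       # delete a column by reference
```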
If you are working on Linux, want to use several processes, and only need to do read operations on one or more large objects, use makeForkCluster instead of makePSOCKcluster. This also saves you the time spent sending the large object to the other processes.
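A rough sketch of the idea, assuming a Linux (or other Unix-like) machine and a made-up read-only object:

```r
library(parallel)

big_object <- matrix(rnorm(1e6), ncol = 100)   # stand-in for a large, read-only object

# Forked workers share the parent's memory pages (copy-on-write),
# so big_object is not serialized and shipped to each worker.
cl <- makeForkCluster(4)
res <- parLapply(cl, seq_len(ncol(big_object)), function(j) mean(big_object[, j]))
stopCluster(cl)
```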
I'm fortunate in that my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32-bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.
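A rough sketch of that kind of chunk-wise pre-processing (the file layout, binary format, and filtering rule here are all assumptions for illustration):

```r
# One binary chunk file per ~100 MB block written by the instrument (assumed layout)
chunk_files <- list.files("raw", pattern = "\\.bin$", full.names = TRUE)

processed <- lapply(chunk_files, function(f) {
  x <- readBin(f, what = "double", size = 4, n = file.size(f) / 4)  # 32-bit values
  x <- x[x > 0]                      # drop uninformative parts (placeholder rule)
  x[seq(1, length(x), by = 10)]      # downsample before fusing
})
full_data <- unlist(processed)       # fuse only the reduced chunks
```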
Calling `gc()` "by hand" can help if the size of the data gets close to the available memory.
Sometimes a different algorithm needs much less memory.

Sometimes there's a trade-off between vectorization and memory use: compare `split()` & `lapply()` vs. a `for` loop.
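A small sketch of that trade-off on made-up data: `split()` materialises every group at once, while the loop touches one group at a time at the cost of more code.

```r
df <- data.frame(group = rep(letters[1:3], each = 1000), value = rnorm(3000))

# Vectorised: concise, but all groups are held in memory simultaneously
means_vec <- vapply(split(df$value, df$group), mean, numeric(1))

# Loop: lower peak memory, one group at a time
groups <- unique(df$group)
means_loop <- setNames(numeric(length(groups)), groups)
for (g in groups) {
  means_loop[g] <- mean(df$value[df$group == g])
}
```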
For the sake of fast & easy data analysis, I often work first with a small random subset (`sample()`) of the data. Once the data analysis script/.Rnw is finished, the analysis code and the complete data go to the calculation server for an overnight / over-the-weekend / ... calculation.
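For instance, something along these lines (the data set and subset size are made up):

```r
full_data <- as.data.frame(matrix(rnorm(1e6), ncol = 10))            # stand-in for the real data
idx <- sample(nrow(full_data), size = floor(0.01 * nrow(full_data))) # 1% random subset
dev_data <- full_data[idx, ]   # develop the analysis script against this subset
```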
Tip for dealing with objects requiring heavy intermediate calculation: When using objects that require a lot of heavy computation and intermediate steps to create, I often find it useful to write one chunk of code with the function to create the object, and a separate chunk that gives me the option either to generate the object and save it as an .rds file, or to load it from an .rds file I have already saved. This is especially easy to do in R Markdown using the following code-chunk structure.
```{r Create OBJECT}
COMPLICATED.FUNCTION <- function(x, y, z) {
  # ... heavy calculations needing lots of memory go here ...
  OBJECT   # the function returns the finished object
}
```
```{r Generate or load OBJECT}
# Set LOAD to TRUE to load the previously saved object,
# or to FALSE to generate it and save it.
LOAD <- TRUE

if (LOAD) {
  OBJECT <- readRDS('MySavedObject.rds')
} else {
  OBJECT <- COMPLICATED.FUNCTION(x, y, z)
  saveRDS(OBJECT, file = 'MySavedObject.rds')
}
```
With this code structure, all I need to do is change `LOAD` depending on whether I want to generate and save the object or load it directly from an existing saved file. (Of course, I have to generate and save it the first time, but after this I have the option of loading it.) Setting `LOAD = TRUE`
bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.
I try to keep the number of objects small when working on a larger project with a lot of intermediate steps. So instead of creating many unique objects, such as

dataframe -> step1 -> step2 -> step3 -> result

raster -> multipliedRast -> meanRastF -> sqrtRast -> resultRast

I work with temporary objects that I call `temp`:

dataframe -> temp -> temp -> temp -> result

This leaves me with fewer intermediate objects and a better overview.
```r
library(raster)               # provides raster()
raster <- raster('file.tif')  # load the input raster
temp <- raster * 10           # intermediate result, reusing the name temp
temp <- mean(temp)
resultRast <- sqrt(temp)
```
To save more memory I can simply remove `temp` when it is no longer needed with `rm(temp)`.
If I need several intermediate objects at once, I use `temp1`, `temp2`, `temp3`. For testing I use `test`, `test2`, ...