Tricks to manage the available memory in an R session


What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list] ...

27 Answers
  • 2020-11-22 01:48

    I really appreciate some of the answers above. Following @hadley and @Dirk, who suggest closing R, issuing source, and using the command line, I came up with a solution that worked very well for me. I had to deal with hundreds of mass spectra, each occupying around 20 MB of memory, so I used two R scripts, as follows:

    First a wrapper:

    #!/usr/bin/Rscript --vanilla --default-packages=utils
    
    # fdir and fds are assumed to be defined above (the directories holding the
    # spectra and the files within each one); every (l, k) pair is handled by a
    # fresh R process, so its memory is released after each run.
    for (l in seq_along(fdir)) {
      for (k in seq_along(fds)) {
        system(paste("Rscript runConsensus.r", l, k))
      }
    }
    

    With this wrapper I basically control what my main script, runConsensus.r, does, and I write its results to the output. Because each call in the wrapper starts a new Rscript process, R is effectively restarted for every file and the memory is freed afterwards.

    Hope it helps.

  • 2020-11-22 01:49

    I use the data.table package. With its := operator you can:

    • Add columns by reference
    • Modify subsets of existing columns by reference, and by group by reference
    • Delete columns by reference

    None of these operations copy the (potentially large) data.table at all, not even once.

    Aggregation is also particularly fast because data.table uses much less working memory.
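
    A minimal sketch of those by-reference operations (the table DT and its columns below are just illustrations):

    library(data.table)

    DT <- data.table(id = rep(1:3, each = 4), x = rnorm(12))

    DT[, y := x * 2]                          # add a column by reference
    DT[id == 1, y := 0]                       # modify a subset of a column by reference
    DT[, x_centered := x - mean(x), by = id]  # modify by group, by reference
    DT[, y := NULL]                           # delete a column by reference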

    Related links:

    • News from data.table, London R presentation, 2012
    • When should I use the := operator in data.table?
  • 2020-11-22 01:49

    If you are working on Linux, want to use several processes, and only need to do read operations on one or more large objects, use makeForkCluster instead of makePSOCKcluster. Forked workers share the parent's memory pages (copy-on-write), so this also saves you the time of sending the large object to the other processes.
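
    A small sketch of the idea (the object, the worker count, and the per-column computation are placeholders):

    library(parallel)

    big_obj <- matrix(rnorm(1e7), ncol = 100)   # large, read-only object

    # Forked workers inherit big_obj via copy-on-write, so it is not
    # serialized and shipped to each worker as it would be with a PSOCK cluster.
    cl <- makeForkCluster(nnodes = 4)
    res <- parLapply(cl, seq_len(ncol(big_obj)), function(j) mean(big_obj[, j]))
    stopCluster(cl)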

  • 2020-11-22 01:52
    1. I'm fortunate that my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32-bit binary). That way I can do the pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.

    2. Calling gc() "by hand" can help if the size of the data gets close to the available memory.

    3. Sometimes a different algorithm needs much less memory.
       Sometimes there's a trade-off between vectorization and memory use.
       Compare: split & lapply vs. a for loop.

    4. For the sake of fast & easy data analysis, I often work first with a small random subset (sample()) of the data; see the sketch after this list. Once the data analysis script/.Rnw is finished, the code and the complete data go to the calculation server for overnight / over-the-weekend / ... calculation.
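
    A minimal sketch of that subset-first workflow (file name, subset size, and model are purely illustrative):

    full_data <- read.csv("measurements.csv")                # hypothetical big file
    dev_data  <- full_data[sample(nrow(full_data), 1e4), ]   # small random subset

    rm(full_data)   # drop the big object while developing
    gc()            # and hand the memory back (point 2)

    fit <- lm(response ~ predictor, data = dev_data)         # placeholder analysis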

  • 2020-11-22 01:54

    Tip for dealing with objects requiring heavy intermediate calculation: when using objects that require a lot of heavy calculation and intermediate steps to create, I often find it useful to write one chunk of code with the function that creates the object, and a separate chunk of code that gives me the option either to generate the object and save it as an .rds file, or to load it from an .rds file I have previously saved. This is especially easy to do in R Markdown using the following code-chunk structure.

    ```{r Create OBJECT}
    COMPLICATED.FUNCTION <- function(...) {
      # ... heavy calculations and intermediate steps needing lots of memory ...
      OBJECT   # return the finished object
    }
    ```

    ```{r Generate or load OBJECT}
    # Set LOAD to TRUE to load the previously saved file,
    # or to FALSE to generate the object and save it.
    LOAD <- TRUE

    if (LOAD) {
      OBJECT <- readRDS(file = 'MySavedObject.rds')
    } else {
      OBJECT <- COMPLICATED.FUNCTION(x, y, z)
      saveRDS(OBJECT, file = 'MySavedObject.rds')
    }
    ```
    

    With this code structure, all I need to do is to change LOAD depending on whether I want to generate and save the object, or load it directly from an existing saved file. (Of course, I have to generate it and save it the first time, but after this I have the option of loading it.) Setting LOAD = TRUE bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.

  • 2020-11-22 01:54

    I try to keep the number of objects small when working on a larger project with a lot of intermediate steps. So instead of creating many uniquely named objects like

    dataframe -> step1 -> step2 -> step3 -> result

    raster -> multipliedRast -> meanRastF -> sqrtRast -> resultRast

    I work with temporary objects that I call temp.

    dataframe -> temp -> temp -> temp -> result

    This leaves me with fewer intermediate objects and a better overview.

    library(raster)                # assuming the raster package provides raster()

    raster <- raster('file.tif')
    temp <- raster * 10            # multipliedRast
    temp <- mean(temp)             # meanRastF
    resultRast <- sqrt(temp)       # sqrtRast
    

    To save more memory I can simply remove temp when no longer needed.

    rm(temp)
    

    If I need several intermediate objects, I use temp1, temp2, temp3.

    For testing I use test, test2, ...
