Tricks to manage the available memory in an R session


What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list] ...

27 Answers
  • 2020-11-22 01:48

    I really appreciate some of the answers above. Following @hadley and @Dirk, who suggest closing R, issuing source, and using the command line, I came up with a solution that worked very well for me. I had to deal with hundreds of mass spectra, each occupying around 20 MB of memory, so I used two R scripts, as follows:

    First a wrapper:

    #!/usr/bin/Rscript --vanilla --default-packages=utils
    
    # fdir and fds are assumed to be defined above (the directories holding the
    # spectra and the files within each one); every (l, k) pair is handled by a
    # fresh R process, so its memory is released after each run.
    for (l in seq_along(fdir)) {
      for (k in seq_along(fds)) {
        system(paste("Rscript runConsensus.r", l, k))
      }
    }
    

    With this wrapper I basically control what my main script, runConsensus.r, does, and I write its results to the output. Because each call in the wrapper starts a new Rscript process, R is effectively restarted for every file and the memory is freed afterwards.

    Hope it helps.

  • 2020-11-22 01:49

    I use the data.table package. With its := operator you can:

    • Add columns by reference
    • Modify subsets of existing columns by reference, and by group by reference
    • Delete columns by reference

    None of these operations copy the (potentially large) data.table at all, not even once.

    Aggregation is also particularly fast because data.table uses much less working memory.
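
    A minimal sketch of those by-reference operations (the table DT and its columns below are just illustrations):

    library(data.table)

    DT <- data.table(id = rep(1:3, each = 4), x = rnorm(12))

    DT[, y := x * 2]                          # add a column by reference
    DT[id == 1, y := 0]                       # modify a subset of a column by reference
    DT[, x_centered := x - mean(x), by = id]  # modify by group, by reference
    DT[, y := NULL]                           # delete a column by reference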

    Related links:

    • News from data.table, London R presentation, 2012
    • When should I use the := operator in data.table?
  • 2020-11-22 01:49

    If you are working on Linux, want to use several processes, and only need to do read operations on one or more large objects, use makeForkCluster instead of makePSOCKcluster. Forked workers share the parent's memory pages (copy-on-write), so this also saves you the time of sending the large object to the other processes.
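
    A small sketch of the idea (the object, the worker count, and the per-column computation are placeholders):

    library(parallel)

    big_obj <- matrix(rnorm(1e7), ncol = 100)   # large, read-only object

    # Forked workers inherit big_obj via copy-on-write, so it is not
    # serialized and shipped to each worker as it would be with a PSOCK cluster.
    cl <- makeForkCluster(nnodes = 4)
    res <- parLapply(cl, seq_len(ncol(big_obj)), function(j) mean(big_obj[, j]))
    stopCluster(cl)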

  • 2020-11-22 01:52
    1. I'm fortunate that my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32-bit binary). That way I can do the pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.

    2. Calling gc() "by hand" can help if the size of the data gets close to the available memory.

    3. Sometimes a different algorithm needs much less memory.
       Sometimes there's a trade-off between vectorization and memory use.
       Compare: split & lapply vs. a for loop.

    4. For the sake of fast & easy data analysis, I often work first with a small random subset (sample()) of the data; see the sketch after this list. Once the data analysis script/.Rnw is finished, the code and the complete data go to the calculation server for overnight / over-the-weekend / ... calculation.
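
    A minimal sketch of that subset-first workflow (file name, subset size, and model are purely illustrative):

    full_data <- read.csv("measurements.csv")                # hypothetical big file
    dev_data  <- full_data[sample(nrow(full_data), 1e4), ]   # small random subset

    rm(full_data)   # drop the big object while developing
    gc()            # and hand the memory back (point 2)

    fit <- lm(response ~ predictor, data = dev_data)         # placeholder analysis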

  • 2020-11-22 01:54

    Tip for dealing with objects requiring heavy intermediate calculation: when using objects that require a lot of heavy calculation and intermediate steps to create, I often find it useful to write one chunk of code with the function that creates the object, and a separate chunk of code that gives me the option either to generate the object and save it as an .rds file, or to load it from an .rds file I have previously saved. This is especially easy to do in R Markdown using the following code-chunk structure.

    ```{r Create OBJECT}
    COMPLICATED.FUNCTION <- function(...) {
      # ... heavy calculations and intermediate steps needing lots of memory ...
      OBJECT   # return the finished object
    }
    ```

    ```{r Generate or load OBJECT}
    # Set LOAD to TRUE to load the previously saved file,
    # or to FALSE to generate the object and save it.
    LOAD <- TRUE

    if (LOAD) {
      OBJECT <- readRDS(file = 'MySavedObject.rds')
    } else {
      OBJECT <- COMPLICATED.FUNCTION(x, y, z)
      saveRDS(OBJECT, file = 'MySavedObject.rds')
    }
    ```
    

    With this code structure, all I need to do is to change LOAD depending on whether I want to generate and save the object, or load it directly from an existing saved file. (Of course, I have to generate it and save it the first time, but after this I have the option of loading it.) Setting LOAD = TRUE bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.

  • 2020-11-22 01:54

    I try to keep the number of objects small when working on a larger project with a lot of intermediate steps. So instead of creating many uniquely named objects like

    dataframe -> step1 -> step2 -> step3 -> result

    raster -> multipliedRast -> meanRastF -> sqrtRast -> resultRast

    I work with temporary objects that I call temp.

    dataframe -> temp -> temp -> temp -> result

    This leaves me with fewer intermediate objects and a better overview.

    library(raster)                # assuming the raster package provides raster()

    raster <- raster('file.tif')
    temp <- raster * 10            # multipliedRast
    temp <- mean(temp)             # meanRastF
    resultRast <- sqrt(temp)       # sqrtRast
    

    To save more memory I can simply remove temp when no longer needed.

    rm(temp)
    

    If I need several intermediate objects, I use temp1, temp2, temp3.

    For testing I use test, test2, ...
