Tricks to manage the available memory in an R session

情深已故 2020-11-22 01:23

What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help mailing list] to list (and sort) the largest objects in my workspace and to occasionally rm() the ones I no longer need.
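
The original helper functions are truncated above. As a hedged sketch (not the poster's actual code; the name .ls.sizes is purely illustrative), a helper that lists the largest objects in the workspace might look like this:

    # Hypothetical helper: list the n largest objects in the global
    # environment by size in bytes, largest first.
    .ls.sizes <- function(n = 10) {
      objs  <- ls(envir = .GlobalEnv)
      sizes <- vapply(objs,
                      function(x) as.numeric(object.size(get(x, envir = .GlobalEnv))),
                      numeric(1))
      head(sort(sizes, decreasing = TRUE), n)
    }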

27 answers
  • 2020-11-22 01:40

    With only 4GB of RAM (running Windows 10, so realistically about 2GB, and often only 1GB, actually available) I've had to be really careful with memory allocation.

    I use data.table almost exclusively.

    The 'fread' function lets you select fields by name at import time, so only import the fields that are actually needed to begin with. If you're using base R read functions, null the spurious columns immediately after import.
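
    As a hedged sketch (the file and column names here are made up), selecting columns at import time with data.table::fread, versus dropping them after a base R read, looks roughly like this:

    library(data.table)

    # import only the fields that are actually needed (names are illustrative)
    dt <- fread("big_file.csv", select = c("id", "date", "value"))

    # base R equivalent: read everything, then null the spurious columns at once
    df <- read.csv("big_file.csv")
    df$unwanted_column <- NULL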

    As 42- suggests, wherever possible I then subset the data by the values in those columns immediately after importing it.

    I frequently rm() objects from the environment as soon as they're no longer needed, e.g. on the next line after using them to subset something else, and call gc().
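
    For example (object and column names are purely illustrative, and data.table syntax is assumed):

    keep <- big_dt[value > 0]   # subset out only what is needed
    rm(big_dt)                  # drop the large intermediate straight away
    gc()                        # trigger garbage collection so the freed memory is reclaimed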

    'fread' and 'fwrite' from data.table can be very fast by comparison with base R reads and writes.

    As kpierce8 suggests, I almost always fwrite everything out of the environment and fread it back in, even when that means thousands or hundreds of thousands of tiny files to get through. This not only keeps the environment 'clean' and the memory allocation low, but, possibly due to the severe lack of RAM, R crashes frequently on my computer. Having the information backed up on the drive itself as the code progresses through its various stages means I don't have to start from the beginning if it crashes.
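
    A minimal sketch of that checkpointing pattern (the file and object names are placeholders):

    fwrite(stage1_result, "stage1_result.csv")   # checkpoint the stage to disk
    rm(stage1_result); gc()                      # clear it out of RAM

    # later, or after a crash, resume from the checkpoint instead of from scratch
    stage1_result <- fread("stage1_result.csv")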

    As of 2017, I think the fastest SSDs run at a few GB per second through the M.2 port. I have a really basic 50GB Kingston V300 (550MB/s) SSD that I use as my primary disk (it has Windows and R on it), and I keep all the bulk data on a cheap 500GB WD platter drive. I move data sets to the SSD when I start working on them. This, combined with 'fread'ing and 'fwrite'ing everything, has been working out great. I've tried 'ff' but prefer the former. 4K read/write speeds can create issues with this approach, though; backing up a quarter of a million 1KB files (250MB worth) from the SSD to the platter can take hours.

    As far as I'm aware, there isn't yet any R package that can automatically optimise this 'chunkification' process; e.g. look at how much RAM the user has, test the read/write speeds of the RAM and of all the attached drives, and then suggest an optimal chunking protocol. That could produce some significant workflow improvements and resource optimisations; e.g. split into ... MB chunks for the RAM -> ... MB for the SSD -> ... MB on the platter -> ... MB on the tape. It could sample the data sets beforehand to give it a more realistic yardstick to work from.

    A lot of the problems I've worked on in R involve forming combination and permutation pairs, triples, etc., which makes limited RAM even more of a constraint, since these expansions often grow at least exponentially at some point. This has made me focus a lot of attention on the quality, as opposed to the quantity, of information going into them to begin with, rather than trying to clean it up afterwards, and on the sequence of operations used to prepare the information (starting with the simplest operation and increasing the complexity); e.g. subset, then merge / join, then form combinations / permutations, as sketched below.
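
    A hedged illustration of that ordering, assuming data.table is loaded and with made-up table and column names:

    dt <- dt[status == "active"]          # 1. subset first, so later steps see less data
    dt <- merge(dt, lookup, by = "id")    # 2. then merge / join the reduced tables
    pairs <- CJ(a = dt$id, b = dt$id)     # 3. only then form combinations (data.table::CJ cross join)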

    There do seem to be some benefits to using base R read and write functions in some instances. For instance, the error detection in 'fread' is so strict that it can be difficult to get really messy data into R in the first place in order to clean it up.

    Base R also seems to be a lot easier if you're using Linux: it works fine there, Windows 10 uses ~20GB of disk space whereas Ubuntu only needs a few GB, and the RAM footprint under Ubuntu is slightly lower. I have, however, noticed large quantities of warnings and errors when installing third-party packages in (L)Ubuntu. I wouldn't recommend drifting too far away from (L)Ubuntu or other stock distributions, as you can lose so much overall compatibility that it renders the process almost pointless (I think Unity is due to be dropped from Ubuntu as of 2017). I realise this won't go down well with some Linux users, but some of the custom distributions are borderline pointless beyond novelty (I've spent years using Linux alone).

    Hopefully some of that might help others out.

  • 2020-11-22 01:42

    If you really want to avoid leaks, you should avoid creating any big objects in the global environment.

    What I usually do is to have a function that does the job and returns NULL — all data is read and manipulated in this function or others that it calls.
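
    A minimal sketch of that pattern (the function, column and file names are only illustrative):

    process_file <- function(path) {
      dat <- read.csv(path)                                    # big object stays local
      result <- aggregate(value ~ group, data = dat, FUN = sum)
      write.csv(result, "summary.csv", row.names = FALSE)      # persist what matters
      invisible(NULL)                                          # return nothing big
    }

    process_file("big_file.csv")   # 'dat' becomes collectable as soon as the call returns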

  • 2020-11-22 01:44

    That's a good trick.

    One other suggestion is to use memory-efficient objects wherever possible: for instance, use a matrix instead of a data.frame.

    This doesn't really address memory management, but one important function that isn't widely known is memory.limit() (Windows only). You can increase the default with memory.limit(size = 2500), where size is in MB. As Dirk mentioned, you need to be running 64-bit R to take real advantage of this.
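
    As a rough check of the matrix-versus-data.frame point (the exact numbers depend on your system and data):

    m  <- matrix(rnorm(1e6), ncol = 10)
    df <- as.data.frame(m)

    print(object.size(m),  units = "MB")   # one contiguous numeric block
    print(object.size(df), units = "MB")   # the same values plus per-column overhead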

  • 2020-11-22 01:46

    As well as the more general memory management techniques given in the answers above, I always try to reduce the size of my objects as far as possible. For example, I work with very large but very sparse matrices, in other words matrices where most values are zero. Using the 'Matrix' package (capitalisation important) I was able to reduce my average object sizes from ~2GB to ~200MB as simply as:

    my.matrix <- Matrix(my.matrix)
    

    The Matrix package includes data formats that can be used exactly like a regular matrix (no need to change your other code) but are able to store sparse data much more efficiently, whether loaded into memory or saved to disk.
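
    For example, a mostly-zero matrix shrinks dramatically once converted; the figures below are only indicative:

    library(Matrix)

    m <- matrix(0, nrow = 10000, ncol = 1000)
    m[sample(length(m), 5000)] <- rnorm(5000)   # ~0.05% of entries are non-zero

    print(object.size(m), units = "MB")         # dense storage: roughly 80 MB
    sm <- Matrix(m, sparse = TRUE)
    print(object.size(sm), units = "MB")        # sparse storage: a small fraction of that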

    Additionally, the raw files I receive are in 'long' format, where each data point has variables x, y, z and i. It is much more efficient to transform the data into an x * y * z array containing only variable i.

    Know your data and use a bit of common sense.

  • 2020-11-22 01:47

    Unfortunately I did not have time to test it extensively, but here is a memory tip that I have not seen before. For me the required memory was reduced by more than 50%. When you read data into R with, for example, read.csv, it requires a certain amount of memory. After this you can save the workspace with save(list = ls(), file = "Destinationfile"). The next time you open R you can use load("Destinationfile"), and the memory usage may now be lower. It would be nice if anyone could confirm whether this produces similar results with a different dataset.
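
    A sketch of the workflow being described (the input file name is made up; whether the reload actually reduces memory use will depend on your data):

    dat <- read.csv("mydata.csv")                # initial import
    save(list = ls(), file = "Destinationfile")  # serialise the workspace to disk

    # in a fresh R session:
    load("Destinationfile")                      # reload; memory use may now be lower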

  • Running

    # call gc() repeatedly; reset = TRUE also resets the "max used" statistics it reports
    for (i in 1:10)
        gc(reset = TRUE)
    

    from time to time also helps R free memory that is no longer used but has not yet been released.
