Recommended package for very large dataset processing and machine learning in R [closed]

前端未结

关注

 5  1050

旧巷少年郎

相关标签:

5条回答

一整个雨季

2020-12-04 09:01

If the memory is not sufficient enough, one solution is push data to disk and using distributed computing. I think RHadoop(R+Hadoop) may be one of the solution to tackle with large amount dataset.

0 讨论(0)
发布评论:

提交评论
- 加载中...
终归单人心

2020-12-04 09:14

Have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics, and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler Oehlschlägel and colleagues' excellent slide presentations on the ff website.

Also, consider storing data in a database and reading in smaller batches for analysis. There are likely any number of approaches to consider. To get started, consdier looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.

And do investigate the other packages on the high-performance computing task view and mentioned in the other answers. The packages I mention above are simply the ones I've happened to have more experience with.

0 讨论(0)
发布评论:

提交评论
- 加载中...
你的背包

2020-12-04 09:15

I think the amount of data you can process is more limited by ones programming skills than anything else. Although a lot of standard functionality is focused on in memory analysis, cutting your data into chunks already helps a lot. Ofcourse, this takes more time to program than picking up standard R code, but often times it is quite possible.

Cutting up data can for exale be done using read.table or readBin which support only reading a subset of the data. Alternatively, you can take a look at the high performance computing task view for packages which deliver out of the box out of memory functionality. You could also put your data in a database. For spatial raster data, the excellent raster package provides out of memory analysis.

0 讨论(0)
发布评论:

提交评论
- 加载中...
小蘑菇

2020-12-04 09:15

For machine learning tasks I can recommend using biglm package, used to do "Regression for data too large to fit in memory". For using R with really big data, one can use Hadoop as a backend and then use package rmr to perform statistical (or other) analysis via MapReduce on a Hadoop cluster.

0 讨论(0)
发布评论:

提交评论
- 加载中...
清歌不尽

2020-12-04 09:26

It all depends on algorithms you need. If they may be translated into incremental form (when only small part of data is needed at any given moment, e.g. for Naive Bayes you can hold in memory only the model itself and current observation being processed), then the best suggestion is to perform machine learning incrementally, reading new batches of data from disk.

However, many algorithms and especially their implementations really require the whole dataset. If size of the dataset fits you disk (and file system limitations), you can use mmap package that allows to map file on disk to memory and use it in the program. Note however, that read-writes to disk are expensive, and R sometimes likes to move data back and forth frequently. So be careful.

If your data can't be stored even on you hard drive, you will need to use distributed machine learning systems. One such R-based system is Revolution R which is designed to handle really large datasets. Unfortunately, it is not open source and costs quite a lot of money, but you may try to get free academic license. As alternative, you may be interested in Java-based Apache Mahout - not so elegant, but very efficient solution, based on Hadoop and including many important algorithms.

0 讨论(0)
发布评论:

提交评论
- 加载中...

热议问题