How can I plot a very large data set in R?
I\'d like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and
In supplement to my comment to Dmitri answer, a function to calculate quantiles using ff
big-data handling package:
ffquantile<-function(ffv,qs=c(0,0.25,0.5,0.75,1),...){
stopifnot(all(qs<=1 & qs>=0))
ffsort(ffv,...)->ffvs
j<-(qs*(length(ffv)-1))+1
jf<-floor(j);ceiling(j)->jc
rowSums(matrix(ffvs[c(jf,jc)],length(qs),2))/2
}
This is an exact algorithm, so it uses sorting -- and thus may take a lot of time.
This is an interesting problem.
Boxplots require quantiles. Computing quantiles on very large datasets is tricky.
The simplest solution that may or may not work in your case is to downsample the data first, and produce plots of the sample. In other words, read a bunch of records at a time, and retain a subset of them in memory (choosing either deterministically or randomly.) At the end, produce plots based on the data that's been retained in memory. Again, whether or not this is viable very much depends on the properties of your data.
Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time, and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily-available R implementations.
The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.