问题
How can I plot a very large data set in R?
I'd like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and calculate the summaries needed to make these plots? If so how?
回答1:
In supplement to my comment to Dmitri answer, a function to calculate quantiles using ff
big-data handling package:
ffquantile<-function(ffv,qs=c(0,0.25,0.5,0.75,1),...){
stopifnot(all(qs<=1 & qs>=0))
ffsort(ffv,...)->ffvs
j<-(qs*(length(ffv)-1))+1
jf<-floor(j);ceiling(j)->jc
rowSums(matrix(ffvs[c(jf,jc)],length(qs),2))/2
}
This is an exact algorithm, so it uses sorting -- and thus may take a lot of time.
回答2:
Problem is you can't load all data into the memory. So you could do sampling of the data, as indicated earlier by @Marek. On such a huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but this should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.
If you can't subset the date outside of R, it can be done using connections in combination with sample()
. Following function is what I use to sample data from a dataframe in text format when it's getting too big. If you play a bit with the connection, you could easily convert this to a socketConnection or other to read it from a server, a database, whatever. Just make sure you open the connection in the correct mode.
Good, take a simple .csv file, then following function samples a fraction p of the data:
sample.df <- function(f,n=10000,split=",",p=0.1){
con <- file(f,open="rt",)
on.exit(close(con,type="rt"))
y <- data.frame()
#read header
x <- character(0)
while(length(x)==0){
x <- strsplit(readLines(con,n=1),split)[[1]]
}
Names <- x
#read and process data
repeat{
x <- tryCatch(read.table(con,nrows=n,sep=split),error = function(e) NULL )
if(is.null(x)) {break}
names(x) <- Names
nn <- nrow(x)
id <- sample(1:nn,round(nn*p))
y <- rbind(y,x[id,])
}
rownames(y) <- NULL
return(y)
}
An example of the usage :
#Make a file
Df <- data.frame(
X1=1:10000,
X2=1:10000,
X3=rep(letters[1:10],1000)
)
write.csv(Df,file="test.txt",row.names=F,quote=F)
# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)
#clean up
unlink("test.txt")
回答3:
You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets is can be useful to store the data in a database and pull only pieces into R. The databases can also do sorting for you and then computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
There is also the hexbin package (bioconductor) for doing scatterplot equivalents with very large datasets (probably still want to use a sample of the data, but works with a large sample).
回答4:
You could put the data into a database and calculate the quantiles using SQL. See : http://forge.mysql.com/tools/tool.php?id=149
回答5:
All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), which is all easily precomputed. Take a look at the boxplot.stats
function.
回答6:
You could make plots from manageable sample of your data. E.g. if you use only 10% randomly chosen rows then boxplot on this sample shouldn't differ from all-data boxplot.
If your data are on some database there you be able to create some random flag (as I know almost every database engine has some kind of random number generator).
Second thing is how large is your dataset? For boxplot you need two columns: value variable and group variable. This example:
N <- 1e6
x <- rnorm(N)
b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
g <- factor(sample(b,N,TRUE))
boxplot(x~g)
needs 100MB of RAM. If N=1e7
then it uses <1GB of RAM (which is still manageable to modern machine).
回答7:
This is an interesting problem.
Boxplots require quantiles. Computing quantiles on very large datasets is tricky.
The simplest solution that may or may not work in your case is to downsample the data first, and produce plots of the sample. In other words, read a bunch of records at a time, and retain a subset of them in memory (choosing either deterministically or randomly.) At the end, produce plots based on the data that's been retained in memory. Again, whether or not this is viable very much depends on the properties of your data.
Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time, and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily-available R implementations.
The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.
回答8:
Perhaps you can think about using disk.frame to summarise the data down first before running the plotting?
来源:https://stackoverflow.com/questions/4341052/plotting-of-very-large-data-sets-in-r