Plotting of very large data sets in R

前端未结

关注

 8  1484

陌清茗 2021-01-31 04:16

How can I plot a very large data set in R?

I\'d like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and

8条回答

囚心锁ツ (楼主)

2021-01-31 04:19
Problem is you can't load all data into the memory. So you could do sampling of the data, as indicated earlier by @Marek. On such a huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but this should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.

If you can't subset the date outside of R, it can be done using connections in combination with sample(). Following function is what I use to sample data from a dataframe in text format when it's getting too big. If you play a bit with the connection, you could easily convert this to a socketConnection or other to read it from a server, a database, whatever. Just make sure you open the connection in the correct mode.

Good, take a simple .csv file, then following function samples a fraction p of the data:
```
sample.df <- function(f,n=10000,split=",",p=0.1){
    con <- file(f,open="rt",)
    on.exit(close(con,type="rt"))
    y <- data.frame()
    #read header
    x <- character(0)
    while(length(x)==0){
      x <- strsplit(readLines(con,n=1),split)[[1]]
    }
    Names <- x
    #read and process data
    repeat{
      x <- tryCatch(read.table(con,nrows=n,sep=split),error = function(e) NULL )
      if(is.null(x)) {break}
      names(x) <- Names
      nn <- nrow(x)
      id <- sample(1:nn,round(nn*p))
      y <- rbind(y,x[id,])
    }
    rownames(y) <- NULL
    return(y)
}
```
An example of the usage :
```
#Make a file
Df <- data.frame(
  X1=1:10000,
  X2=1:10000,
  X3=rep(letters[1:10],1000)
)
write.csv(Df,file="test.txt",row.names=F,quote=F)

# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)

#clean up
unlink("test.txt")
```
0 讨论(0)

查看其它8个回答
发布评论:

提交评论
- 加载中...