Plotting of very large data sets in R

陌清茗 2021-01-31 04:16

How can I plot a very large data set in R?

I'd like to use a boxplot, violin plot, or similar. The data cannot all fit in memory. Can I incrementally read in and

8 answers
  •  囚心锁ツ
    2021-01-31 04:19

    The problem is that you can't load all the data into memory, so you could sample the data instead, as @Marek indicated earlier. On a data set this large, you get essentially the same results even if you take only 1% of the data. For the violin plot, this gives you a decent estimate of the density. Exact quantiles cannot be calculated progressively, but a sample gives a very good approximation. It is essentially the same as the "randomized method" described in the link @aix gave.
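
    As a rough illustration of that claim, the sketch below compares the quartiles of a full vector with those of a 1% random sample. The data are synthetic (generated with rnorm) and small enough to fit in memory, purely so that both sides can be computed; they are an assumption, not the data from the question.

    ```r
    set.seed(42)                        # fixed seed so the sketch is repeatable
    full <- rnorm(1e6)                  # synthetic stand-in for the large data set
    samp <- sample(full, size = 0.01 * length(full))   # 1% simple random sample
    probs <- c(0.25, 0.5, 0.75)
    q.full <- quantile(full, probs)     # quartiles of all values
    q.samp <- quantile(samp, probs)     # quartiles of the sample only
    print(round(rbind(full = q.full, sample = q.samp), 3))
    ```

    How close the two rows are depends on the sample size and on how heavy the tails of the real data are, so it is worth checking on your own data before relying on 1%.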

    If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame stored in text format when it gets too big. With a little work on the connection, you could easily convert this to use a socketConnection or another connection type to read from a server, a database, or whatever. Just make sure you open the connection in the correct mode.

    For a simple .csv file, the following function samples a fraction p of the data:

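    sample.df <- function(f, n = 10000, split = ",", p = 0.1) {
        con <- file(f, open = "rt")
        on.exit(close(con))
        # read the header: skip leading empty lines, fail cleanly on an empty file
        x <- character(0)
        while (length(x) == 0) {
          line <- readLines(con, n = 1)
          if (length(line) == 0) stop("no header line found in ", f)
          x <- strsplit(line, split, fixed = TRUE)[[1]]
        }
        Names <- x
        # read n rows at a time and keep a random fraction p of each chunk
        chunks <- list()
        repeat {
          # read.table() signals an error once the connection is exhausted
          x <- tryCatch(read.table(con, nrows = n, sep = split, stringsAsFactors = FALSE),
                        error = function(e) NULL)
          if (is.null(x)) break
          names(x) <- Names
          nn <- nrow(x)
          id <- sample.int(nn, round(nn * p))
          chunks[[length(chunks) + 1]] <- x[id, , drop = FALSE]
        }
        # bind once at the end instead of growing a data frame inside the loop
        y <- if (length(chunks) > 0) do.call(rbind, chunks) else data.frame()
        rownames(y) <- NULL
        return(y)
    }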
    

    An example of the usage:

    #Make a file
    Df <- data.frame(
      X1=1:10000,
      X2=1:10000,
      X3=rep(letters[1:10],1000)
    )
    write.csv(Df,file="test.txt",row.names=FALSE,quote=FALSE)
    
    # n is number of lines to be read at once, p is the fraction to sample
    DF2 <- sample.df("test.txt",n=1000,p=0.2)
    str(DF2)
    
    #clean up
    unlink("test.txt")
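
    Once the sample fits in memory, plotting works as usual. A minimal sketch with base graphics follows. The DF2 built here is a synthetic stand-in that reuses column names from the example above (an assumption, so that the snippet runs on its own), and the plot is written to a temporary PDF file rather than to the screen.

    ```r
    set.seed(1)
    # Hypothetical stand-in for the data frame returned by sample.df()
    DF2 <- data.frame(
      X1 = rnorm(2000),
      X3 = sample(letters[1:10], 2000, replace = TRUE)
    )
    out <- tempfile(fileext = ".pdf")
    pdf(out)                            # open a PDF graphics device
    # one box per level of X3; boxplot() also returns the plotted statistics
    bp <- boxplot(X1 ~ X3, data = DF2, xlab = "X3", ylab = "X1",
                  main = "Boxplot of the sampled rows")
    invisible(dev.off())                # close the device so the file is written
    ```

    For a violin plot, the same data frame could be passed to a package that provides one, such as ggplot2 with geom_violin().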
    
