Plotting of very large data sets in R

陌清茗 2021-01-31 04:16

How can I plot a very large data set in R?

I'd like to use a boxplot, or violin plot, or similar. All the data cannot be fit in memory. Can I incrementally read in and

8 answers
  • 2021-01-31 04:19

    The problem is that you can't load all the data into memory. So you could sample the data, as indicated earlier by @Marek. On a data set this large, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Exact quantiles cannot be calculated progressively, but sampling should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.
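
    To get a feel for how close a 1% sample comes, here is a minimal sketch on simulated data. The exponential data and the names x, s and probs are assumptions made up for illustration; your own data will behave differently in the tails.

    ```r
    set.seed(42)
    x <- rexp(1e6)                            # stand-in for the "full" data set
    s <- sample(x, size = length(x) / 100)    # simple random sample of 1%
    probs <- c(0.25, 0.5, 0.75)

    # quartiles of the full data next to those of the sample
    cmp <- rbind(full = quantile(x, probs), sample = quantile(s, probs))
    print(cmp)
    ```

    The two rows should agree to roughly two decimal places here; extreme quantiles and outliers are estimated less reliably from a sample.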

    If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame in text format when it gets too big. If you play a bit with the connection, you could easily convert this to a socketConnection or another connection type to read from a server, a database, whatever. Just make sure you open the connection in the correct mode.

    Take a simple .csv file; the following function reads it in chunks of n lines and keeps a random fraction p of each chunk:

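    sample.df <- function(f, n = 10000, split = ",", p = 0.1) {
        con <- file(f, open = "rt")
        on.exit(close(con))
        y <- data.frame()
        # read the header: skip empty lines until the column names appear
        x <- character(0)
        while (length(x) == 0) {
          x <- strsplit(readLines(con, n = 1), split)[[1]]
        }
        Names <- x
        # read the data n lines at a time and keep a fraction p of each chunk
        repeat {
          # read.table() throws an error once the connection is exhausted
          x <- tryCatch(read.table(con, nrows = n, sep = split),
                        error = function(e) NULL)
          if (is.null(x)) break
          names(x) <- Names
          nn <- nrow(x)
          # seq_len() avoids the 1:0 trap; drop = FALSE keeps a data frame
          id <- sample(seq_len(nn), round(nn * p))
          y <- rbind(y, x[id, , drop = FALSE])
        }
        rownames(y) <- NULL
        return(y)
    }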
    

    An example of the usage:

    #Make a file
    Df <- data.frame(
      X1=1:10000,
      X2=1:10000,
      X3=rep(letters[1:10],1000)
    )
    write.csv(Df, file = "test.txt", row.names = FALSE, quote = FALSE)
    
    # n is number of lines to be read at once, p is the fraction to sample
    DF2 <- sample.df("test.txt",n=1000,p=0.2)
    str(DF2)
    
    #clean up
    unlink("test.txt")
    
  • 2021-01-31 04:22

    Perhaps you could use disk.frame to summarise the data first, before doing the plotting?

  • 2021-01-31 04:29

    All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), all of which are easy to precompute. Take a look at the boxplot.stats function.
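
    A minimal sketch of drawing from precomputed statistics, assuming each group can be summarised on its own (here with boxplot.stats on simulated data; the groups a and b are made up). bxp() is the base graphics function that draws a boxplot from the list structure that boxplot() returns, so the raw data are not needed at plotting time.

    ```r
    set.seed(1)
    # stand-in for groups that would be summarised one at a time
    groups <- list(a = rnorm(1000), b = rnorm(1000, mean = 1))
    per_group <- lapply(groups, boxplot.stats)

    # assemble the list structure that boxplot() would return
    n_out <- sapply(per_group, function(s) length(s$out))
    z <- list(
      stats = sapply(per_group, function(s) s$stats),   # 5 x k matrix
      n     = sapply(per_group, function(s) s$n),
      conf  = sapply(per_group, function(s) s$conf),    # 2 x k matrix
      out   = unlist(lapply(per_group, function(s) s$out), use.names = FALSE),
      group = rep(seq_along(per_group), n_out),         # group index of each outlier
      names = names(per_group)
    )
    bxp(z)   # draws the boxplots from the summaries only
    ```

    The same list could be filled with five-number summaries computed elsewhere, for example in a database.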

  • 2021-01-31 04:29

    You could make plots from a manageable sample of your data. E.g. if you use only 10% of the rows, chosen at random, then the boxplot of this sample shouldn't differ much from the boxplot of all the data.

    If your data are in a database, you should be able to create a random flag there (as far as I know, almost every database engine has some kind of random number generator).

    The second question is: how large is your data set? For a boxplot you need two columns: a value variable and a group variable. This example:

    N <- 1e6
    x <- rnorm(N)
    b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
    g <- factor(sample(b,N,TRUE))
    boxplot(x~g)
    

    needs about 100 MB of RAM. With N=1e7 it uses less than 1 GB of RAM (which is still manageable on a modern machine).

  • 2021-01-31 04:31

    You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large data sets it can be useful to store the data in a database and pull only pieces into R. The database can also do the sorting for you, and computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
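
    As a rough sketch of pulling only a piece into R, assuming the DBI and RSQLite packages are installed: the table obs and its columns value and grp are made up for illustration, and an in-memory database stands in for a real on-disk one. SQLite's random() returns a signed integer, so random() % 10 = 0 keeps roughly one row in ten.

    ```r
    library(DBI)

    set.seed(1)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")   # in-memory stand-in
    dbWriteTable(con, "obs",
                 data.frame(value = rnorm(1e5),
                            grp   = sample(letters[1:5], 1e5, TRUE)))

    # let the database draw the ~10% sample; only the sample reaches R
    smp <- dbGetQuery(con,
                      "SELECT value, grp FROM obs WHERE random() % 10 = 0")
    dbDisconnect(con)

    boxplot(value ~ grp, data = smp)
    ```

    Other database engines have their own random functions and syntax, so the WHERE clause would need adapting.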

    There is also the hexbin package (Bioconductor) for doing scatterplot equivalents with very large data sets (you probably still want to use a sample of the data, but it works with a large sample).

  • 2021-01-31 04:37

    You could put the data into a database and calculate the quantiles using SQL. See: http://forge.mysql.com/tools/tool.php?id=149
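
    This is not the tool behind the link, just one simple way to do it, sketched with SQLite through the DBI and RSQLite packages (assumed to be installed). The table obs and the helper sql_quantile are hypothetical names. The idea is to let the database sort and return only the row at the wanted rank; it uses the nearest lower rank without interpolation, so results can differ slightly from R's default quantile().

    ```r
    library(DBI)

    set.seed(1)
    x <- rnorm(1e5)                                   # stand-in data
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "obs", data.frame(value = x))

    n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM obs")$n

    # hypothetical helper: value at the zero-based rank floor(p * (n - 1))
    sql_quantile <- function(p) {
      k <- as.integer(floor(p * (n - 1)))
      dbGetQuery(con, sprintf(
        "SELECT value FROM obs ORDER BY value LIMIT 1 OFFSET %d", k))$value
    }

    q <- sapply(c(0.25, 0.5, 0.75), sql_quantile)     # quartiles from SQL
    dbDisconnect(con)
    print(q)
    ```

    An index on the value column would spare the database a full sort for every query.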
