Boosting ggplot2 performance

生来不讨喜 2021-02-05 05:37

The ggplot2 package is easily the best plotting system I have ever worked with, except that its performance is not really good for larger datasets (~50k points). I'm lo

1 answer
  • Hadley gave a cool talk about his new packages dplyr and ggvis at useR! 2013, but he can probably tell you more about that himself.

    I'm not sure what your application design looks like, but I often do in-database pre-processing before feeding the data to R. For example, if you are plotting time series, there is really no need to show every second of the day on the X axis. Instead you might want to aggregate and get the min/max/mean over, e.g., one- or five-minute intervals.

    Below is an example of a function I wrote years ago that did something like that in SQL. This particular example uses the modulo operator because times were stored as epoch millis. If the data in SQL are properly stored as date/datetime structures, SQL has more elegant native methods to aggregate by time period.

    #' @param table name of the table
    #' @param start start time/date
    #' @param end end time/date
    #' @param aggregate one of "days", "hours", "mins" or "weeks"
    #' @param group grouping variable
    #' @param column name of the target column (y axis)
    #' @export
    minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){
    
      #dates
      start <- round(unclass(as.POSIXct(start))*1000);
      end <- round(unclass(as.POSIXct(end))*1000);
    
      #must aggregate
      aggregate <- match.arg(aggregate);
    
      #calculate modulus
      mod <- switch(aggregate,
        "mins"   = 1000*60,
        "hours"  = 1000*60*60,
        "days"   = 1000*60*60*24,
        "weeks"  = 1000*60*60*24*7,
        stop("invalid aggregate value")
      );
    
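      #we need to add the time difference between gmt and pst to make modulo work
      #(units="hours" is fixed explicitly so the offset is never returned in secs or mins)
      delta <- 1000 * 60 * 60 * (24 - as.numeric(difftime(as.POSIXct(format(Sys.time(), tz="GMT")), Sys.time(), units="hours")));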
    
      #form query
      query <- paste("SELECT", group, "AS grouping, AVG(", column, ") AS yavg, MAX(", column, ") AS ymax, MIN(", column, ") AS ymin, ((CMilliseconds_g +", delta, ") DIV", mod, ") AS timediv FROM", table, "WHERE CMilliseconds_g BETWEEN", start, "AND", end, "GROUP BY", group, ", timediv;")
      #getquery is a helper not shown here; it is assumed to run the query and return a data frame
      mydata <- getquery(query);
    
      #data
      mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
      mydata$grouping <- as.factor(mydata$grouping)
    
      #round timestamps
      if(aggregate %in% c("mins", "hours")){
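        #round() on a POSIXct returns POSIXlt, so convert back before storing in the data frame
        mydata$time <- as.POSIXct(round(mydata$time, aggregate))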
      } else {
        mydata$time <- as.Date(mydata$time);
      }
    
      #return
      return(mydata)
    }
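
    If the data are already in R rather than in a database, the same bucketing idea can be sketched in base R. This is only an illustration, not part of the original function: the name aggregate_minmax, its arguments and the simulated data are all hypothetical, and it assumes the timestamps are POSIXct values.

    ```r
    # Hypothetical helper (not from the original answer): bucket a series into
    # fixed-width time intervals and keep min/mean/max per bucket.
    aggregate_minmax <- function(time, y, width_secs = 60) {
      stopifnot(inherits(time, "POSIXct"), length(time) == length(y))
      # integer-divide epoch seconds by the bucket width (same idea as DIV in the SQL)
      bucket <- as.numeric(time) %/% width_secs
      # tapply orders its groups by sorted bucket value, matching sort(unique(bucket))
      data.frame(
        time = as.POSIXct(sort(unique(bucket)) * width_secs,
                          origin = "1970-01-01", tz = "UTC"),
        ymin = as.vector(tapply(y, bucket, min)),
        yavg = as.vector(tapply(y, bucket, mean)),
        ymax = as.vector(tapply(y, bucket, max))
      )
    }

    # Simulated data: 50k points, one per second (assumed for illustration only)
    set.seed(1)
    time <- as.POSIXct("2021-02-05 00:00:00", tz = "UTC") + 0:49999
    y <- cumsum(rnorm(50000))

    # Five-minute buckets reduce 50000 rows to 167
    small <- aggregate_minmax(time, y, width_secs = 300)
    nrow(small)
    ```

    The reduced data frame can then be handed to ggplot2, for instance with geom_ribbon() mapped to ymin/ymax and geom_line() mapped to yavg, so only a few hundred rows are drawn instead of every point.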
    