aggregate 1-minute data into 5-minute average data

有刺的猬 2020-12-01 05:35

My question is how to aggregate data collected every minute into 5-minute averages.

DeviceTime         Concentration
6/20/2013 11:13       
6/20/201         


        
3 answers
  • 2020-12-01 06:13

    If your data doesn't start on a nice 5-minute wall clock boundary (as shown in your sample data – 11:13), note that cut() will create breakpoints based on the first timestamp it finds. This probably isn't what we normally want. Indeed, your sample output indicates this is not what you want.

    Here's what cut() does:

    df <- read.table(header=TRUE, sep=",", stringsAsFactors=FALSE, text="
    DeviceTime,Concentration
    6/20/2013 11:13,1
    6/20/2013 11:14,1
    6/20/2013 11:15,2
    6/20/2013 11:16,2
    6/20/2013 11:17,2
    6/20/2013 11:18,2
    6/20/2013 11:19,2
    6/20/2013 11:20,3
    6/20/2013 11:21,3
    6/20/2013 11:22,3
    6/20/2013 11:23,3
    6/20/2013 11:24,3
    6/20/2013 11:25,4")
    df$DeviceTime <- as.POSIXct(df$DeviceTime, format="%m/%d/%Y %H:%M")
    
    cut(df$DeviceTime, breaks="5 min")
     [1] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:13:00
     [4] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:18:00
     [7] 2013-06-20 11:18:00 2013-06-20 11:18:00 2013-06-20 11:18:00
    [10] 2013-06-20 11:18:00 2013-06-20 11:23:00 2013-06-20 11:23:00
    [13] 2013-06-20 11:23:00
    
    means <- aggregate(df["Concentration"], 
                       list(fiveMin=cut(df$DeviceTime, "5 mins")),
                       mean)
    means
                  fiveMin Concentration
    1 2013-06-20 11:13:00      1.600000
    2 2013-06-20 11:18:00      2.600000
    3 2013-06-20 11:23:00      3.333333
    

    Notice that the first row of means (the 11:13:00 entry) is the mean of the first 5 rows of df, which have times of 11:13 to 11:17 -- i.e., up until just before the next cut/break point of 11:18.

    You'll get the same result with dplyr (i.e., @lukeA's answer) if you use cut():

    df %>%
      group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %>%
      summarize(Concentration = mean(Concentration))
    Source: local data frame [3 x 2]
    
               DeviceTime Concentration
    1 2013-06-20 11:13:00      1.600000
    2 2013-06-20 11:18:00      2.600000
    3 2013-06-20 11:23:00      3.333333
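
    If you would rather stay with cut(), one possible workaround is to pass explicit breakpoints that start on a 5-minute boundary instead of the string "5 min". The following is only a sketch: it rebuilds the sample data in UTC (an assumption, since the original uses the local time zone), and to_boundary and meansAligned are names made up for illustration.

```r
# Sample data from above, rebuilt in UTC (assumption) so results are reproducible.
df <- data.frame(
  DeviceTime = seq(as.POSIXct("2013-06-20 11:13", tz = "UTC"),
                   by = "1 min", length.out = 13),
  Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4)
)

width <- 5 * 60  # bin width in seconds

# Hypothetical helper: round a timestamp down to a 5-minute boundary.
to_boundary <- function(t) {
  as.POSIXct(floor(as.numeric(t) / width) * width,
             origin = "1970-01-01", tz = "UTC")
}

# Breakpoints from the boundary at or before the first reading
# to the boundary after the last reading.
breaks <- seq(to_boundary(min(df$DeviceTime)),
              to_boundary(max(df$DeviceTime)) + width,
              by = "5 min")

# Label each bin with its start time; intervals are [start, next start).
df$fiveMin <- cut(df$DeviceTime, breaks = breaks,
                  labels = format(breaks[-length(breaks)], "%Y-%m-%d %H:%M:%S"))

meansAligned <- aggregate(Concentration ~ fiveMin, df, mean)
meansAligned
```

    With the sample data this gives four groups starting at 11:10, 11:15, 11:20 and 11:25, rather than the three groups starting at 11:13.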
    

    The xts package seems to break by wall clock time:

    require(xts)
    df.xts <- xts(df$Concentration, df$DeviceTime)
    means.xts <- period.apply(df.xts, endpoints(df.xts, "mins", k=5), mean)
    means.xts
                        [,1]
    2013-06-20 11:14:00    1
    2013-06-20 11:19:00    2
    2013-06-20 11:24:00    3
    2013-06-20 11:25:00    4
    

    The time values are always the last time entry found in each 5-minute window. If you want to report the end of each period, you can round the time index up to the next 5-minute boundary with align.time():

    means.rounded <- align.time(means.xts, 5*60)
    means.rounded
                        [,1]
    2013-06-20 11:15:00    1
    2013-06-20 11:20:00    2
    2013-06-20 11:25:00    3
    2013-06-20 11:30:00    4
    

    You can also round down, if you want to report the times of the beginning of the periods. But you'll need to define your own function first (which I found on Cross Validated):

    align.time.down = function(x,n) {
        index(x) = index(x) - n
        align.time(x,n)
    }
    means.rounded.down <- align.time.down(means.xts, 5*60)
    means.rounded.down
                        [,1]
    2013-06-20 11:10:00    1
    2013-06-20 11:15:00    2
    2013-06-20 11:20:00    3
    2013-06-20 11:25:00    4
    

    Another solution, which uses floor() rather than the xts package, is as follows:

    df$DeviceTimeFloor <- as.POSIXct(floor(as.numeric(df$DeviceTime) / (5 * 60)) * (5 * 60), origin='1970-01-01')
    meansFloor <- aggregate(Concentration ~ DeviceTimeFloor, df, mean)
    meansFloor
          DeviceTimeFloor Concentration
    1 2013-06-20 11:10:00             1
    2 2013-06-20 11:15:00             2
    3 2013-06-20 11:20:00             3
    4 2013-06-20 11:25:00             4
    

    I prefer to report the start time of the 5-minute interval, and floor() is good for this. If I were to report aggregates by hour, I would expect a timestamp of 2013-06-20 11:00:00 to contain data for the period 11:00:00 - 11:59:59, not 10:00:00 - 10:59:59.

    If you prefer to report the end time of the intervals, ceiling() can be used instead of floor(). But note that timestamps 11:01 - 11:05 will be converted to (and hence grouped at) 11:05 by ceiling(). In contrast, floor() converts 11:00 - 11:04 to 11:00.

    So they each group a different set of observations. The xts package groups the same set of observations as floor(), but it reports the timestamp of the last observation in each period.
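
    For comparison, the ceiling() variant described above might look like the sketch below. It rebuilds the sample data in UTC (an assumption, since the original uses the local time zone), and DeviceTimeCeiling and meansCeiling are names chosen for illustration.

```r
# Sample data from above, rebuilt in UTC (assumption) so results are reproducible.
df <- data.frame(
  DeviceTime = seq(as.POSIXct("2013-06-20 11:13", tz = "UTC"),
                   by = "1 min", length.out = 13),
  Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4)
)

width <- 5 * 60  # bin width in seconds

# Round each timestamp up to the next 5-minute boundary; a timestamp
# already on a boundary (e.g. 11:15) stays where it is.
df$DeviceTimeCeiling <- as.POSIXct(
  ceiling(as.numeric(df$DeviceTime) / width) * width,
  origin = "1970-01-01", tz = "UTC"
)

meansCeiling <- aggregate(Concentration ~ DeviceTimeCeiling, df, mean)
meansCeiling
```

    Here 11:13 - 11:15 are grouped at 11:15, 11:16 - 11:20 at 11:20, and 11:21 - 11:25 at 11:25, so the groups and their means differ from the floor() result.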

  • 2020-12-01 06:35

    I'd say the easiest and cleanest way to do this is using the lubridate and dplyr packages.

    library(lubridate)  # for working with dates
    library(dplyr)      # for manipulating data
    
    df$DeviceTime5min <- floor_date(df$DeviceTime, "5 mins")
    df_5min <- df %>% group_by(DeviceTime5min) %>% summarize(Concentration = mean(Concentration))
    

    The only problem is that it works only for intervals that divide an hour evenly, that is: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, or 60 minutes. For these it works perfectly :-)
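
    For a width that does not fit into an hour, one possible fallback is plain arithmetic on the underlying seconds in base R. This is only a sketch: floor_minutes is a hypothetical helper, not a lubridate function, and it counts bins from the Unix epoch in UTC, so in time zones whose offset is not a multiple of the width the boundaries may not line up with local midnight.

```r
# Hypothetical helper (not part of lubridate): floor POSIXct timestamps
# to a bin width given in minutes, counting bins from the Unix epoch.
floor_minutes <- function(x, minutes) {
  width <- minutes * 60
  tz <- attr(x, "tzone")
  if (is.null(tz)) tz <- ""  # fall back to the local time zone
  as.POSIXct(floor(as.numeric(x) / width) * width,
             origin = "1970-01-01", tz = tz)
}

# Example timestamps in UTC (assumption), binned into 90-minute intervals.
times <- as.POSIXct(c("2013-06-20 11:13", "2013-06-20 11:59",
                      "2013-06-20 12:00", "2013-06-20 13:29"), tz = "UTC")
binned <- floor_minutes(times, 90)
format(binned, "%H:%M")
```

    The binned column can then be passed to group_by() in the same way as DeviceTime5min.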

  • 2020-12-01 06:37

    Using the dplyr package, and assuming your data is stored in a data frame named df:

    require(dplyr)
    df %>%
      group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %>%
      summarize(Concentration = mean(Concentration))
    