optimized rolling functions on irregular time series with time-based window

情深已故 2020-12-01 09:42

Is there some way to use rollapply (from the zoo package or something similar) or its optimized functions (rollmean, rollmedian, etc.) to compute a rolling function over a time-based window on an irregular time series, i.e. for each element a window covering all observations up to N days before it?

5 Answers
  • 2020-12-01 09:46

    Most of the answers suggest inserting NA values to make the time series regular. However, this can be slow for long time series, and it does not work for functions that cannot handle NA values.

    The width argument of rollapply (zoo package) can be a list (see the help of rollapply for details). Based on this, I wrote a function that creates a list to be used as the width parameter of rollapply. The function extracts indexes for irregular zoo objects when the moving window should be time-based rather than index-based. The index of the zoo object should therefore represent the actual time.

    library(zoo)
    
    # Create a zoo object whose index represents time (e.g. in seconds)
    d <- zoo(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 16, 25, 27, 27, 27, 27, 27, 31),
             c(1:5, 11:15, 16, 25:30, 31))
    
    # Create a function that builds a list of index offsets to be passed to
    # rollapply() as its width argument: each entry holds the offsets of the
    # observations falling into one time-based window, relative to the
    # (rounded) median index of that window
    createRollapplyWidth <- function(zoodata, steps, window) {
      mintime   <- min(time(zoodata))
      maxtime   <- max(time(zoodata))
      spotstime <- seq(from = mintime, to = maxtime, by = steps)
    
      # indexes of the observations falling into each time window
      spotsindex <- list()
      for (i in 1:length(spotstime)) {
        spotsindex[[i]] <- as.numeric(which(spotstime[i] <= time(zoodata) &
                                            time(zoodata) < spotstime[i] + window))
      }
    
      # convert the indexes into offsets relative to the rounded median index
      rollapplywidth <- list()
      for (i in 1:length(spotsindex)) {
        if (!is.na(median(spotsindex[[i]]))) {
          rollapplywidth[[round(median(spotsindex[[i]]))]] <-
            spotsindex[[i]] - round(median(spotsindex[[i]]))
        }
      }
      rollapplywidth
    }
    
    # Create the width parameter for rollapply using the function
    rollwidth <- createRollapplyWidth(zoodata = d, steps = 5, window = 5)
    
    # Use the parameter in rollapply
    result <- rollapply(d, width = rollwidth, FUN = sum, na.rm = TRUE)
    result
    

    Limitation: this is based on numeric time (e.g. in seconds) rather than on dates. The "partial" parameter of rollapply does not work here.
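
    One possible workaround for Date-indexed data: a Date is stored internally as the number of days since 1970-01-01, so the index can be coerced to numeric before building the width list. A rough sketch along those lines (an assumption on my part, not part of the original helper):

    # Sketch: coerce a Date index to numeric day counts so the helper can
    # work with a window expressed in days
    z_date <- zoo(c(5, 1, 0, 3),
                  as.Date(c("2011-11-01", "2011-11-08", "2011-11-13", "2011-12-05")))
    z_num  <- zoo(coredata(z_date), as.numeric(index(z_date)))
    rollwidth_days <- createRollapplyWidth(zoodata = z_num, steps = 1, window = 5)
    rollapply(z_num, width = rollwidth_days, FUN = median, na.rm = TRUE)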

  • 2020-12-01 09:59

    I recommend the runner package, which is optimized for the operation requested in this question. See the section "Windows depending on date" in the documentation for further explanation.

    To solve this task, one can use the runner function, which can execute any R function in running windows. A one-liner:

    df <- read.table(
      text = "date  value
       2011-11-01      5
       2011-11-01      4
       2011-11-01      2
       2011-11-08      1
       2011-11-13      0
       2011-11-14      0
       2011-11-15      0
       2011-11-18      1
       2011-11-21      4
       2011-12-05      3", header = TRUE, colClasses = c("Date", "integer"))
    
    library(runner)
    runner(df$value, k = 5, idx = df$date, f = median)
    [1] 5.0 4.5 4.0 1.0 0.0 0.0 0.0 0.0 2.5 3.0
    

    P.S. be aware that a 5-day window covers [i-4, i-3, i-2, i-1, i] rather than (i-5):i, which would be a 6-day window.
    The example above uses a 5-day window, but to reproduce the result the OP requested, specify a 6-day window:

    identical(
      runner(df$value, k = 6, idx = df$date, f = median),
      c(5.0, 4.5, 4.0, 1.0, 0.5, 0.0, 0.0, 0.0, 2.5, 3.0)
    )
    # [1] TRUE
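
    If memory serves, runner also accepts k as a period string (interpreted like the by argument of seq.POSIXt) when idx holds dates; treat this as an assumption to verify against ?runner::runner:

    # Assumption: k may be given as a period string such as "6 days";
    # check ?runner before relying on this form.
    runner(df$value, k = "6 days", idx = df$date, f = median)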
    
  • 2020-12-01 10:03

    As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to perform non-equi joins, which can be used here.

    The OP has requested

    for each element in an irregular time series, I want to compute a rolling function with a N-days window. That is, the window should include all the observations up to N days before the current observation. Time series may also contain duplicates.

    Note that the OP has requested to include all the observations up to N days before the current observation. This is different from requesting all the observations up to N days before the current day.

    For the latter, I would expect one value for 1/11/2011, i.e., median(c(5, 4, 2)) = 4.

    Apparently, the OP expects an observation-based rolling window which is limited to N days. Therefore, the join conditions of the non-equi join have to consider the row number as well.

    library(data.table)
    n_days <- 5L
    # number the rows, then non-equi join each row to all rows at or before it
    # whose date falls within the preceding n_days, taking the median per row
    setDT(DT)[, rn := .I][
      .(ur = rn, ud = date, ld = date - n_days), 
      on = .(rn <= ur, date <= ud, date >= ld),
      median(as.double(value)), by = .EACHI]$V1
    
    [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
    

    For the sake of completeness, a possible solution for the day-based rolling window could be:

    setDT(DT)[.(ud = unique(date), ld = unique(date) - n_days), on = .(date <= ud, date >= ld), 
       median(as.double(value)), by = .EACHI]
    
             date       date  V1
    1: 2011-11-01 2011-10-27 4.0
    2: 2011-11-08 2011-11-03 1.0
    3: 2011-11-13 2011-11-08 0.5
    4: 2011-11-14 2011-11-09 0.0
    5: 2011-11-15 2011-11-10 0.0
    6: 2011-11-18 2011-11-13 0.0
    7: 2011-11-21 2011-11-16 2.5
    8: 2011-12-05 2011-11-30 3.0
    

    Data

    library(data.table)
    DT <- fread("      date  value
     1/11/2011      5
     1/11/2011      4
     1/11/2011      2
     8/11/2011      1
    13/11/2011      0
    14/11/2011      0
    15/11/2011      0
    18/11/2011      1
    21/11/2011      4
     5/12/2011      3")[
       # coerce date from character string to integer date class
       , date := as.IDate(date, "%d/%m/%Y")]
    
  • Here is my tinkering with the problem. If this sort of gets at what you want (I don't know whether it is satisfactory in terms of speed), I can write it up as a more detailed answer (even though it is based on @rbatt's idea).

    library(zoo)
    library(dplyr)
    
    # create a long time series
    start <- as.Date("1800-01-01")
    end <- as.Date(Sys.Date())
    
    df <- data.frame(V1 = seq.Date(start, end, by = "day"))
    df$V2 <- sample(1:10, nrow(df), replace = T)
    
    # make it an irregular time series by sampling 10000 rows
    # including allowing for duplicates (replace = T)
    df2 <- df %>% 
      sample_n(10000, replace = T)
    
    # create 'complete' time series & join the data & compute the rolling median
    df_rollmed <- data.frame(V1 = seq.Date(min(df$V1), max(df$V1), by = "day")) %>% 
      left_join(., df2) %>% 
      mutate(rollmed = rollapply(V2, 5, median, na.rm = T, align = "right", partial = T)) %>% 
      filter(!is.na(V2)) # throw out the NAs from the complete dataset
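
    Regarding the open speed question, a rough and machine-dependent way to time the join-and-roll step with base R's system.time (a sketch, not a proper benchmark):

    # Rough timing of the complete-series join + rollapply step;
    # results will vary with machine and sample size
    system.time({
      data.frame(V1 = seq.Date(min(df$V1), max(df$V1), by = "day")) %>%
        left_join(., df2, by = "V1") %>%
        mutate(rollmed = rollapply(V2, 5, median, na.rm = TRUE, align = "right", partial = TRUE)) %>%
        filter(!is.na(V2))
    })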
    
  • 2020-12-01 10:11

    1) rollapply I haven't checked the speed, but if no date has more than max.dup occurrences, then the last 5 * max.dup entries must contain the last 5 days, so the one-line function fn shown below, passed to rollapplyr, will do it:

    library(zoo)
    
    k <- 5
    
    dates <- as.numeric(DF$date)
    values <- DF$value
    
    max.dup <- max(table(dates))
    
    fn <- function(ix, d = dates[ix], v = values[ix], n = length(ix)) median(v[d >= d[n]-k])
    
    rollapplyr(1:nrow(DF), max.dup * k, fn, partial = TRUE)
    ## [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
    

    2) sqldf We can use an SQL self-join to do this. We join to each a row the b rows that lie no more than k days back and no later in row order, and then group by the a row, taking the median of the b rows joined to it.

    library(sqldf)
    
    k <- 5
    # fn$ (from gsubfn, which sqldf loads) substitutes the value of $k into the SQL string
    res <- fn$sqldf("select a.date, a.value, median(b.value) median
           from DF a
           left join DF b on b.date between a.date - $k and a.date and b.rowid <= a.rowid
           group by a.rowid")
    

    giving:

    res$median
    ## [1] 5.0 4.5 4.0 1.0 0.5 0.0 0.0 0.0 2.5 3.0
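
    Both outputs above agree on this data; a quick consistency check between the two approaches (using all.equal since both results are plain numeric vectors):

    # Check that the rollapplyr and sqldf approaches give the same medians
    all.equal(as.numeric(rollapplyr(1:nrow(DF), max.dup * k, fn, partial = TRUE)),
              res$median)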
    

    Note: We used this for DF:

     Lines <- "
          date  value
     1/11/2011      5
     1/11/2011      4
     1/11/2011      2
     8/11/2011      1
    13/11/2011      0
    14/11/2011      0
    15/11/2011      0
    18/11/2011      1
    21/11/2011      4
     5/12/2011      3
    "
    DF <- read.table(text = Lines, header = TRUE)
    DF$date <- as.Date(DF$date, format = "%d/%m/%Y")
    