Rolling Count of Events Over Time Series

后端 未结 2 700
悲&欢浪女
悲&欢浪女 2021-01-19 09:56

I\'m trying to calculate a rolling count/sum of occurrences by group over the series of a time frame.

I have a data frame with some sample data like this:

相关标签:
2条回答
  • 2021-01-19 10:38

    For this example, you can probably use sapply to analyze each row, counting the number of entries on that day or up to 4 days earlier, like so:

    df1$lastFour <-
      sapply(df1$dates, function(x){
        sum(df1$dates <= x & df1$dates >= x - 4)
      })
    

    Results in df1 of:

           dates group1 group2 lastFour
    1 2011-10-09      A      D        1
    2 2011-10-15      C      A        1
    3 2011-10-16      A      B        2
    4 2011-10-18      A      H        3
    5 2011-10-21      L      A        2
    6 2011-10-22      F      A        3
    7 2011-10-24      A      E        3
    

    If, as your question implies, your data are from a larger set and you want to do the analysis on each group (conceptually, I think the question is: how many events have had this group in the last four days? asked only on days with an event from that group), you could follow the steps below.

    First, here are some larger sample data with groups labelled as the first 10 letters of the alphabet:

    biggerData <-
      data.frame(
        dates = sample(seq(as.Date("2011-10-01")
                           , as.Date("2011-10-31")
                           , 1)
                       , 100, TRUE)
        , group1 = sample(LETTERS[1:10], 100, TRUE)
        , group2 = sample(LETTERS[1:10], 100, TRUE)
      )
    

    Next, I extract all of the groups in the data (here, I know them, but for your real data, you may or may not have that list of groups already)

    groupsInData <-
      sort(unique(c(as.character(biggerData$group1)
                    , as.character(biggerData$group2))))
    

    Then, I loop through that vector of group names and extract each of the events with that group as one of the two groups, adding the same column as above, and saving the separate data.frames in a list (and naming them to make it easier to access/track them).

    sepGroupCounts <- lapply(groupsInData, function(thisGroup){
      dfTemp <- biggerData[biggerData$group1 == thisGroup | 
                             biggerData$group2 == thisGroup, ]
    
      dfTemp$lastFour <-
        sapply(dfTemp$dates, function(x){
          sum(dfTemp$dates <= x & dfTemp$dates >= x - 4)
        })
      return(dfTemp)
    
    }) 
    
    names(sepGroupCounts) <- groupsInData
    

    returns a data.frame just like above for each of the groups in your data.

    And, I couldn't help myself, so here is a dplyr and tidyr solution as well. It is not much different than the list-based solution above, except that it returns everything in the same data.frame (which may or may not be a good thing, particularly as it will have two entries for each event this way).

    First, for simplicity, I defined a function to do the date checking. This could easily be used above as well.

    myDateCheckFunction <- function(x){
      sapply(x, function(thisX){
        sum(x <= thisX & x >= thisX - 4 )
      })
    }
    

    Next, I am constructing a set of logical tests that will determine whether or not each of the groups is present. These will be used to generate columns for each group, giving TRUE/FALSE for present/absent in each event.

    dotsConstruct <-
      paste0("group1 == '", groupsInData, "' | "
             , "group2 == '", groupsInData, "'") %>%
      setNames(groupsInData)
    

    Finally, putting it altogether in one piped call. Instead of describing, I have commented each step.

    withLastFour <-
      # Start with data
      biggerData %>%
      # Add a col for each group using Standard Evaluation
      mutate_(.dots = dotsConstruct) %>%
      # convert to long form; one row per group per event
      gather(GroupAnalyzed, Present, -dates, -group1, -group2) %>%
      # Limit to only rows where the `GroupAnalyzed` is present
      filter(Present) %>%
      # Remove the `Present` column, as it is now all "TRUE"
      select(-Present) %>%
      # Group by the groups we are analyzing
      group_by(GroupAnalyzed) %>%
      # Add the column for count in the last four dates
      # `group_by` limits this to just counts within that group
      mutate(lastFour = myDateCheckFunction(dates)) %>%
      # Sort by group and date for prettier checking
      arrange(GroupAnalyzed, dates)
    

    The result is similar to the above list output, except with everything in one data.frame, which may allow for easier analysis of some features. The top looks like this:

           dates group1 group2 GroupAnalyzed lastFour
          <date> <fctr> <fctr>         <chr>    <int>
    1 2011-10-01      B      A             A        1
    2 2011-10-02      J      A             A        2
    3 2011-10-05      C      A             A        5
    4 2011-10-05      C      A             A        5
    5 2011-10-05      G      A             A        5
    6 2011-10-08      E      A             A        5
    

    Note that my random sample had multiple events on Oct-05, leading to the large counts here.

    0 讨论(0)
  • 2021-01-19 10:41

    I think, but am not sure, that you're looking for a way to count occurrences of each event type (letter) on each date (row) and the preceding four days, whether or not those preceding four days appear in your data. If that's right, then here's one approach using dplyr (for general convenience), tidyr (to make the wide data long for easier counting by date), and zoo (for its rollapply function).

    library(dplyr)
    library(tidyr)
    library(zoo)
    
    df2 <- df1 %>%
      # make the wide data long so we can group and then count by date
      gather(key = group, value = event, group1:group2) %>%
      # group by date
      group_by(dates) %>%
      # count occurrences of the event of interest on each date
      summarise(sum.a = sum(event == "A")) %>%
      # join that set of counts to a complete date sequence
      left_join(data.frame(dates = seq(first(dates), last(dates), by = "day")), .) %>%
      # use rollapply to get sums of those counts across rolling windows that
      # are 4 days wide and right-aligned
      mutate(sum.a = rollapply(sum.a, width = 4, sum, na.rm = TRUE,
                                   partial = TRUE, align = "right")) %>%
      # filter back to the original set of dates in df1
      filter(dates %in% df1$dates)
    

    Result:

    > df2
           dates sum.a
    1 2011-10-09     1
    2 2011-10-15     1
    3 2011-10-16     2
    4 2011-10-18     3
    5 2011-10-21     2
    6 2011-10-22     2
    7 2011-10-24     3
    
    0 讨论(0)
提交回复
热议问题