R: Calculate a count of items over time using start and end dates

隐瞒了意图╮ 2021-01-05 17:22

I want to calculate a count of items over time using their Start and End dates.

Some sample data

START <- as.Date(c("2014-01-01", "2014-01-02" …


        
5 answers
  • 2021-01-05 17:59

    Using dplyr and grouped data:

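    library(dplyr)

    # Sample intervals; tibble()/as_tibble() replace the deprecated data_frame()/as_data_frame()
    df <- tibble(
      START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
      END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
    )
    # Stack the same intervals under two groups, 'a' and 'b'
    df <- as_tibble(rbind(cbind(group = "a", df), cbind(group = "b", df)))
    df

    # Per group: expand each interval into its days, then tabulate the days
    df %>%
      group_by(group) %>%
      do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))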
    

    This is a common problem when, for example, you want to count logins on different pages or machines, given a time interval for each user.

    > df
    Source: local data frame [8 x 3]
    
      group      START        END
      (chr)     (date)     (date)
    1     a 2014-01-01 2014-01-04
    2     a 2014-01-02 2014-01-03
    3     a 2014-01-03 2014-01-03
    4     a 2014-01-03 2014-01-04
    5     b 2014-01-01 2014-01-04
    6     b 2014-01-02 2014-01-03
    7     b 2014-01-03 2014-01-03
    8     b 2014-01-03 2014-01-04
    > 
    > df %>% 
    +   group_by(.,group) %>% 
    +   do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
    Source: local data frame [8 x 3]
    Groups: group [2]
    
      group       Var1  Freq
      (chr)     (fctr) (int)
    1     a 2014-01-01     1
    2     a 2014-01-02     2
    3     a 2014-01-03     4
    4     a 2014-01-04     2
    5     b 2014-01-01     1
    6     b 2014-01-02     2
    7     b 2014-01-03     4
    8     b 2014-01-04     2
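
    For the login-style use case, a similar grouped count can be sketched with a list column and tidyr::unnest(). This is not from the answer above: it assumes the tidyr package is available, and the names `intervals`, `daily_counts`, `DATETIME` and `COUNT` are only illustrative.

    ```r
    library(dplyr)
    library(tidyr)

    # Same sample intervals as above, repeated for groups 'a' and 'b'
    intervals <- tibble(
      group = rep(c("a", "b"), each = 4),
      START = as.Date(rep(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"), 2)),
      END   = as.Date(rep(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"), 2))
    )

    daily_counts <- intervals %>%
      mutate(DATETIME = Map(seq, START, END, by = 1)) %>%  # one Date vector per row
      unnest(DATETIME) %>%                                  # one row per covered day
      count(group, DATETIME, name = "COUNT")                # intervals per group and day

    daily_counts
    ```

    Unlike the table() route, the DATETIME column stays a Date rather than becoming a factor.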
    
  • 2021-01-05 18:05

    This would do it. You can change the column names as necessary.

    as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
    #         Var1 Freq
    # 1 2014-01-01    1
    # 2 2014-01-02    2
    # 3 2014-01-03    4
    # 4 2014-01-04    2
    

    As noted in the comments, Var1 in the above solution is a factor, not a Date. To keep the Date class in the first column, you could convert Var1 back to a Date afterwards, or use plyr::count instead of as.data.frame(table(...)):

    library(plyr)
    count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
    #            x freq
    # 1 2014-01-01    1
    # 2 2014-01-02    2
    # 3 2014-01-03    4
    # 4 2014-01-04    2
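
    Converting the tabulated labels back to Dates in base R might look like the sketch below. It assumes the question's sample data, and the names `days`, `counts`, `DATETIME` and `COUNT` are hypothetical.

    ```r
    df <- data.frame(
      START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
      END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
    )

    # One Date vector holding every day covered by every interval
    days <- Reduce(c, Map(seq, df$START, df$END, by = 1))

    # Tabulate, keeping the labels as character rather than factor
    counts <- as.data.frame(table(days), stringsAsFactors = FALSE)
    names(counts) <- c("DATETIME", "COUNT")

    # Turn the character labels back into Dates
    counts$DATETIME <- as.Date(counts$DATETIME)
    counts
    ```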
    
  • 2021-01-05 18:09

    You could use data.table

    library(data.table)
    # Expand each row into its days, then count the rows per day
    DT <- setDT(df)[, list(DATETIME = seq(START, END, by = 1)), by = 1:nrow(df)][,
                    list(COUNT = .N), by = DATETIME]
    DT
    #      DATETIME COUNT
    # 1: 2014-01-01     1
    # 2: 2014-01-02     2
    # 3: 2014-01-03     4
    # 4: 2014-01-04     2
    

    From version 1.9.4+, you can also use the function foverlaps() to do an "overlap join". It's more efficient as it doesn't have to expand the dates for each row first, and then count. Here's how:

    require(data.table) ## 1.9.4
    setDT(df) ## convert your data.frame to data.table by reference
    
    ## 1. Some preprocessing:
    # create a lookup - the dates for which you need the count, and set key
    dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
    lookup = data.table(START=dates, END=dates, key=c("START", "END"))
    
    ## 2. Now find overlapping coordinates 
    # for each row in `df` get all the rows it overlaps with in `lookup`
    ans = foverlaps(df, lookup, type="any", which=TRUE)
    

    Now, we just have to group by yid (= indices in lookup) and count:

    ## 3. count
    ans[, .N, by=yid]
    #    yid N
    # 1:   1 1
    # 2:   2 2
    # 3:   3 4
    # 4:   4 2
    

    The first column corresponds to the row numbers in lookup. If some numbers are missing, then the count is 0 for them.
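
    To see the zero-count case, the sketch below maps yid back to the lookup dates and fills unmatched dates with 0. It assumes the question's sample data and extends the lookup by one extra day (2014-01-05) purely for illustration; the names `counts`, `result`, `DATETIME` and `COUNT` are hypothetical.

    ```r
    library(data.table)

    df <- data.table(
      START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
      END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
    )

    # Lookup covers one day more than the data, so the last date has no overlap
    dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-05"), by = "days")
    lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

    ans    <- foverlaps(df, lookup, type = "any", which = TRUE)
    counts <- ans[!is.na(yid), .N, by = yid]

    # Start every lookup date at 0, then fill in the counts by row index
    result <- data.table(DATETIME = dates, COUNT = 0L)
    result[counts$yid, COUNT := counts$N]
    result
    ```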

  • 2021-01-05 18:20

    I proposed another, lubridate-based solution, which is faster for larger data frames with wide date ranges, in a newer, related SO post.

  • 2021-01-05 18:22

    Using dplyr and foreach:

    library(dplyr)
    library(foreach)
    
    df <- data.frame(START = as.Date(c("2014-01-01",
                                       "2014-01-02",
                                       "2014-01-03",
                                       "2014-01-03")),
                     END = as.Date(c("2014-01-04",
                                     "2014-01-03",
                                     "2014-01-03",
                                     "2014-01-04")))
    df
    
    r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
                 .combine = rbind) %do% {
      df %>%
        filter(DATETIME >= START & DATETIME <= END) %>%
        summarise(DATETIME, COUNT = n())
    }
    r
    