R: Calculate a count of items over time using start and end dates

Question

I want to calculate a count of items over time using their Start and End dates.

Some sample data

START <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
END <- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
df <- data.frame(START,END)
df

gives

       START        END
1 2014-01-01 2014-01-04
2 2014-01-02 2014-01-03
3 2014-01-03 2014-01-03
4 2014-01-03 2014-01-04

A table showing a count of these items across time (based on their Start and End times) is as follows:

DATETIME    COUNT
2014-01-01   1 
2014-01-02   2 
2014-01-03   4 
2014-01-04   2

Can this be done using R, especially using dplyr? Many thanks.

Answer 1:

This would do it. You can change the column names as necessary.

as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
#         Var1 Freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2

As noted in the comments, Var1 in the above solution is a factor, not a Date. To keep the Date class in the first column, you could post-process the result, or use plyr::count instead of as.data.frame(table(...)):

library(plyr)
count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
#            x freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2
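The "more work" mentioned above for keeping the Date class can be sketched in base R. This is only one possible approach, not the answerer's own code: the names all_days and counts are chosen here for illustration, and the output columns are renamed to match the question.

```r
START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
END   <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
df <- data.frame(START, END)

# expand every START..END interval into its individual days
# (all_days is a hypothetical helper name)
all_days <- do.call(c, Map(seq, df$START, df$END, by = 1))

# tabulate, keeping the labels as character rather than factor
counts <- as.data.frame(table(all_days), stringsAsFactors = FALSE)
names(counts) <- c("DATETIME", "COUNT")

# convert the character labels back to Date
counts$DATETIME <- as.Date(counts$DATETIME)
str(counts)
```

Days on which no item is active do not appear in the result, because table() only counts values that occur.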

Answer 2:

You could use data.table

library(data.table)
# expand each row into one row per day, then count rows per day
DT <- setDT(df)[, list(DATETIME = seq(START, END, by = 1)), by = 1:nrow(df)][,
                  list(COUNT = .N), by = DATETIME]
DT
#      DATETIME COUNT
# 1: 2014-01-01     1
# 2: 2014-01-02     2
# 3: 2014-01-03     4
# 4: 2014-01-04     2

From version 1.9.4+, you can also use the function foverlaps() to do an "overlap join". It's more efficient as it doesn't have to expand the dates for each row first, and then count. Here's how:

require(data.table) ## 1.9.4
setDT(df) ## convert your data.frame to data.table by reference

## 1. Some preprocessing:
# create a lookup - the dates for which you need the count, and set key
dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
lookup = data.table(START=dates, END=dates, key=c("START", "END"))

## 2. Now find overlapping coordinates 
# for each row in `df` get all the rows it overlaps with in `lookup`
ans = foverlaps(df, lookup, type="any", which=TRUE)

Now, we just have to group by yid (= indices in lookup) and count:

## 3. count
ans[, .N, by=yid]
#    yid N
# 1:   1 1
# 2:   2 2
# 3:   3 4
# 4:   4 2

The first column corresponds to the row numbers in lookup. If some numbers are missing, then the count is 0 for them.
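To turn those row indices back into dates and fill in zero counts, something like the following might work. It is a sketch under assumptions: result and counts are names introduced here, and the lookup range is the same hard-coded one as above.

```r
library(data.table)

df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# one-day intervals to count against, keyed as foverlaps() requires
dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

# pairs of (row in df, row in lookup) that overlap
ans <- foverlaps(df, lookup, type = "any", which = TRUE)

# count per lookup row, skipping rows of df that matched nothing
counts <- ans[!is.na(yid), .N, by = yid]

# start from a count of 0 for every date, then fill in the matches
result <- data.table(DATETIME = dates, COUNT = 0L)
result[counts$yid, COUNT := counts$N]
result
```

Starting from a table of zeros means dates with no overlapping item stay in the output instead of disappearing.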

Answer 3:

Using dplyr and grouped data:

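library(dplyr)

# tibble() and as_tibble() replace the deprecated data_frame() and as_data_frame()
df <- tibble(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
# duplicate the data under two group labels, 'a' and 'b'
df <- rbind(cbind(group = 'a', df), cbind(group = 'b', df)) %>% as_tibble()
df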

df %>% 
  group_by(.,group) %>% 
  do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))

This is a common problem: for example, counting the number of logins on different pages or machines, given time intervals per user. The session output:

> df
Source: local data frame [8 x 3]

  group      START        END
  (chr)     (date)     (date)
1     a 2014-01-01 2014-01-04
2     a 2014-01-02 2014-01-03
3     a 2014-01-03 2014-01-03
4     a 2014-01-03 2014-01-04
5     b 2014-01-01 2014-01-04
6     b 2014-01-02 2014-01-03
7     b 2014-01-03 2014-01-03
8     b 2014-01-03 2014-01-04
> 
> df %>% 
+   group_by(.,group) %>% 
+   do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
Source: local data frame [8 x 3]
Groups: group [2]

  group       Var1  Freq
  (chr)     (fctr) (int)
1     a 2014-01-01     1
2     a 2014-01-02     2
3     a 2014-01-03     4
4     a 2014-01-04     2
5     b 2014-01-01     1
6     b 2014-01-02     2
7     b 2014-01-03     4
8     b 2014-01-04     2

Answer 4:

Using dplyr and foreach:

library(dplyr)
library(foreach)

df <- data.frame(START = as.Date(c("2014-01-01",
                                   "2014-01-02",
                                   "2014-01-03",
                                   "2014-01-03")),
                 END = as.Date(c("2014-01-04",
                                 "2014-01-03",
                                 "2014-01-03",
                                 "2014-01-04")))
df

# for each day in the overall range, count the rows whose interval contains it
r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
             .combine = rbind) %do% {
  df %>%
    filter(DATETIME >= START & DATETIME <= END) %>%
    summarise(DATETIME = DATETIME, COUNT = n())
}
r
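The same per-day filter can also be sketched in base R, without foreach or dplyr. This is an illustrative alternative rather than the answerer's code; days and counts are names introduced here.

```r
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# every day between the earliest START and the latest END
days <- seq(min(df$START), max(df$END), by = 1)

# for each day, count the rows whose interval contains it
counts <- vapply(
  seq_along(days),
  function(i) sum(df$START <= days[i] & days[i] <= df$END),
  integer(1)
)

r <- data.frame(DATETIME = days, COUNT = counts)
r
```

Like the foreach version, this compares every day against every row, so days with no active items are reported with a count of 0.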

Answer 5:

I proposed another lubridate-based solution, which is faster for larger data frames with wide date ranges, in a newer, related Stack Overflow post.

Source: https://stackoverflow.com/questions/26290314/r-calculate-a-count-of-items-over-time-using-start-and-end-dates

Tags

duration

dplyr