Question
I want to calculate a count of items over time using their Start and End dates.
Some sample data
START <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
END <- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
df <- data.frame(START,END)
df
gives
START END
1 2014-01-01 2014-01-04
2 2014-01-02 2014-01-03
3 2014-01-03 2014-01-03
4 2014-01-03 2014-01-04
A table showing a count of these items across time (based on their Start and End times) is as follows:
DATETIME COUNT
2014-01-01 1
2014-01-02 2
2014-01-03 4
2014-01-04 2
Can this be done using R, especially using dplyr? Many thanks.
Answer 1:
This would do it. You can change the column names as necessary.
as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
# Var1 Freq
# 1 2014-01-01 1
# 2 2014-01-02 2
# 3 2014-01-03 4
# 4 2014-01-04 2
As noted in the comments, Var1 in the above solution is now a factor, and not a date. To keep the date class in the first column, you could do some more work on the above solution, or use plyr::count instead of as.data.frame(table(...)):
library(plyr)
count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
# x freq
# 1 2014-01-01 1
# 2 2014-01-02 2
# 3 2014-01-03 4
# 4 2014-01-04 2
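For the first option mentioned above (doing some more work on the table-based solution), here is a minimal base R sketch. It assumes the sample data from the question; the names days and counts and the output column names DATETIME and COUNT are chosen here for illustration and are not part of the original answer.

```r
# Sample data from the question
START <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03"))
END   <- as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
df <- data.frame(START, END)

# Expand every row into its individual days; c() keeps the Date class
days <- do.call(c, Map(seq, df$START, df$END, by = 1))

# Tabulate, keep the labels as character instead of factor,
# then convert them back to Date
counts <- as.data.frame(table(days), stringsAsFactors = FALSE)
names(counts) <- c("DATETIME", "COUNT")
counts$DATETIME <- as.Date(counts$DATETIME)

str(counts)
```

This avoids the plyr dependency at the cost of one extra conversion step.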
Answer 2:
You could use data.table:
library(data.table)
DT <- setDT(df)[, list(DATETIME= seq(START, END, by=1)), by=1:nrow(df)][,
list(COUNT=.N), by=DATETIME]
DT
# DATETIME COUNT
#1: 2014-01-01 1
#2: 2014-01-02 2
#3: 2014-01-03 4
#4: 2014-01-04 2
From version 1.9.4 onward, you can also use the function foverlaps() to do an "overlap join". It's more efficient because it doesn't have to expand the dates for each row first and then count. Here's how:
require(data.table) ## 1.9.4
setDT(df) ## convert your data.frame to data.table by reference
## 1. Some preprocessing:
# create a lookup - the dates for which you need the count, and set key
dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
lookup = data.table(START=dates, END=dates, key=c("START", "END"))
## 2. Now find overlapping coordinates
# for each row in `df` get all the rows it overlaps with in `lookup`
ans = foverlaps(df, lookup, type="any", which=TRUE)
Now, we just have to group by yid (= indices in lookup) and count:
## 3. count
ans[, .N, by=yid]
# yid N
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 2
The first column corresponds to the row numbers in lookup. If some numbers are missing, then the count is 0 for them.
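To turn the yid indices back into dates and make the zero counts explicit, something like the following sketch could be used. It is an illustration rather than part of the original answer: the lookup range is deliberately extended to 2014-01-06 so that two days have no overlapping rows, nomatch = 0L is added so unmatched rows are dropped, and the names counts and result are made up here.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Lookup of single-day intervals; extended past the data (assumption)
# so that some days end up with a count of 0
dates  <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

# Indices of overlapping (row in df, row in lookup) pairs
ans    <- foverlaps(df, lookup, type = "any", which = TRUE, nomatch = 0L)
counts <- ans[, .N, by = yid]

# Start every date at 0, then fill in the observed counts by row index
result <- data.table(DATETIME = dates, COUNT = 0L)
result[counts$yid, COUNT := counts$N]
result
```

Starting from a zero-filled table keeps the dates with no overlapping rows in the output instead of silently dropping them.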
Answer 3:
Using dplyr and grouped data:
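library(dplyr)
# tibble() and as_tibble() replace the deprecated data_frame() and as_data_frame()
df <- tibble(
START = as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03")),
END = as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
)
# duplicate the sample data under two group labels
df <- as_tibble(rbind(cbind(group = 'a', df), cbind(group = 'b', df)))
df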
df %>%
group_by(.,group) %>%
do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
This is a common problem when, for example, you want to find the number of logins on different pages or machines, given time intervals per user.
> df
Source: local data frame [8 x 3]
group START END
(chr) (date) (date)
1 a 2014-01-01 2014-01-04
2 a 2014-01-02 2014-01-03
3 a 2014-01-03 2014-01-03
4 a 2014-01-03 2014-01-04
5 b 2014-01-01 2014-01-04
6 b 2014-01-02 2014-01-03
7 b 2014-01-03 2014-01-03
8 b 2014-01-03 2014-01-04
>
> df %>%
+ group_by(.,group) %>%
+ do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
Source: local data frame [8 x 3]
Groups: group [2]
group Var1 Freq
(chr) (fctr) (int)
1 a 2014-01-01 1
2 a 2014-01-02 2
3 a 2014-01-03 4
4 a 2014-01-04 2
5 b 2014-01-01 1
6 b 2014-01-02 2
7 b 2014-01-03 4
8 b 2014-01-04 2
Answer 4:
Using dplyr and foreach:
library(dplyr)
library(foreach)
df <- data.frame(START = as.Date(c("2014-01-01",
"2014-01-02",
"2014-01-03",
"2014-01-03")),
END = as.Date(c("2014-01-04",
"2014-01-03",
"2014-01-03",
"2014-01-04")))
df
r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
.combine = rbind) %do% {
df %>%
filter(DATETIME >= START & DATETIME <= END) %>%
summarise(DATETIME = DATETIME, COUNT = n())
}
r
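The loop above visits every day between the earliest START and the latest END and counts the rows whose interval contains that day. As a rough sketch of the same idea without extra packages (the names days and r2 are made up for this illustration), base R's vapply can do the per-day count:

```r
# Sample data from the question
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Every day covered by the data
days <- seq(min(df$START), max(df$END), by = 1)

# For each day, count the rows whose [START, END] interval contains it
r2 <- data.frame(
  DATETIME = days,
  COUNT = vapply(
    seq_along(days),
    function(i) sum(df$START <= days[i] & df$END >= days[i]),
    integer(1)
  )
)
r2
```

Like the foreach version, this does one pass over the data per day, so it suits short date ranges better than very long ones.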
Answer 5:
I proposed another lubridate-based solution, which is faster for larger data frames with wide date ranges, in a newer, related SO post here
Source: https://stackoverflow.com/questions/26290314/r-calculate-a-count-of-items-over-time-using-start-and-end-dates