Question
Introduction
I am using R to analyze the 'momentum' of protest movements in Africa. To do so, I am analyzing individual protest events. I want to create a rolling count (sum) of the protests that fall within a time period.
Most of the answers here on Stack Overflow deal with datasets where observations are at fixed intervals (one obs. per day or per month, etc.). But my data are 'ragged' in the sense that observations occur at irregular intervals. Sometimes there is one day between observations. Other times there are two weeks.
What I want to create
A rolling sum of the number of protest events that have occurred in a given country over the past 10 days. This would be in the form of a variable that simply sums the number of events within the past ten days, inclusive of the current event.
The Data
Here is a reproducible set of data:
df1 <- data.frame(date = c("8/1/2019", "8/2/2019", "8/3/2019", "8/6/2019", "8/15/2019", "8/16/2019", "8/30/2019", "9/1/2019", "9/2/2019", "9/3/2019", "9/4/2019", "6/1/2019", "6/26/2019", "7/1/2019", "7/2/2019", "7/9/2019", "7/10/2019", "8/1/2019", "8/2/2019", "8/15/2019", "8/28/2019", "9/1/2019"),
country = c(rep("Algeria", 11), rep("Benin", 11)),
event = rep("Protest", 22))
What I want the data to look like
date country event roll_sum
-------- ------- ------- --------
8/1/2019 Algeria Protest 1
8/2/2019 Algeria Protest 2
8/3/2019 Algeria Protest 3
8/6/2019 Algeria Protest 4
8/15/2019 Algeria Protest 2
8/16/2019 Algeria Protest 3
8/30/2019 Algeria Protest 1
9/1/2019 Algeria Protest 2
9/2/2019 Algeria Protest 3
9/3/2019 Algeria Protest 4
9/4/2019 Algeria Protest 5
6/1/2019 Benin Protest 1
6/26/2019 Benin Protest 1
7/1/2019 Benin Protest 2
7/2/2019 Benin Protest 3
7/9/2019 Benin Protest 3
7/10/2019 Benin Protest 4
8/1/2019 Benin Protest 1
8/2/2019 Benin Protest 2
8/15/2019 Benin Protest 1
8/28/2019 Benin Protest 1
9/1/2019 Benin Protest 2
This is all probably very simple, but I can't figure out how to do it. Thank you in advance!
Answer 1:
A base R approach:
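df1$date <- as.Date(df1$date, "%m/%d/%Y")
roll_sum <- integer(0)
# Loop over countries; this assumes the rows of df1 are already grouped by country
for (j in unique(df1$country)) {
  df2 <- df1[df1$country == j, ]
  for (i in seq_len(nrow(df2))) {
    # Count events dated within the 10 days up to and including event i
    k <- sum(df2$date <= df2$date[i] & df2$date >= df2$date[i] - 10)
    roll_sum <- c(roll_sum, k)
  }
}
df1$roll_sum <- roll_sum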
gives,
date country event roll_sum
1 2019-08-01 Algeria Protest 1
2 2019-08-02 Algeria Protest 2
3 2019-08-03 Algeria Protest 3
4 2019-08-06 Algeria Protest 4
5 2019-08-15 Algeria Protest 2
6 2019-08-16 Algeria Protest 3
7 2019-08-30 Algeria Protest 1
8 2019-09-01 Algeria Protest 2
9 2019-09-02 Algeria Protest 3
10 2019-09-03 Algeria Protest 4
11 2019-09-04 Algeria Protest 5
12 2019-06-01 Benin Protest 1
13 2019-06-26 Benin Protest 1
14 2019-07-01 Benin Protest 2
15 2019-07-02 Benin Protest 3
16 2019-07-09 Benin Protest 3
17 2019-07-10 Benin Protest 4
18 2019-08-01 Benin Protest 1
19 2019-08-02 Benin Protest 2
20 2019-08-15 Benin Protest 1
21 2019-08-28 Benin Protest 1
22 2019-09-01 Benin Protest 2
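The loop above appends counts country by country, so it relies on the rows already being grouped by country. As a minimal sketch of the same 10-day inclusive count that does not depend on row order, the per-country counting can be handed to ave. The toy data frame df and the helper count_window are made up for illustration and are not part of the original answer.

```r
# Hypothetical toy data with the countries interleaved on purpose
df <- data.frame(
  date = as.Date(c("8/1/2019", "6/1/2019", "8/2/2019", "8/15/2019"), "%m/%d/%Y"),
  country = c("Algeria", "Benin", "Algeria", "Algeria")
)

# Hypothetical helper: for each date, count the dates in the same group
# that fall within the preceding `days` days, current date included
count_window <- function(d, days = 10) {
  vapply(d, function(y) sum(d >= y - days & d <= y), integer(1))
}

# ave() applies the helper within each country and returns the results
# in the original row order
df$roll_sum <- ave(as.numeric(df$date), df$country, FUN = count_window)
df
```

Because ave puts each result back in the position of its input row, the counts stay aligned with the data even when the countries are not contiguous.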
Answer 2:
Use lubridate to convert the date strings into dates and create intervals with the interval function. %within% is a lubridate function that returns whether each element of a date vector falls within the interval.
Create a dates column which on each row is a list that stores all dates for that country. Then use purrr::pmap_dbl() (the typed variant of purrr::pmap() that returns a numeric vector) to iterate over all rows in the modified data frame.
library(lubridate)
library(dplyr)
library(purrr)
df1 <- data.frame(date = c("8/1/2019", "8/2/2019", "8/3/2019", "8/6/2019", "8/15/2019", "8/16/2019", "8/30/2019", "9/1/2019", "9/2/2019", "9/3/2019", "9/4/2019", "6/1/2019", "6/26/2019", "7/1/2019", "7/2/2019", "7/9/2019", "7/10/2019", "8/1/2019", "8/2/2019", "8/15/2019", "8/28/2019", "9/1/2019"),
country = c(rep("Algeria", 11), rep("Benin", 11)),
event = rep("Protest", 22))
df2 <- df1 %>%
mutate(
date = mdy(date),
interval = interval(date -days(10),date)
) %>%
group_by(country) %>%
mutate(dates = list(date)) %>%
ungroup()
df2["roll_sum"] <- pmap_dbl(df2,function(...){
values <- list(...)
sum(values$dates %within% values$interval)
})
df2 %>%
select(-interval,-dates)
# A tibble: 22 x 4
date country event roll_sum
<date> <fct> <fct> <dbl>
1 2019-08-01 Algeria Protest 1
2 2019-08-02 Algeria Protest 2
3 2019-08-03 Algeria Protest 3
4 2019-08-06 Algeria Protest 4
5 2019-08-15 Algeria Protest 2
6 2019-08-16 Algeria Protest 3
7 2019-08-30 Algeria Protest 1
8 2019-09-01 Algeria Protest 2
9 2019-09-02 Algeria Protest 3
10 2019-09-03 Algeria Protest 4
# ... with 12 more rows
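To see what %within% does on its own, here is a small sketch using four of the Algeria dates from the question. The variable names dates, window and inside are chosen for illustration; the window is the 10-day look-back ending on 16 August 2019.

```r
library(lubridate)

# Four of the Algeria dates from the question
dates <- mdy(c("8/1/2019", "8/6/2019", "8/15/2019", "8/16/2019"))

# Ten-day look-back window ending on 16 August 2019 (6 Aug to 16 Aug)
window <- interval(mdy("8/16/2019") - days(10), mdy("8/16/2019"))

# TRUE for each date inside the window; both endpoints count as inside
inside <- dates %within% window

# Number of dates in the window
sum(inside)
```

Here 1 August falls outside the window, while 6, 15 and 16 August fall inside it, so the sum is 3, which matches the roll_sum shown for 2019-08-16 in the output above.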
Answer 3:
rollapply in zoo takes a width argument, which can be a vector in case each point has a different width. To compute that width w, we convert date to Date class and then use ave to compute the widths for each country via wfun, which uses findInterval to find the position of the most recent date no later than 11 days ago. Subtracting that position from the current position gives the desired width. Finally we run rollapplyr.
In the question all events shown were Protest, and if that were always the case then the rolling sum would equal w, so we could avoid the rolling computation in the last line of code. However, we did not make that simplification in case your full data set includes other types of event that should not be counted.
library(zoo)
df2 <- transform(df1, date = as.Date(date, "%m/%d/%Y"))
wfun <- function(x) seq_along(x) - findInterval(x - 11, x)
w <- with(df2, ave(as.numeric(date), country, FUN = wfun))
transform(df2, roll_sum = rollapplyr(event == "Protest", w, sum))
giving (continued after output):
date country event roll_sum
1 2019-08-01 Algeria Protest 1
2 2019-08-02 Algeria Protest 2
3 2019-08-03 Algeria Protest 3
4 2019-08-06 Algeria Protest 4
5 2019-08-15 Algeria Protest 2
6 2019-08-16 Algeria Protest 3
7 2019-08-30 Algeria Protest 1
8 2019-09-01 Algeria Protest 2
9 2019-09-02 Algeria Protest 3
10 2019-09-03 Algeria Protest 4
11 2019-09-04 Algeria Protest 5
12 2019-06-01 Benin Protest 1
13 2019-06-26 Benin Protest 1
14 2019-07-01 Benin Protest 2
15 2019-07-02 Benin Protest 3
16 2019-07-09 Benin Protest 3
17 2019-07-10 Benin Protest 4
18 2019-08-01 Benin Protest 1
19 2019-08-02 Benin Protest 2
20 2019-08-15 Benin Protest 1
21 2019-08-28 Benin Protest 1
22 2019-09-01 Benin Protest 2
Note
We can double check w using a second approach to calculate it. This involves scanning all of date for each element of the width vector, so it is rather inefficient compared to the findInterval approach shown above, but for a one-off check that should not matter.
wfun2 <- function(x) sapply(x, function(y) sum(x >= y-10 & x <= y))
w2 <- with(df2, ave(as.numeric(date), country, FUN = wfun2))
identical(w, w2)
## [1] TRUE
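To make the findInterval step concrete, here is a small worked sketch on the first six Algeria dates from the question. The intermediate name n_old is introduced for illustration only and does not appear in the answer's code.

```r
# First six Algeria dates from the question, as day numbers
x <- as.numeric(as.Date(c("8/1/2019", "8/2/2019", "8/3/2019", "8/6/2019",
                          "8/15/2019", "8/16/2019"), "%m/%d/%Y"))

# For each date, the number of dates that are 11 or more days older;
# since x is sorted, this is also the position of the last such date
n_old <- findInterval(x - 11, x)

# Width = current position minus the number of dates that are too old
w <- seq_along(x) - n_old
w
```

The first four dates have nothing 11 or more days before them, so their widths are simply their positions 1, 2, 3, 4. For 15 and 16 August the first three dates are too old, which leaves widths 2 and 3, in line with the roll_sum column above.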
Answer 4:
Here is another way using dplyr and purrr::map_int. We can group_by country and, for each row, count the number of rows in that group dated within the 10 days up to and including the current date.
library(dplyr)
df1 %>%
mutate(date = as.Date(date, "%m/%d/%Y")) %>%
group_by(country) %>%
mutate(roll_sum = purrr::map_int(date, ~sum(date >= (.x - 10) & date <= (.x))))
# date country event roll_sum
# <date> <fct> <fct> <int>
# 1 2019-08-01 Algeria Protest 1
# 2 2019-08-02 Algeria Protest 2
# 3 2019-08-03 Algeria Protest 3
# 4 2019-08-06 Algeria Protest 4
# 5 2019-08-15 Algeria Protest 2
# 6 2019-08-16 Algeria Protest 3
# 7 2019-08-30 Algeria Protest 1
# 8 2019-09-01 Algeria Protest 2
# 9 2019-09-02 Algeria Protest 3
#10 2019-09-03 Algeria Protest 4
# … with 12 more rows
Source: https://stackoverflow.com/questions/57861197/conditional-rolling-sum-of-events-with-ragged-dates