问题
I have performed counts of events (in Group 1) over a time period for each group (in Group 2). I am looking to spread Group 1 events into separate columns, and using Group 2 and timestamp as rows. Each cell will contain the counts of events over a time period (Present date to the previous 4 days).
See the example below, for each of the Group 2 (I & II) I counted Events A and L in Group 1 happened within 4 days.
dates = as.Date(c("2011-10-09",
"2011-10-15",
"2011-10-16",
"2011-10-18",
"2011-10-21",
"2011-10-22",
"2011-10-24"))
group1=c("A",
"A",
"A",
"A",
"L",
"L",
"A")
group2=c("I",
"I",
"I",
"I",
"I",
"I",
"II")
df1 <- data.frame(dates, group1, group2)
Using dplyr pipes I managed to produce the following table (also see Count event types over time series by multiple conditions)
df1 %>%
group_by(group1, group2) %>%
mutate(count = sapply(dates
, function(x){
sum(dates <= x & dates > (x-4))
}))
dates group1 group2 count
<date> <fctr> <fctr> <int>
1 2011-10-09 A I 1
2 2011-10-15 A I 1
3 2011-10-16 A I 2
4 2011-10-18 A I 3
5 2011-10-21 L I 1
6 2011-10-22 L I 2
7 2011-10-24 A II 1
Eventually, I want to obtain a table similar to this, with Events A & L counts update according to dates (time period = current date - 4 days) in both I & II (Group 2).
dates group1 group2 count (A) count (L)
1 2011-10-09 A I 1 0
2 2011-10-15 A I 1 0
3 2011-10-16 A I 2 0
4 2011-10-18 A I 3 0
5 2011-10-21 L I 0 1
6 2011-10-22 L I 0 2
7 2011-10-24 A II 1 0
In a larger dataset, not all events in Group 1 appears in every Group 2. How can I update these empty cells so that it will either 1) carry forward the count from the previous row or 2) update the count based on the updated timestamp/ time period?
Thanks!
回答1:
While it is still a bit unclear what you want (see comments on the question), here are two potential approaches.
If all you want to do is spread the count
column out (for some reason) and fill it with 0's (whether there was an event in the preceding 4 days or not) and still count by the group2
breakdown (even though you are only labeling by group1
) and leave the event details in place (like your example in your question), you can just create a column with the labels you want, then use spread
to create the new columns. This
df1 %>%
group_by(group1, group2) %>%
mutate(count = sapply(dates
, function(x){
sum(dates <= x & dates > (x-4))
})) %>%
ungroup() %>%
mutate(toSpread = paste0("Count (", group1, ")")) %>%
spread(toSpread, count, fill = 0)
returns this:
dates group1 group2 `Count (A)` `Count (L)`
* <date> <fctr> <fctr> <dbl> <dbl>
1 2011-10-09 A I 1 0
2 2011-10-15 A I 1 0
3 2011-10-16 A I 2 0
4 2011-10-18 A I 3 0
5 2011-10-21 L I 0 1
6 2011-10-22 L I 0 2
7 2011-10-24 A II 1 0
Which matches the output you have shown in your question. However, if what you want is a count on any day with an event of how many of each group1's events have occurred, you will need to step back a bit further. For that, you need to generate a new data frame with the dates you want -- with a row for each group. This is easy to get using complete
from tidyr
. Then, you can check each of those for events that occurred in the preceding four days for that group.
df1 %>%
select(dates, group1) %>%
complete(dates, group1) %>%
mutate(count = sapply(1:n()
, function(idx){
sum(df1$dates <= dates[idx] &
df1$dates > (dates[idx]-4) &
df1$group1 == group1[idx])
})) %>%
mutate(group1 = paste0("Count (", group1, ")")) %>%
spread(group1, count, fill = 0)
returns:
# A tibble: 7 x 3
dates `Count (A)` `Count (L)`
* <date> <dbl> <dbl>
1 2011-10-09 1 0
2 2011-10-15 1 0
3 2011-10-16 2 0
4 2011-10-18 3 0
5 2011-10-21 1 1
6 2011-10-22 0 2
7 2011-10-24 1 2
Note that, if you want to include days for which there were no events, you can do that by passing the dates you want checked into complete
. For example:
df1 %>%
select(dates, group1) %>%
complete(dates = full_seq(dates, 1), group1) %>%
mutate(count = sapply(1:n()
, function(idx){
sum(df1$dates <= dates[idx] &
df1$dates > (dates[idx]-4) &
df1$group1 == group1[idx])
})) %>%
mutate(group1 = paste0("Count (", group1, ")")) %>%
spread(group1, count, fill = 0)
returns:
dates `Count (A)` `Count (L)`
* <date> <dbl> <dbl>
1 2011-10-09 1 0
2 2011-10-10 1 0
3 2011-10-11 1 0
4 2011-10-12 1 0
5 2011-10-13 0 0
6 2011-10-14 0 0
7 2011-10-15 1 0
8 2011-10-16 2 0
9 2011-10-17 2 0
10 2011-10-18 3 0
11 2011-10-19 2 0
12 2011-10-20 1 0
13 2011-10-21 1 1
14 2011-10-22 0 2
15 2011-10-23 0 2
16 2011-10-24 1 2
Based on the comments, I think I am finally understanding the goal. First, I would start out by, as above, creating a "long" data frame with the counts for each group1/group2 pair for every date:
fullDateCounts <-
df1 %>%
select(dates, group1, group2) %>%
complete(dates = full_seq(dates, 1), group1, group2) %>%
mutate(count = sapply(1:n()
, function(idx){
sum(df1$dates <= dates[idx] &
df1$dates > (dates[idx]-4) &
df1$group1 == group1[idx] &
df1$group2 == group2[idx]
)
}))
The top of this is:
dates group1 group2 count
<date> <fctr> <fctr> <int>
1 2011-10-09 A I 1
2 2011-10-09 A II 0
3 2011-10-09 L I 0
4 2011-10-09 L II 0
5 2011-10-10 A I 1
6 2011-10-10 A II 0
7 2011-10-10 L I 0
8 2011-10-10 L II 0
9 2011-10-11 A I 1
10 2011-10-11 A II 0
# ... with 54 more rows
From there, if you really need to convert to a wide form, you can either do so with a row for each group2 (or group1, if you switch the column names):
fullDateCounts %>%
mutate(group1 = paste0("Count (", group1, ")")) %>%
spread(group1, count, fill = 0)
returns:
dates group2 `Count (A)` `Count (L)`
* <date> <fctr> <dbl> <dbl>
1 2011-10-09 I 1 0
2 2011-10-09 II 0 0
3 2011-10-10 I 1 0
4 2011-10-10 II 0 0
5 2011-10-11 I 1 0
6 2011-10-11 II 0 0
7 2011-10-12 I 1 0
8 2011-10-12 II 0 0
9 2011-10-13 I 0 0
10 2011-10-13 II 0 0
# ... with 22 more rows
Or, you can generate a column for each group1/group2 pair:
fullDateCounts %>%
mutate(toSpread = paste0("Count (", group1, "-", group2, ")")) %>%
select(-group1, -group2) %>%
spread(toSpread, count, fill = 0)
returns
dates `Count (A-I)` `Count (A-II)` `Count (L-I)` `Count (L-II)`
* <date> <dbl> <dbl> <dbl> <dbl>
1 2011-10-09 1 0 0 0
2 2011-10-10 1 0 0 0
3 2011-10-11 1 0 0 0
4 2011-10-12 1 0 0 0
5 2011-10-13 0 0 0 0
6 2011-10-14 0 0 0 0
7 2011-10-15 1 0 0 0
8 2011-10-16 2 0 0 0
9 2011-10-17 2 0 0 0
10 2011-10-18 3 0 0 0
11 2011-10-19 2 0 0 0
12 2011-10-20 1 0 0 0
13 2011-10-21 1 0 1 0
14 2011-10-22 0 0 2 0
15 2011-10-23 0 0 2 0
16 2011-10-24 0 1 2 0
来源:https://stackoverflow.com/questions/47781899/fill-count-sum-based-on-previous-row-count-over-time-series