I\'m trying to calculate a rolling count/sum of occurrences by group over the series of a time frame.
I have a data frame with some sample data like this:
For this example, you can probably use sapply
to analyze each row, counting the number of entries on that day or up to 4 days earlier, like so:
df1$lastFour <-
sapply(df1$dates, function(x){
sum(df1$dates <= x & df1$dates >= x - 4)
})
Results in df1
of:
dates group1 group2 lastFour
1 2011-10-09 A D 1
2 2011-10-15 C A 1
3 2011-10-16 A B 2
4 2011-10-18 A H 3
5 2011-10-21 L A 2
6 2011-10-22 F A 3
7 2011-10-24 A E 3
If, as your question implies, your data are from a larger set and you want to do the analysis on each group (conceptually, I think the question is: how many events have had this group in the last four days? asked only on days with an event from that group), you could follow the steps below.
First, here are some larger sample data with groups labelled as the first 10 letters of the alphabet:
biggerData <-
data.frame(
dates = sample(seq(as.Date("2011-10-01")
, as.Date("2011-10-31")
, 1)
, 100, TRUE)
, group1 = sample(LETTERS[1:10], 100, TRUE)
, group2 = sample(LETTERS[1:10], 100, TRUE)
)
Next, I extract all of the groups in the data (here, I know them, but for your real data, you may or may not have that list of groups already)
groupsInData <-
sort(unique(c(as.character(biggerData$group1)
, as.character(biggerData$group2))))
Then, I loop through that vector of group names and extract each of the events with that group as one of the two groups, adding the same column as above, and saving the separate data.frames in a list (and naming them to make it easier to access/track them).
sepGroupCounts <- lapply(groupsInData, function(thisGroup){
dfTemp <- biggerData[biggerData$group1 == thisGroup |
biggerData$group2 == thisGroup, ]
dfTemp$lastFour <-
sapply(dfTemp$dates, function(x){
sum(dfTemp$dates <= x & dfTemp$dates >= x - 4)
})
return(dfTemp)
})
names(sepGroupCounts) <- groupsInData
returns a data.frame just like above for each of the groups in your data.
And, I couldn't help myself, so here is a dplyr
and tidyr
solution as well. It is not much different than the list-based solution above, except that it returns everything in the same data.frame (which may or may not be a good thing, particularly as it will have two entries for each event this way).
First, for simplicity, I defined a function to do the date checking. This could easily be used above as well.
myDateCheckFunction <- function(x){
sapply(x, function(thisX){
sum(x <= thisX & x >= thisX - 4 )
})
}
Next, I am constructing a set of logical tests that will determine whether or not each of the groups is present. These will be used to generate columns for each group, giving TRUE/FALSE for present/absent in each event.
dotsConstruct <-
paste0("group1 == '", groupsInData, "' | "
, "group2 == '", groupsInData, "'") %>%
setNames(groupsInData)
Finally, putting it altogether in one piped call. Instead of describing, I have commented each step.
withLastFour <-
# Start with data
biggerData %>%
# Add a col for each group using Standard Evaluation
mutate_(.dots = dotsConstruct) %>%
# convert to long form; one row per group per event
gather(GroupAnalyzed, Present, -dates, -group1, -group2) %>%
# Limit to only rows where the `GroupAnalyzed` is present
filter(Present) %>%
# Remove the `Present` column, as it is now all "TRUE"
select(-Present) %>%
# Group by the groups we are analyzing
group_by(GroupAnalyzed) %>%
# Add the column for count in the last four dates
# `group_by` limits this to just counts within that group
mutate(lastFour = myDateCheckFunction(dates)) %>%
# Sort by group and date for prettier checking
arrange(GroupAnalyzed, dates)
The result is similar to the above list
output, except with everything in one data.frame, which may allow for easier analysis of some features. The top looks like this:
dates group1 group2 GroupAnalyzed lastFour
1 2011-10-01 B A A 1
2 2011-10-02 J A A 2
3 2011-10-05 C A A 5
4 2011-10-05 C A A 5
5 2011-10-05 G A A 5
6 2011-10-08 E A A 5
Note that my random sample had multiple events on Oct-05, leading to the large counts here.