I have a data set (customerId, transactionDate, productId, purchaseQty) loaded into a data.table. For each row, I want to calculate the sum and mean of purchaseQty for the pr
First, we find, per customer, how many transaction dates fall in the 45-day window prior to the current date (including the current date):
setDT(df)
df[, n:= 1:.N - findInterval(transactionDate - 45, transactionDate), by=.(customerID)]
df
# productId customerID transactionDate purchaseQty n
#1: 870826 1186951 2016-03-28 162000 1
#2: 870826 1244216 2016-03-31 5000 1
#3: 870826 1244216 2016-04-08 6500 2
#4: 870826 1308671 2016-03-28 221367 1
#5: 870826 1308671 2016-03-29 83633 2
#6: 870826 1308671 2016-11-29 60500 1
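To see what the findInterval step is doing, here is a minimal base R sketch on a single customer's dates. The `dates` vector is made up for illustration and is assumed to be sorted ascending, which findInterval requires.

```r
# Hypothetical, sorted transaction dates for one customer
dates <- as.Date(c("2016-03-28", "2016-03-29", "2016-04-20", "2016-11-29"))

# For each date, count the dates that are at or before (date - 45 days),
# i.e. the rows that have already dropped out of the window
outside <- findInterval(dates - 45, dates)

# Row position minus the dropped rows = number of rows inside the window
n <- seq_along(dates) - outside
n
```

The first three dates are all within 45 days of each other, so the count grows 1, 2, 3; the November date is more than 45 days after the others, so its window holds only itself.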
Next we find a rolling sum of purchaseQty with window size n, adopting a great answer here:
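g <- function(x, window){
  b_pos <- seq_along(x) - window + 1 # begin position of each row's window
  cum <- cumsum(x)                   # running total of x
  cum - cum[b_pos] + x[b_pos]        # sum of x from b_pos up to the current row
}
# rows must already be ordered by customerID, then transactionDate
df[, sumWindowPurchases := g(purchaseQty, n)][, n := NULL]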
df
# productId customerID transactionDate purchaseQty sumWindowPurchases
#1: 870826 1186951 2016-03-28 162000 162000
#2: 870826 1244216 2016-03-31 5000 5000
#3: 870826 1244216 2016-04-08 6500 11500
#4: 870826 1308671 2016-03-28 221367 221367
#5: 870826 1308671 2016-03-29 83633 305000
#6: 870826 1308671 2016-11-29 60500 60500
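Since the question also asks for the mean, one possible extension is to divide the window sum by the window size before n is dropped. The sketch below works on plain vectors; `qty`, `n`, `win_sum` and `win_mean` are illustrative names, not part of the answer above.

```r
# Same rolling-sum helper as above
g <- function(x, window) {
  b_pos <- seq_along(x) - window + 1  # begin position of each window
  cum <- cumsum(x)                    # running total
  cum - cum[b_pos] + x[b_pos]         # sum over each window
}

qty <- c(5000, 6500, 1000)  # hypothetical purchase quantities
n   <- c(1, 2, 2)           # hypothetical window sizes (rows per window)

win_sum  <- g(qty, n)       # 5000 11500 7500
win_mean <- win_sum / n     # 5000  5750 3750
```

In the data.table version this would amount to computing the mean column while n still exists, and only then removing n.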
Data:
df <- structure(list(productId = c(870826L, 870826L, 870826L, 870826L,
870826L, 870826L), customerID = c(1186951L, 1244216L, 1244216L,
1308671L, 1308671L, 1308671L), transactionDate = structure(c(16888,
16891, 16899, 16888, 16889, 17134), class = "Date"), purchaseQty = c(162000L,
5000L, 6500L, 221367L, 83633L, 60500L)), .Names = c("productId",
"customerID", "transactionDate", "purchaseQty"), row.names = c("1:",
"2:", "3:", "4:", "5:", "6:"), class = "data.frame")
This also works, and it could be considered simpler. It has the advantage of not requiring a sorted input set, and it has fewer dependencies.
I still don't understand why it produces two transactionDate columns in the output. This seems to be a byproduct of the "on" clause. In fact, the output seems to contain every element of the on clause, in order and without their alias names, followed by the sum.
DT[.(p=productId, c=customerID, tmin=transactionDate - 45, tmax=transactionDate),
on = .(productId==p, customerID==c, transactionDate<=tmax, transactionDate>=tmin),
.(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]
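On the two transactionDate columns: as far as I understand, in a non-equi join data.table names the join columns after the columns of the outer table but fills them with the values from the inner list, so each inequality on transactionDate contributes its own column (first tmax, then tmin). A sketch of one way to tidy that up, assuming the data.table package is installed; the sample rows and the names `windowEnd`, `windowStart` and `windowMean` are my own additions:

```r
library(data.table)

# Small sample resembling the data in the question
DT <- data.table(
  productId = 870826L,
  customerID = c(1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
  transactionDate = as.Date(c("2016-03-31", "2016-04-08",
                              "2016-03-28", "2016-03-29", "2016-11-29")),
  purchaseQty = c(5000L, 6500L, 221367L, 83633L, 60500L)
)

# Non-equi self-join: one result row per input row, aggregated over its window
res <- DT[.(p = productId, c = customerID,
            tmin = transactionDate - 45, tmax = transactionDate),
          on = .(productId == p, customerID == c,
                 transactionDate <= tmax, transactionDate >= tmin),
          .(windowSum = sum(purchaseQty), windowMean = mean(purchaseQty)),
          by = .EACHI, nomatch = 0]

# Rename by position so the two date columns become distinguishable
setnames(res, c("productId", "customerID", "windowEnd", "windowStart",
                "windowSum", "windowMean"))
res
```

Renaming by position avoids having to refer to either of the duplicated names, and the mean comes for free in the same j expression.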