relative windowed running sum through data.table non-equi join

-上瘾入骨i 2021-01-05 13:35

I have a data set with the columns customerId, transactionDate, productId, and purchaseQty loaded into a data.table. For each row, I want to calculate the sum and mean of purchaseQty for the pr

2 Answers
  • 2021-01-05 13:58

    First, for each row we count how many of that customer's transaction dates fall in the 45-day window ending on the current date (the current date included):

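    library(data.table)
    setDT(df)  # convert the data.frame (see Data below) to a data.table by reference
    # findInterval() needs each customer's dates sorted in ascending order
    # n = row index within the customer minus the number of dates outside the window
    df[, n := seq_len(.N) - findInterval(transactionDate - 45, transactionDate), by = .(customerID)]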
    df
    #   productId customerID transactionDate purchaseQty n
    #1:    870826    1186951      2016-03-28      162000 1
    #2:    870826    1244216      2016-03-31        5000 1
    #3:    870826    1244216      2016-04-08        6500 2
    #4:    870826    1308671      2016-03-28      221367 1
    #5:    870826    1308671      2016-03-29       83633 2
    #6:    870826    1308671      2016-11-29       60500 1
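
    To make the counting step concrete, here is a minimal base R sketch for a single customer (1308671 from the data below). The variable names `d`, `outside`, `e` and `n_edge` are only illustrative, and the dates in `e` are hypothetical values chosen to show the boundary behaviour: `findInterval()` counts the dates that are less than or equal to the current date minus 45, so a transaction exactly 45 days earlier falls outside the window.

    ```r
    # Dates for customer 1308671, already sorted ascending
    d <- as.Date(c("2016-03-28", "2016-03-29", "2016-11-29"))

    # outside[i] = how many dates are <= d[i] - 45,
    # i.e. how many earlier transactions are too old for row i's window
    outside <- findInterval(d - 45, d)

    # Row index minus the outside count = transactions inside the window
    n <- seq_along(d) - outside
    n  # 1 2 1

    # Boundary case (hypothetical dates): two transactions exactly 45 days apart
    e <- as.Date("2016-01-01") + c(0, 45)
    n_edge <- seq_along(e) - findInterval(e - 45, e)
    n_edge  # 1 1, the earlier transaction is not counted for the later one
    ```

    If a transaction exactly 45 days earlier should be included, subtracting 46 instead of 45 would be one way to shift the boundary.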
    

    Next, we compute a rolling sum of purchaseQty where each row uses its own window size n. This adapts an approach from another answer:

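    # Rolling sum with a per-row window size, computed from a single cumulative sum
    g <- function(x, window){
      b_pos <- seq_along(x) - window + 1  # first position inside each row's window
      cum <- cumsum(x)
      # sum over positions b_pos..i equals cum[i] - cum[b_pos] + x[b_pos]
      cum - cum[b_pos] + x[b_pos]
    }
    # No grouping needed: n never reaches back past the customer's first row
    df[, sumWindowPurchases := g(purchaseQty, n)][, n := NULL]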
    df
    #   productId customerID transactionDate purchaseQty sumWindowPurchases
    #1:    870826    1186951      2016-03-28      162000             162000
    #2:    870826    1244216      2016-03-31        5000               5000
    #3:    870826    1244216      2016-04-08        6500              11500
    #4:    870826    1308671      2016-03-28      221367             221367
    #5:    870826    1308671      2016-03-29       83633             305000
    #6:    870826    1308671      2016-11-29       60500              60500
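
    The question also asks for the mean. Since n is the number of transactions in each window, one way to get it is to divide the windowed sum by n before n is dropped. The following is a self-contained sketch under that assumption; the column name `meanWindowPurchases` is my own choice, and the data are retyped from the Data section below.

    ```r
    library(data.table)

    df <- data.table(
      productId = 870826L,
      customerID = c(1186951L, 1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
      transactionDate = as.Date(c("2016-03-28", "2016-03-31", "2016-04-08",
                                  "2016-03-28", "2016-03-29", "2016-11-29")),
      purchaseQty = c(162000L, 5000L, 6500L, 221367L, 83633L, 60500L)
    )
    # Both steps rely on rows being sorted by customer, then date
    setkey(df, customerID, transactionDate)

    # Rolling sum with a per-row window size (same idea as g above)
    g <- function(x, window) {
      b_pos <- seq_along(x) - window + 1L
      cum <- cumsum(x)
      cum - cum[b_pos] + x[b_pos]
    }

    # n = number of this customer's transactions inside each row's window
    df[, n := seq_len(.N) - findInterval(transactionDate - 45, transactionDate),
       by = customerID]
    df[, sumWindowPurchases := g(purchaseQty, n)]
    # Windowed mean = windowed sum / number of transactions in the window
    df[, meanWindowPurchases := sumWindowPurchases / n]
    df[, n := NULL]
    ```

    For the third row, for example, this gives a sum of 11500 over two transactions and therefore a mean of 5750.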
    

    Data

    df <- structure(list(productId = c(870826L, 870826L, 870826L, 870826L, 
    870826L, 870826L), customerID = c(1186951L, 1244216L, 1244216L, 
    1308671L, 1308671L, 1308671L), transactionDate = structure(c(16888, 
    16891, 16899, 16888, 16889, 17134), class = "Date"), purchaseQty = c(162000L, 
    5000L, 6500L, 221367L, 83633L, 60500L)), .Names = c("productId", 
    "customerID", "transactionDate", "purchaseQty"), row.names = c("1:", 
    "2:", "3:", "4:", "5:", "6:"), class = "data.frame")
    
  • 2021-01-05 13:58

    This also works and could be considered simpler. It has the advantage of not requiring sorted input, and it has fewer dependencies.

    I still don't understand why it produces two transactionDate columns in the output. This seems to be a byproduct of the on clause. In fact, the output appears to contain the columns of the on clause in order, under their original names rather than the aliases, with the sum appended after them.

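    # Non-equi self-join: each row of DT supplies the bounds [transactionDate - 45, transactionDate],
    # and purchaseQty is summed over the rows of DT for the same product and customer within them.
    DT[.(p=productId, c=customerID, tmin=transactionDate - 45, tmax=transactionDate),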
        on = .(productId==p, customerID==c, transactionDate<=tmax, transactionDate>=tmin),
        .(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]
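
    On the two transactionDate columns: as far as I understand data.table's behaviour, a non-equi join returns one column per on condition, named after the column of DT but holding the value from i. If so, the first transactionDate is tmax and the second is tmin. The sketch below assumes that, renames the two columns by position, and adds the mean alongside the sum. The names `res`, `windowEnd`, `windowStart` and `windowMean` are my own, and the data are retyped from the first answer.

    ```r
    library(data.table)

    DT <- data.table(
      productId = 870826L,
      customerID = c(1186951L, 1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
      transactionDate = as.Date(c("2016-03-28", "2016-03-31", "2016-04-08",
                                  "2016-03-28", "2016-03-29", "2016-11-29")),
      purchaseQty = c(162000L, 5000L, 6500L, 221367L, 83633L, 60500L)
    )

    # One result row per row of i (by = .EACHI), aggregated over the matching rows of DT
    res <- DT[.(p = productId, c = customerID,
                tmin = transactionDate - 45, tmax = transactionDate),
              on = .(productId == p, customerID == c,
                     transactionDate <= tmax, transactionDate >= tmin),
              .(windowSum = sum(purchaseQty), windowMean = mean(purchaseQty)),
              by = .EACHI, nomatch = 0]

    # Columns 3 and 4 are both called transactionDate; rename them by position
    # (assumption: they follow the order of the on conditions, tmax then tmin)
    setnames(res, 3:4, c("windowEnd", "windowStart"))
    res
    ```

    Note that this window includes a transaction exactly 45 days before the current date, because the lower bound uses `>=`.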
    