Creating a new R data.table column based on values in another column and grouping

北海茫月 2020-12-10 06:39

I have a data.table with date, zipcode and purchase amounts.

library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date()-365 + sort(s         


        
2 Answers
  • 2020-12-10 07:25

    This seems to work:

    DT[, new_col := 
      DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
        sum(purchaseAmount)
      , by=.EACHI ]$V1
    ]
    
    
              date  zip purchaseAmount new_col
     1: 2016-01-08 1150              5       5
     2: 2016-01-15 3000             15      15
     3: 2016-02-15 1150             16      16
     4: 2016-02-20 2000             18      18
     5: 2016-03-07 2000             19      19
     6: 2016-03-15 2000             11      30
     7: 2016-03-17 2000              6      36
     8: 2016-04-02 1150             17      17
     9: 2016-04-08 3000              7       7
    10: 2016-04-09 3000             20      27
    

    This uses a "non-equi" join. For each row, it finds all rows with the same zip whose date falls between that row's date minus 10 days and the date itself (the conditions in the on= expression), then sums purchaseAmount once per lookup row (by=.EACHI). In this case, a non-equi join is probably less efficient than some rolling-sum approach.
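
    Since the question's sample data is cut off, here is a minimal, self-contained sketch of the same join on the ten rows printed above. The names windows, res and total are illustrative and not part of the original answer; building the lookup table in a separate step is only one way to make the two sides of the join easier to see.

    ```r
    library(data.table)

    # Fixed copy of the ten rows shown in the output above
    DT <- data.table(
      date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                       "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                       "2016-04-08", "2016-04-09")),
      zip = c(1150, 3000, 1150, 2000, 2000, 2000, 2000, 1150, 3000, 3000),
      purchaseAmount = c(5, 15, 16, 18, 19, 11, 6, 17, 7, 20)
    )

    # Lookup table: one row per row of DT, with the window bounds spelled out
    windows <- DT[, .(zip, d0 = date - 10, d1 = date)]

    # Non-equi join: for each window, sum purchases in the same zip
    # whose date lies in [d0, d1]; by = .EACHI evaluates j once per window row
    res <- DT[windows, on = .(zip, date >= d0, date <= d1),
              .(total = sum(purchaseAmount)), by = .EACHI]

    # res has one row per row of windows, in the same order, so it lines up with DT
    DT[, new_col := res$total]
    print(DT)
    ```

    The new_col values should match the table above, for example 30 and 36 for the two mid-March purchases in zip 2000.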


    How it works.

    To add columns to a data.table, the usual syntax is DT[, new_col := expression]. Here, the expression on the right-hand side also works on its own, outside the outer DT[...]. Try running it by itself:

    DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
      sum(purchaseAmount)
    , by=.EACHI ]$V1
    

    You can progressively simplify this until it's just the join...

    DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
      sum(purchaseAmount)
    , by=.EACHI ]
    # note that V1 is the default name for computed columns
    
    DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1)]
    # now we're down to just the join
    

    The join syntax is like x[i, on=.(xcol = icol, xcol2 < icol2)], as documented on the help page that opens when you run ?data.table in an R console with the data.table package loaded.
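
    As a rough illustration of that x[i, on=...] form, the sketch below joins two made-up tables. Everything in it (x, lookup, id, key_id, t, cutoff, val) is hypothetical and only meant to show one equality condition combined with one inequality.

    ```r
    library(data.table)

    # Hypothetical toy tables, not taken from the question
    x <- data.table(id = c("a", "a", "b"), t = c(1, 5, 3), val = c(10, 20, 30))
    lookup <- data.table(key_id = c("a", "b"), cutoff = c(4, 3))

    # Equality on id = key_id, inequality on t < cutoff.
    # Each lookup row returns its matching rows of x; with the default
    # nomatch, a lookup row without any match is kept with NA in x's columns.
    res <- x[lookup, on = .(id = key_id, t < cutoff)]
    print(res)
    ```

    Only the first row of x satisfies both conditions for "a", and no row does for "b", so val should come back as 10 and NA. Note that in a non-equi join the t column of the result shows the cutoff values from lookup rather than x's own t, which is easy to misread.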

    To get started with data.table, I'd suggest reviewing the vignettes. After that, this'll probably look a lot more legible.

  • 2020-12-10 07:34

    I didn't find any data.table solutions, but this is how I got it:

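    library(dplyr)
    # Preallocate one slot per row instead of growing the vector in the loop
    earlierPurchases <- numeric(nrow(DT))
    
    for (i in seq_len(nrow(DT))) {
      # Rows in the same zip with a strictly earlier date (no 10-day cutoff)
      temp <- dplyr::filter(DT, zip == zip[i] & date < date[i])
      earlierPurchases[i] <- sum(temp$purchaseAmount)
    }
    
    DT <- cbind(DT, earlierPurchases)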
    

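    It worked quite fast for me, though the loop filters the whole table once per row, so expect it to slow down as the table grows.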
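
    For comparison, a grouped cumulative sum can produce the same "all strictly earlier purchases in the same zip" total without a loop. This is a different technique from the loop above, sketched on the ten rows from the first answer's output, and it assumes the rows are sorted by date and that no two purchases share a date within a zip.

    ```r
    library(data.table)

    # Fixed copy of the ten rows shown in the first answer's output
    DT <- data.table(
      date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                       "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                       "2016-04-08", "2016-04-09")),
      zip = c(1150, 3000, 1150, 2000, 2000, 2000, 2000, 1150, 3000, 3000),
      purchaseAmount = c(5, 15, 16, 18, 19, 11, 6, 17, 7, 20)
    )

    # Assumption: rows are already in date order
    stopifnot(!is.unsorted(DT$date))

    # Running total within each zip, minus the current row's own amount,
    # leaves the sum of all earlier purchases in that zip
    DT[, earlierPurchases := cumsum(purchaseAmount) - purchaseAmount, by = zip]
    print(DT)
    ```

    If several purchases can fall on the same date in one zip, this would count same-day rows that appear earlier in the table, which the loop's date < date[i] test excludes.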
