Select last row by group for all columns data.table

后端 未结 2 1313
走了就别回头了
走了就别回头了 2020-12-06 07:27

I was surprised doing the following:

R) system.time(lastOrder <- order[,lapply(.SD,tail,1),by=\"TRADER_ID,EXEC_IDATE\"]);
utilisateur     système      éco         


        
相关标签:
2条回答
  • 2020-12-06 08:06

    Last row by group :

    DT[, .SD[.N], by="TRADER_ID,EXEC_IDATE"]            # (1)
    

    or, faster (avoid use of .SD where possible, for speed) :

    w = DT[, .I[.N], by="TRADER_ID,EXEC_IDATE"][[3]]    # (2)
    DT[w]
    

    Note that the following feature request will make approach (1) as fast as approach (2) :

    FR#2330 Optimize .SD[i] query to keep the elegance but make it faster unchanged.

    0 讨论(0)
  • 2020-12-06 08:33

    How about something like this? (Synthetic data meant to mimic what I can infer about yours from the question)

    tmp <- data.table(id = sample(1:20, 1e6, replace=TRUE),
                      date = as.Date(as.integer(runif(n=1e6, min = 1e4, max = 1.1e4)),
                                     origin = as.Date("1970-01-01")),
                      data1 = rnorm(1e6),
                      data2 = rnorm(1e6),
                      data3 = rnorm(1e6))
    
    > system.time(X <- tmp[, lapply(.SD, tail, 1), by = list(id, date)])
      user  system elapsed 
      1.95    0.00    1.95
    
    > system.time(Y <- tmp[, list(tail(data1, 1)), by = list(id, date)])
      user  system elapsed 
      1.24    0.01    1.26 
    
    > system.time({
        setkey(tmp, id, date)
        Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"]
    })
      user  system elapsed 
      0.90    0.02    0.92 
    

    X and Z are the same after same order is ensured:

    > identical(setkey(X, id, date), setkey(Z, id, date))
    [1] TRUE
    

    The difference between my lapply tail and 1-column tail isn't as drastic as yours, but without the structure of your data, it's hard to say more.

    Also, note that most of the time in this method is setting the key. If the table is already sorted by the grouping columns, it goes really fast:

    > system.time(Z <- tmp[unique(tmp)[, key(tmp), with=FALSE], mult="last"])
      user  system elapsed 
      0.03    0.00    0.03 
    

    Alternatively, you could translate the many column problem to the 1-column problem with a temporary column:

    > system.time({
      tmp[, row.num := seq_len(nrow(tmp))]
      W <- tmp[tmp[, max(row.num), by = list(id, date)]$V1][, row.num := NULL]
      tmp[, row.num := NULL]
    })
    user  system elapsed 
    0.92    0.00    1.09 
    
    > identical(setkey(X, id, date), setkey(W, id, date))
    [1] TRUE
    
    0 讨论(0)
提交回复
热议问题