R data.table join with inequality conditions

后端 未结 2 2153
南笙
南笙 2021-02-08 06:47

I would like to subset my data based on multiple inequality conditions using the data.table package. The examples in the data.table manual show how to do this with character va

相关标签:
2条回答
  • 2021-02-08 07:20

    I run some benchmarks

    library(dplyr)
    library(data.table)
    library(microbenchmark)
    
    dt.data.frame.way <- function(data) data[X > 0 & Y > 0 & Z > 0]
    dplyr.way <- function(df) filter(df, X > 0, Y > 0, Z > 0)
    real.data.frame.way <- function(df) df[df$X > 0 & df$Y > 0 & df$Z > 0,]
    
    data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
    setkey(data, X, Y, Z)
    df <- as.data.frame(data)
    
    microbenchmark(times = 10,
                   dt.data.frame.way(data),
                   dplyr.way(df),
                   real.data.frame.way(df))
    # Unit: microseconds
    #                     expr     min       lq       mean    median       uq        max neval
    #  dt.data.frame.way(data) 710.426  754.287   871.8784  824.7565  942.998   1180.458    10
    #            dplyr.way(df) 951.309 1045.246 12303.3462 1142.7440 1246.668 112775.934    10
    #  real.data.frame.way(df) 137.239  162.591   181.5254  187.9785  197.373    231.594    10
    

    Simple clone example data to 5.5M rows.

    data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
    data <- rbindlist(lapply(1:5e5, function(i) data)) # 5500000 rows
    setkey(data, X, Y, Z)
    df <- as.data.frame(data)
    
    microbenchmark(times = 10,
                   dt.data.frame.way(data),
                   dplyr.way(df),
                   real.data.frame.way(df))
    # Unit: milliseconds
    #                     expr      min        lq      mean    median        uq       max neval
    #  dt.data.frame.way(data) 656.2978  668.0560  730.9246  696.6560  831.0877  846.0517    10
    #            dplyr.way(df) 632.4096  639.1141  709.4308  678.9436  717.3018 1015.7663    10
    #  real.data.frame.way(df) 964.4298 1022.1772 1075.8448 1077.4437 1125.0037 1192.7410    10
    

    Performance of that task seems to be hard to improve. Often it depends on the data.

    0 讨论(0)
  • 2021-02-08 07:21

    The solution is quite fast and straightforward using the package dplyr.

    install.packages(dplyr)
    library(dplyr)
    
    newdata <- filter(data, X > 0 , Y > 0 , Z > 0)
    

    dplyr is showing to be one of the easiest and fastest packages for managing data frames. Check this great tutorial here: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

    The RStudio team have alsoe produced a nice Cheat Sheet, here: http://www.rstudio.com/resources/cheatsheets/

    0 讨论(0)
提交回复
热议问题