R: fast (conditional) subsetting where feasible

忘了有多久 2020-12-11 04:31

I would like to subset rows of my data

library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, x=runif(n,100,120), y=runif(n,200,220),          


        
2 Answers
  • 2020-12-11 05:31

    One approach is to wrap dplyr's filter in a modified function. When no rows meet the condition, the non_empty_filter function below returns the original data set unchanged.

    Notes

    • IMHO, this is fairly non-standard behaviour and should be reported via a warning. The warning can be removed without affecting the function's results.

    Function

    library(tidyverse)
    library(rlang) # enquo
    non_empty_filter <- function(df, expr) {
        expr <- enquo(expr)
    
        res <- df %>% filter(!!expr)
    
        if (nrow(res) > 0) {
            return(res)
        } else {
            # Indicate that filter is not applied
            warning("No rows meeting condition")
            return(df)
        }
    }
    

    Condition met

    Behaviour: Returning one row for which the condition is met.

    dat %>%
        non_empty_filter(x > 119 & y > 219)
    

    Results

    # id        x        y        z
    # 1 55 119.2634 219.0044 315.6556
    

    Condition not met

    Behaviour: Returning the full data set as the whole condition is not met due to y > 1e6.

    dat %>%
        non_empty_filter(x > 119 & y > 219 & y > 1e6)
    

    Results

    # id        x        y        z
    # 1:   1 109.3400 208.6732 308.7595
    # 2:   2 101.6920 201.0989 310.1080
    # 3:   3 119.4697 217.8550 313.9384
    # 4:   4 111.4261 205.2945 317.3651
    # 5:   5 100.4024 212.2826 305.1375
    # 6:   6 114.4711 203.6988 319.4913
    # 7:   7 112.1879 209.5716 319.6732
    # 8:   8 106.1344 202.2453 312.9427
    # 9:   9 101.2702 210.5923 309.2864
    # 10:  10 106.1071 211.8266 301.0645
    

    Condition met/not met one-by-one

    Behaviour: Skipping filter that would return an empty data set.

    dat %>%
        non_empty_filter(y > 1e6) %>% 
        non_empty_filter(x > 119) %>% 
        non_empty_filter(y > 219)
    

    Results

    # id        x        y        z
    # 1 55 119.2634 219.0044 315.6556
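
The one-by-one chaining above could also be folded into a single call that takes several conditions. The following is only a sketch: the helper name non_empty_filter_all and the toy data are hypothetical and not part of the original answer, and it assumes dplyr and rlang are installed.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: apply each condition in turn and skip any
# condition that would leave zero rows, warning about the skip.
non_empty_filter_all <- function(df, ...) {
    conds <- enquos(...)
    for (cond in conds) {
        res <- filter(df, !!cond)
        if (nrow(res) > 0) {
            df <- res
        } else {
            warning("Skipped condition with no matching rows: ", as_label(cond))
        }
    }
    df
}

# Toy data (an assumption, not the question's `dat`)
toy <- data.frame(id = 1:5, x = c(1, 5, 9, 9, 2), y = c(10, 20, 30, 40, 50))

# y > 1e6 matches nothing and is skipped; the other two conditions apply
out <- suppressWarnings(non_empty_filter_all(toy, y > 1e6, x > 8, y > 35))
```

As with the chained version, the result depends on the order of the conditions, because each one is tested against whatever rows survived the earlier ones.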
    
  • 2020-12-11 05:35

    I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):

    f = function(x, ..., verbose=FALSE){
      L   = substitute(list(...))[-1]
      mon = data.table(cond = as.character(L))[, skip := FALSE]
    
      for (i in seq_along(L)){
        d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
        if (nrow(d)){
          x = d
        } else {
          mon[i, skip := TRUE]
        }    
      }
      print(mon)
      return(x)
    }
    

    Usage

    > f(dat, x > 119, y > 219, y > 1e6)
            cond  skip
    1:   x > 119 FALSE
    2:   y > 219 FALSE
    3: y > 1e+06  TRUE
       id        x        y        z
    1: 55 119.2634 219.0044 315.6556
    

    The verbose option prints extra information from the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see this in the console output.
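
To check index use outside of f, a small sketch like the one below may help. It uses a toy table rather than the question's dat and assumes data.table's default options (auto-indexing enabled). As far as I know, auto-indexing applies to equality-style subsets such as == and %in%, so range conditions like x > 119 would not be expected to benefit.

```r
library(data.table)

# Toy data (an assumption, not the question's `dat`)
dt <- data.table(g = rep(1:5, each = 4), v = 1:20)

before <- indices(dt)   # no secondary index has been built yet
res    <- dt[g == 2L]   # equality subset; with default options this can build an index on `g`
after  <- indices(dt)   # names of the indices now stored on `dt`

# verbose = TRUE reports in the console whether an index is created or reused
invisible(dt[g == 2L, verbose = TRUE])
```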

    "... because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."

    If it's for non-interactive use, maybe better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. Also, the verbose console output could be captured and returned.
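
A minimal sketch of that non-interactive variant, applied over a list with lapply, might look like this. The name f2, the argument name dt, and the toy tables are assumptions for illustration; the loop itself mirrors f above, minus the verbose handling.

```r
library(data.table)

# Hypothetical variant of f(): same loop, but the log is returned
# together with the result instead of being printed.
f2 <- function(dt, ...) {
  L   <- as.list(substitute(list(...)))[-1L]
  lab <- vapply(L, function(e) paste(deparse(e), collapse = " "), character(1))
  mon <- data.table(cond = lab, skip = FALSE)

  for (i in seq_along(L)) {
    d <- eval(substitute(dt[cond], list(cond = L[[i]])))
    if (nrow(d)) {
      dt <- d                    # keep the subset only if it is non-empty
    } else {
      mon[i, skip := TRUE]       # record that this condition was skipped
    }
  }
  list(mon = mon, x = dt)
}

# Toy list of tables (an assumption, not the question's data)
tabs <- list(
  a = data.table(id = 1:4, x = c(1, 5, 9, 9), y = c(10, 20, 30, 40)),
  b = data.table(id = 1:3, x = c(2, 3, 4),    y = c(50, 60, 70))
)

out <- lapply(tabs, function(d) f2(d, x > 8, y > 35))
out$a$x     # filtered table for `a`
out$b$mon   # log showing which conditions were skipped for `b`
```

Returning the log per table makes it possible to inspect afterwards which conditions were skipped for which element of the list.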
