Filter each column of a data.frame based on a specific value

终归单人心 2020-12-09 10:54

Consider the following data frame:

df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3            


        
4 Answers
  • 2020-12-09 11:21

    How to specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

    It might not be the most elegant solution, but it gets the job done:

    df %>% filter(!rowSums(.[, !colnames(.) %in% 'X5', drop = FALSE] < 2))
    

    To exclude several columns (e.g. X3 and X5), use:

    df %>% filter(!rowSums(.[, !colnames(.) %in% c('X3', 'X5'), drop = FALSE] < 2))
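
    To see why the negated rowSums() does the filtering, here is a minimal base R sketch on a small hand-made data frame. The name toy and its values are made up for illustration and are not the data from the question. The comparison yields a logical matrix, rowSums() counts the values below 2 in each row, and ! keeps only the rows where that count is zero.

    ```r
    # Hypothetical toy data, not the data from the question
    toy <- data.frame(X1 = c(7, 1, 5), X2 = c(9, 4, 1), X5 = c(1, 1, 1))

    # Columns to check: everything except X5
    checked <- toy[, !colnames(toy) %in% "X5", drop = FALSE]

    # Number of values below 2 in each row of the checked columns
    violations <- rowSums(checked < 2)

    # A count of 0 negates to TRUE (row kept); any other count negates to FALSE
    keep <- !violations
    toy[keep, ]
    ```

    Only the first row survives here, because rows 2 and 3 each contain a 1 in a checked column. X5 is all 1s but is never tested.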
    
  • 2020-12-09 11:27

    Here's an idea that makes it fairly simple to choose the names. You can set up a list of calls to send to the .dots argument of filter_(). First, define a function that creates an unevaluated call.

    Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)
    

    Now we use filter_(), passing a list of calls into the .dots argument using lapply(), choosing any name and value you want.

    nm <- names(df) != "X5"
    filter_(df, .dots = lapply(names(df)[nm], Call, 2L))
    #   X1 X2 X3 X4 X5
    # 1  6  5  7  3  1
    # 2  8 10  3  6  5
    # 3  5  7 10  2  5
    # 4  3  4  2  9  9
    # 5  8  3  5  6  2
    # 6  9  3  4 10  9
    # 7  2  9  7  9  8
    

    You can have a look at the unevaluated calls created by Call(), for example X4 and X5, with

    lapply(names(df)[4:5], Call, 2L)
    # [[1]]
    # X4 >= 2L
    #
    # [[2]]
    # X5 >= 2L
    

    So if you adjust the names() in the X argument of lapply(), you should be fine.
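
    filter_() has, to my knowledge, been deprecated since dplyr 0.7.0. As a hedged sketch of the same idea in newer dplyr versions, the list of calls can be spliced into filter() with the !!! operator. The data frame toy below is made up for illustration.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the data from the question
    toy <- data.frame(X1 = c(7L, 1L, 5L), X2 = c(9L, 4L, 3L), X5 = c(1L, 1L, 1L))

    # Same helper as in this answer: builds an unevaluated call such as X1 >= 2L
    Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

    # One call per column, skipping X5
    calls <- lapply(setdiff(names(toy), "X5"), Call, 2L)

    # Splice the list of calls into filter(); the conditions are combined with AND
    res <- filter(toy, !!!calls)
    res
    ```

    The second row is dropped because X1 is 1 there, and X5 is ignored because no call was built for it.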

  • 2020-12-09 11:47

    Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector of row positions to slice, whereas filter takes a logical vector.

    df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
    

    What I like about this approach is that because we use select inside rowSums you can make use of all the special functions that select supplies, like matches for example.
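
    The integer-versus-logical distinction can be illustrated in base R alone. This is a small sketch with made-up data (toy is hypothetical) that includes a missing value, since that is where the two kinds of index behave differently: which() drops NA, while base R logical indexing returns an all-NA row for it.

    ```r
    # Hypothetical toy data with a missing value in the second row
    toy <- data.frame(X1 = c(7, NA, 1), X2 = c(9, 4, 8), X5 = c(1, 1, 1))

    # Logical vector over all columns but the third (X5): TRUE, NA, FALSE
    keep_lgl <- !rowSums(toy[-3] < 2)

    # Integer positions, the form slice() expects: which() drops FALSE and NA
    keep_int <- which(keep_lgl)

    toy[keep_int, ]   # first row only
    toy[keep_lgl, ]   # first row plus an all-NA row for the NA index
    ```

    Note that dplyr's filter() itself drops rows where the condition is NA, so the extra NA row is specific to base R logical indexing.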


    Let's see how it compares to the other answers:

    df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
    
    mbm <- microbenchmark(
        Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
        Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
        dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
        times = 50L,
        unit = "relative"
    )
    
    #Unit: relative
    #     expr      min       lq   median       uq      max neval
    #    Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
    #  Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
    # dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
    


    Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).


    Following a comment that base R would have the same speed as the slice approach (without specification of what base R approach is meant exactly), I decided to update my answer with a comparison to base R using almost the same approach as in my answer. For base R I used:

    base = df[!rowSums(df[-5L] < 2L), ],
    base_which = df[which(!rowSums(df[-5L] < 2L)), ]
    

    Benchmark:

    df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
    
    mbm <- microbenchmark(
      Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
      Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
      dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
      base = df[!rowSums(df[-5L] < 2L), ],
      base_which = df[which(!rowSums(df[-5L] < 2L)), ],
      times = 50L,
      unit = "relative"
    )
    
    #Unit: relative
    #       expr      min       lq   median       uq      max neval
    #      Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
    #    Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
    #   dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
    #       base 2.784058 2.769062 2.710305 2.669699 2.576825    50
    # base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50
    


    In this benchmark, neither of the two base R approaches is faster than or comparable to the slice approach.

    Edit note #2: added benchmark with base R options.

  • 2020-12-09 11:47

    If you only wanted to filter on the first four columns, as:

    df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2) 
    

    ...try this:

    df %>% 
      filter_at(vars(X1:X4), #<Select columns to filter
      all_vars(.>=2) )       #<Scope with all_vars (or any_vars)
    

    An alternative is to name the columns you'd like to leave out of the filter, as:

    df %>% 
      filter_at(vars(-X5),   #<Exclude column X5
      all_vars(.>=2) )
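
    In more recent dplyr releases the scoped verbs such as filter_at() are superseded. As a hedged sketch, assuming a dplyr version that provides if_all() and if_any() (1.0.4 or later, as far as I know), the same filters could be written as follows. The data frame toy is made up for illustration.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the data from the question
    toy <- data.frame(X1 = c(7L, 1L, 5L), X2 = c(9L, 4L, 3L), X5 = c(1L, 1L, 1L))

    # Counterpart of all_vars(): every column except X5 must be >= 2
    res_all <- toy %>% filter(if_all(-X5, ~ .x >= 2))

    # Counterpart of any_vars(): at least one column except X5 must be >= 2
    res_any <- toy %>% filter(if_any(-X5, ~ .x >= 2))
    ```

    With this data, res_all keeps rows 1 and 3, while res_any keeps all three rows because the second row still has X2 = 4.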
    