filtering with multiple conditions on many columns using dplyr

野趣味 2021-01-02 02:05

I've searched on SO for a solution, to no avail, so here it is. I have a data frame with many columns, some of which are numerical and should be non-negative. I w

7 answers
  • 2021-01-02 02:29

    Here is my ugly solution. Suggestions/criticisms welcome

    df %>% 
      # Select the columns we want
      select(matches("_num$")) %>%
      # Convert every column to logical if >= 0
      lapply(">=", 0) %>%
      # Reduce all the sublist with AND 
      Reduce(f = "&", .) %>%
      # Convert the single logical vector into numeric row
      # indices, since slice can't deal with logical vectors.
      # `{df[., ]}` would also work here, but which + slice
      # turned out to be faster than `[` in this case.
      which %>%
      slice(.data = df)
    
      id  sth1 tg1_num sth2 tg2_num others
    1  1  dave       2   ca      35    new
    2  4 leroy       0   az      25    old
    3  5 jerry       4   mi      55    old
    
  • 2021-01-02 02:31

    This gives you the indices of the rows that have a value less than 0 in any of the target columns:

    desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
    desired_rows <- as.vector(unique(unlist(desired_rows)))
    

    Then to get a df of your desired rows:

    setdiff(df, df[desired_rows,])
      id  sth1 tg1_num sth2 tg2_num others
    1  1  dave       2   ca      35    new
    2  4 leroy       0   az      25    old
    3  5 jerry       4   mi      55    old
    
  • 2021-01-02 02:43

    I wanted to see whether this was possible using standard evaluation with dplyr's filter_. It turns out it can be done with the help of interp from lazyeval, following the example code on this page. Essentially, you create a list of interp conditions and pass it to the .dots argument of filter_.

    library(lazyeval)
    
    dots <- lapply(target_columns, function(cols){
        interp(~y >= 0, .values = list(y = as.name(cols)))
    })
    
    filter_(df, .dots = dots)   
    
      id  sth1 tg1_num sth2 tg2_num others
    1  1  dave       2   ca      35    new
    2  4 leroy       0   az      25    old
    3  5 jerry       4   mi      55    old
    

    Update

    Starting with dplyr_0.7, this can be done directly with filter_at and all_vars (no lazyeval needed).

    df %>%
         filter_at(vars(target_columns), all_vars(. >= 0) )
    
      id  sth1 tg1_num sth2 tg2_num others
    1  1  dave       2   ca      35    new
    2  4 leroy       0   az      25    old
    3  5 jerry       4   mi      55    old
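
    In more recent dplyr releases (1.0.4 and later), scoped verbs such as filter_at are superseded by if_all(). A minimal sketch of the same filter, assuming a small made-up data frame (the rows containing negative values are invented for illustration and are not the original data) and a target_columns vector naming the two _num columns:

    ```r
    library(dplyr)

    # Hypothetical sample data: rows 2 and 3 are invented for illustration
    df <- data.frame(
      id      = 1:5,
      sth1    = c("dave", "tom", "jane", "leroy", "jerry"),
      tg1_num = c(2, 3, -1, 0, 4),
      sth2    = c("ca", "tx", "fl", "az", "mi"),
      tg2_num = c(35, -3, 15, 25, 55),
      others  = c("new", "old", "new", "old", "old"),
      stringsAsFactors = FALSE
    )
    # Assumed definition of the columns to check
    target_columns <- c("tg1_num", "tg2_num")

    # Keep only the rows where every target column is >= 0
    result <- df %>%
      filter(if_all(all_of(target_columns), ~ .x >= 0))
    result
    ```

    Wrapping the character vector in all_of() makes it explicit that target_columns is an external vector of column names rather than a column of df.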
    
  • 2021-01-02 02:43

    Using base R to get your result

    cond <- df[, grepl("_num$", colnames(df))] >= 0
    df[apply(cond, 1, function(x) {prod(x) == 1}), ]
    
      id  sth1 tg1_num sth2 tg2_num others
    1  1  dave       2   ca      35    new
    4  4 leroy       0   az      25    old
    5  5 jerry       4   mi      55    old
    

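    Edit: this assumes you have multiple columns ending in "_num". It won't work if you have just one _num column, because selecting a single column with df[, i] drops the result to a plain vector and apply then fails.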
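
    One way around the single-column limitation is to pass drop = FALSE when selecting the columns. A minimal sketch, assuming a made-up data frame with only one _num column, and using all() in place of the prod(x) == 1 test:

    ```r
    # Hypothetical sample data with a single _num column
    df <- data.frame(
      id      = 1:4,
      tg1_num = c(2, -1, 0, 4),
      others  = c("new", "old", "old", "new"),
      stringsAsFactors = FALSE
    )

    # drop = FALSE keeps the selection a data frame even for one column,
    # so the comparison still returns a (one-column) logical matrix
    cond <- df[, grepl("_num$", colnames(df)), drop = FALSE] >= 0

    # Keep the rows where every selected column passes the test
    res <- df[apply(cond, 1, all), ]
    res
    ```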

  • 2021-01-02 02:44

    First we create an index of all numeric columns. Then we keep the rows in which every numeric column is greater than or equal to zero. There is no need to check the column names: the id column is tested as well, but it is always positive.

    nums <- sapply(df, is.numeric)
    df[apply(df[, nums], MARGIN = 1, function(x) all(x >= 0)), ]
    

    Output:

      id  sth1 tg1_num sth2 tg2_num others
    1  1  dave       2   ca      35    new
    4  4 leroy       0   az      25    old
    5  5 jerry       4   mi      55    old
    
  • 2021-01-02 02:45

    Here's a possible vectorized solution

    ind <- grep("_num$", colnames(df))
    df[!rowSums(df[ind] < 0),]
    #   id  sth1 tg1_num sth2 tg2_num others
    # 1  1  dave       2   ca      35    new
    # 4  4 leroy       0   az      25    old
    # 5  5 jerry       4   mi      55    old
    

    The idea here is to create a logical matrix using the < function (it is a generic function with a data.frame method, so it can be applied to all selected columns at once). Then we use rowSums to count the matched conditions in each row (> 0 means at least one negative value, 0 means none). Then we use the ! function to turn that count into a logical vector: it coerces and negates, so > 0 becomes FALSE, while 0 becomes TRUE. Finally, we subset according to that vector, which keeps only the rows without negative values.
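
    The intermediate steps can be inspected one at a time. A minimal sketch, assuming a small made-up data frame (the values are illustrative only) and hypothetical variable names neg, n_neg and keep:

    ```r
    # Hypothetical sample data: values are invented for illustration
    df <- data.frame(
      id      = 1:5,
      tg1_num = c(2, 3, -1, 0, 4),
      tg2_num = c(35, -3, 15, 25, 55)
    )

    ind   <- grep("_num$", colnames(df))  # positions of the _num columns
    neg   <- df[ind] < 0                  # logical matrix, TRUE where a value is negative
    n_neg <- rowSums(neg)                 # number of negative values in each row
    keep  <- !n_neg                       # TRUE only where that count is 0

    kept <- df[keep, ]
    kept
    ```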
