Remove columns of dataframe based on conditions in R

死守一世寂寞 2020-12-10 00:00

I have to remove columns from my dataframe, which has over 4000 columns and 180 rows. The conditions I want to set for removing a column are: (i) Remove the c

2 answers
  • 2020-12-10 00:24

    Create logical vectors for each condition:

    # condition 1
    cond1 <- sapply(df, function(col) sum(!is.na(col)) < 2)
    
    # condition 2
    cond2 <- sapply(df, function(col) !any(diff(which(!is.na(col))) == 1))
    
    # condition 3
    cond3 <- sapply(df, function(col) all(is.na(col)))
    

    Then combine them into one mask:

    mask <- !(cond1 | cond2 | cond3)
    
    > df[, mask, drop = FALSE]
           A     E
    1  0.018    NA
    2  0.017    NA
    3  0.019    NA
    4  0.018    NA
    5  0.018    NA
    6  0.015 0.037
    7  0.016 0.031
    8  0.019 0.025
    9  0.016 0.035
    10 0.018 0.035
    11 0.017 0.043
    12 0.023 0.040
    13 0.022 0.042
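
    To try the three masks end to end, here is a minimal sketch. The data frame is made up for illustration (it is not the asker's data), with one column per case the conditions are meant to catch.

    ```r
    # Hypothetical sample data: 6 rows, one column per case.
    df <- data.frame(
      A = c(0.018, 0.017, 0.019, 0.018, 0.018, 0.015),  # fully populated
      B = NA_real_,                                      # all NA (recycled to 6 rows)
      C = c(NA, 0.02, NA, NA, NA, NA),                   # a single value
      D = c(0.01, NA, 0.02, NA, 0.03, NA),               # values, but never adjacent
      E = c(NA, NA, NA, 0.037, 0.031, 0.025)             # NAs, then a run of values
    )

    # condition 1: fewer than two non-NA values
    cond1 <- sapply(df, function(col) sum(!is.na(col)) < 2)
    # condition 2: no two non-NA values sit in consecutive rows
    cond2 <- sapply(df, function(col) !any(diff(which(!is.na(col))) == 1))
    # condition 3: the column is entirely NA
    cond3 <- sapply(df, function(col) all(is.na(col)))

    # Keep a column only if none of the removal conditions applies.
    mask <- !(cond1 | cond2 | cond3)
    kept <- df[, mask, drop = FALSE]
    names(kept)
    ```

    On this sample only A and E survive; B, C and D each trip at least one condition.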
    
  • 2020-12-10 00:26

    I feel like this is all over-complicated. Condition 2 already covers the other two: a column with at least two consecutive non-NA values obviously contains more than one value, and obviously is not all NAs. So the three conditions collapse into a single one. (I also prefer not to run several functions per column; instead, run diff once per column and vectorize the rest.)

    cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
    

    This works because diff returns NA whenever either of the two neighbouring values is NA, so a column with no two consecutive non-NA values turns into all NAs.
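
    A quick way to convince yourself is to run diff on two made-up vectors (the names `isolated` and `adjacent` are only for illustration):

    ```r
    # Hypothetical vectors illustrating how diff() propagates NA.
    isolated <- c(0.01, NA, 0.02, NA, 0.03)   # no two adjacent non-NA values
    adjacent <- c(NA, NA, 0.037, 0.031, NA)   # exactly one adjacent pair

    d_isolated <- diff(isolated)  # every difference involves an NA
    d_adjacent <- diff(adjacent)  # only 0.031 - 0.037 is computable

    all(is.na(d_isolated))   # TRUE: such a column would be dropped
    any(!is.na(d_adjacent))  # TRUE: such a column would be kept
    ```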

    Then, just

    df[, cond, drop = FALSE]
    #        A     E
    # 1  0.018    NA
    # 2  0.017    NA
    # 3  0.019    NA
    # 4  0.018    NA
    # 5  0.018    NA
    # 6  0.015 0.037
    # 7  0.016 0.031
    # 8  0.019 0.025
    # 9  0.016 0.035
    # 10 0.018 0.035
    # 11 0.017 0.043
    # 12 0.023 0.040
    # 13 0.022 0.042
    

    Per your edit, it seems like you have a data.table object and you also have a Date column so the code would need some modifications.

    cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
    df[, c(TRUE, cond), with = FALSE]
    

    Some explanations:

    • We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Subset of Data in data.table terms)
    • .N is just the row count (similar to nrow(df))
    • The next step is to subset by the condition. We must not forget to keep the first column too, so we start with c(TRUE, ...
    • Finally, data.table uses non-standard evaluation in j by default, so if you want to select columns the way you would in a data.frame, you need to specify with = FALSE
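
    As a rough end-to-end sketch of those steps, assuming the data.table package is installed; the table `dt` and its values are made up, with a Date column in first position as described above:

    ```r
    library(data.table)

    # Hypothetical data: a Date column first, then the measurement columns.
    dt <- data.table(
      Date = as.Date("2020-01-01") + 0:5,
      A = c(0.018, 0.017, 0.019, 0.018, 0.018, 0.015),  # fully populated
      D = c(0.01, NA, 0.02, NA, 0.03, NA),               # values never adjacent
      E = c(NA, NA, NA, 0.037, 0.031, 0.025)             # NAs, then a run of values
    )

    # One TRUE/FALSE per non-Date column: TRUE if diff() leaves at least one non-NA.
    keep <- dt[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1]

    # Prepend TRUE so the Date column is retained, then select data.frame-style.
    res <- dt[, c(TRUE, keep), with = FALSE]
    names(res)
    ```

    Here D is the only column dropped, because none of its values sit in consecutive rows.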

    A better way, though, would be to remove the columns by reference using := NULL:

    cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
    df[, which(cond) := NULL]
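
    Because := modifies the table in place, the dropped columns are gone from the original object as well. A minimal sketch on made-up data (again assuming data.table is installed; `dt` and `backup` are illustrative names) that takes a copy() first so the effect is visible:

    ```r
    library(data.table)

    # Hypothetical data: Date first, then measurement columns.
    dt <- data.table(
      Date = as.Date("2020-01-01") + 0:5,
      A = c(0.018, 0.017, 0.019, 0.018, 0.018, 0.015),
      D = c(0.01, NA, 0.02, NA, 0.03, NA),      # never two adjacent values
      E = c(NA, NA, NA, 0.037, 0.031, 0.025)
    )

    backup <- copy(dt)  # an explicit copy is unaffected by := on dt

    # FALSE for the Date column, then TRUE for every column whose diff() is all NA.
    cond <- c(FALSE, dt[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])

    # Delete the flagged columns in place; no reassignment of dt is needed.
    dt[, which(cond) := NULL]

    names(dt)      # D has been removed
    names(backup)  # the copy still has all four columns
    ```

    If you need the untouched table later, take the copy() before the deletion, as above.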
    