Remove columns of dataframe based on conditions in R

I have to remove columns from my data frame, which has over 4000 columns and 180 rows. The conditions I want to use to remove a column are: (i) Remove the c

2 Answers

一生所求

2020-12-10 00:24

Create logical vectors for each condition:

```
# condition 1: fewer than two non-NA values
cond1 <- sapply(df, function(col) sum(!is.na(col)) < 2)

# condition 2: no two consecutive non-NA values
cond2 <- sapply(df, function(col) !any(diff(which(!is.na(col))) == 1))

# condition 3: all values are NA
cond3 <- sapply(df, function(col) all(is.na(col)))
```

Then combine them into one mask:

```
mask <- !(cond1 | cond2 | cond3)

df[, mask, drop = FALSE]
#        A     E
# 1  0.018    NA
# 2  0.017    NA
# 3  0.019    NA
# 4  0.018    NA
# 5  0.018    NA
# 6  0.015 0.037
# 7  0.016 0.031
# 8  0.019 0.025
# 9  0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
```
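
To see how the three conditions behave, here is a minimal sketch on a made-up data frame. The column names and values (keep, single, isolated, empty) are hypothetical and are not the asker's data; each column is built to trigger one case.

```r
# Hypothetical toy data: one column per case (not the asker's data)
df <- data.frame(
  keep     = c(0.1, 0.2, NA, 0.4, NA, NA),  # has two adjacent values
  single   = c(NA, 0.5, NA, NA, NA, NA),    # fewer than two values
  isolated = c(0.3, NA, 0.7, NA, 0.9, NA),  # values are never adjacent
  empty    = rep(NA_real_, 6)               # all NA
)

# Same three conditions as in the answer above
cond1 <- sapply(df, function(col) sum(!is.na(col)) < 2)
cond2 <- sapply(df, function(col) !any(diff(which(!is.na(col))) == 1))
cond3 <- sapply(df, function(col) all(is.na(col)))

# A column survives only if none of the removal conditions holds
mask <- !(cond1 | cond2 | cond3)
result <- df[, mask, drop = FALSE]
names(result)
```

Only the keep column should survive here, since it is the only one with two non-NA values in adjacent rows.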

Happy的楠姐

2020-12-10 00:26
I feel like this is all over-complicated. Condition 2 already implies the other two: if a column has at least two consecutive non-NA values, it obviously contains more than one value, and a column with at least two non-NA values obviously isn't all NA. So instead of three conditions, this all reduces to a single one (I prefer not to run many functions per column; rather, run diff once per column and vectorize the rest):
```
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
```
This works because diff returns NA whenever either of two adjacent values is NA, so if a column has no two consecutive non-NA values, its entire diff result (nrow(df) - 1 values) will be NA.
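As a small illustration of that reasoning, the sketch below applies diff to two made-up vectors (x_consecutive and x_isolated are hypothetical names, not part of the question) and counts the NAs in each result.

```r
# Hypothetical vectors, each of length 5
x_consecutive <- c(NA, 0.2, 0.3, NA, 0.5)  # 0.2 and 0.3 sit in adjacent rows
x_isolated    <- c(0.1, NA, 0.3, NA, 0.5)  # no two values are adjacent

d1 <- diff(x_consecutive)  # exactly one difference is non-NA
d2 <- diff(x_isolated)     # every difference involves an NA

# Same test as the one-liner above, applied to a single vector:
# keep when fewer than (length - 1) differences are NA
keep1 <- sum(is.na(d1)) < length(x_consecutive) - 1
keep2 <- sum(is.na(d2)) < length(x_isolated) - 1
c(keep1, keep2)
```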

Then, just
```
df[, cond, drop = FALSE]
#        A     E
# 1  0.018    NA
# 2  0.017    NA
# 3  0.019    NA
# 4  0.018    NA
# 5  0.018    NA
# 6  0.015 0.037
# 7  0.016 0.031
# 8  0.019 0.025
# 9  0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
```
Per your edit, it seems you have a data.table object with a Date column as its first column, so the code needs some modifications.
```
cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
df[, c(TRUE, cond), with = FALSE]
```
Some explanations:
- We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Subset of Data in data.table)
- .N is just the row count (similar to nrow(df))
- The next step is to subset by the condition. We must not forget to keep the first column too, so we start with c(TRUE, ...
- Finally, data.table uses non-standard evaluation by default, so if you want to select columns as you would in a data.frame, you need to specify with = FALSE
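The points above can be tried on a small made-up table. This is only a sketch: the table dt and its columns (Date, A, B, C) are hypothetical, and it assumes the data.table package is installed and that the Date column comes first, as described in the answer.

```r
library(data.table)

# Hypothetical data: a Date column first, then three value columns
dt <- data.table(
  Date = as.Date("2020-01-01") + 0:4,
  A = c(1, 2, NA, 4, 5),    # has adjacent values -> should be kept
  B = c(1, NA, 3, NA, 5),   # values never adjacent -> should be dropped
  C = rep(NA_real_, 5)      # all NA -> should be dropped
)

# .SDcols = -1 skips the Date column; .N is the number of rows
cond <- dt[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1]

# c(TRUE, cond) keeps the Date column; with = FALSE makes j behave
# like data.frame column selection
kept <- dt[, c(TRUE, cond), with = FALSE]
names(kept)
```

Note that this returns a new table and leaves dt itself unchanged.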
A better way, though, would be to just remove the columns by reference using := NULL:
```
cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
```
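A minimal sketch of the by-reference removal on made-up data follows. The table dt, the variable names drop_flag and drop_names, and the length guard are all assumptions added for illustration; the guard simply avoids calling := when nothing needs to be dropped. Selecting by name rather than by position is a stylistic variation on the answer's which(cond).

```r
library(data.table)

# Hypothetical data: a Date column first, then two value columns
dt <- data.table(
  Date = as.Date("2020-01-01") + 0:4,
  A = c(1, 2, NA, 4, 5),   # has adjacent values
  B = c(1, NA, 3, NA, 5)   # values never adjacent
)

# TRUE marks a column to drop; the leading FALSE protects the Date column
drop_flag <- c(FALSE, dt[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
drop_names <- names(dt)[drop_flag]

# Defensive guard (an assumption, not in the original answer):
# only modify the table when there is something to remove
if (length(drop_names) > 0) {
  dt[, (drop_names) := NULL]  # modifies dt in place, no copy is made
}
names(dt)
```

Because := works by reference, dt itself is changed and no assignment back to dt is needed.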