r subscript of dataframe with condition values by a vector

Question

This seems like it should be easy, but it has kept me busy for a while.

I have a dataframe (df) with n columns and a vector with the same number (n) of values.

The values in the vector are thresholds for the observations in the columns of the dataframe. So the question is: how do I tell R to use a different threshold for each column?

I want to keep all the observations in the dataframe that fulfil the threshold for their column (above or below, it doesn't matter for the example). The observations that do not fulfil the threshold criterion should be set to 0.

I don't want a subset of the dataframe.

Can anyone help? Thanks a lot in advance.

Answer 1:

Given some example data and thresholds

set.seed(42)
dat <- data.frame(matrix(runif(100), ncol = 10))

## thresholds
thresh <- seq(0.5, 0.95, length.out = 10)
thresh

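we can use the mapply() function to work out which observations in each column of dat are greater than or equal to that column's threshold. mapply() pairs the first column with the first threshold, the second column with the second threshold, and so on. Using the resulting logical matrix as an index, we can replace the matching values with 0 via: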

dat[mapply(">=", dat, thresh)] <- 0

Here is the call in action:

> dat
          X1        X2         X3          X4         X5
1  0.9148060 0.4577418 0.90403139 0.737595618 0.37955924
2  0.9370754 0.7191123 0.13871017 0.811055141 0.43577158
3  0.2861395 0.9346722 0.98889173 0.388108283 0.03743103
4  0.8304476 0.2554288 0.94666823 0.685169729 0.97353991
5  0.6417455 0.4622928 0.08243756 0.003948339 0.43175125
6  0.5190959 0.9400145 0.51421178 0.832916080 0.95757660
7  0.7365883 0.9782264 0.39020347 0.007334147 0.88775491
8  0.1346666 0.1174874 0.90573813 0.207658973 0.63997877
9  0.6569923 0.4749971 0.44696963 0.906601408 0.97096661
10 0.7050648 0.5603327 0.83600426 0.611778643 0.61883821
           X6        X7          X8         X9          X10
1  0.33342721 0.6756073 0.042988796 0.58160400 0.6674265147
2  0.34674825 0.9828172 0.140479094 0.15790521 0.0002388966
3  0.39848541 0.7595443 0.216385415 0.35902831 0.2085699569
4  0.78469278 0.5664884 0.479398564 0.64563188 0.9330341273
5  0.03893649 0.8496897 0.197410342 0.77582336 0.9256447486
6  0.74879539 0.1894739 0.719355838 0.56364684 0.7340943010
7  0.67727683 0.2712866 0.007884739 0.23370340 0.3330719834
8  0.17126433 0.8281585 0.375489965 0.08998052 0.5150633298
9  0.26108796 0.6932048 0.514407708 0.08561206 0.7439746463
10 0.51441293 0.2405447 0.001570554 0.30521837 0.6191592400
> dat[mapply(">=", dat, thresh)] <- 0
> dat
          X1        X2         X3          X4         X5
1  0.0000000 0.4577418 0.00000000 0.000000000 0.37955924
2  0.0000000 0.0000000 0.13871017 0.000000000 0.43577158
3  0.2861395 0.0000000 0.00000000 0.388108283 0.03743103
4  0.0000000 0.2554288 0.00000000 0.000000000 0.00000000
5  0.0000000 0.4622928 0.08243756 0.003948339 0.43175125
6  0.0000000 0.0000000 0.51421178 0.000000000 0.00000000
7  0.0000000 0.0000000 0.39020347 0.007334147 0.00000000
8  0.1346666 0.1174874 0.00000000 0.207658973 0.63997877
9  0.0000000 0.4749971 0.44696963 0.000000000 0.00000000
10 0.0000000 0.0000000 0.00000000 0.611778643 0.61883821
           X6        X7          X8         X9          X10
1  0.33342721 0.6756073 0.042988796 0.58160400 0.6674265147
2  0.34674825 0.0000000 0.140479094 0.15790521 0.0002388966
3  0.39848541 0.7595443 0.216385415 0.35902831 0.2085699569
4  0.00000000 0.5664884 0.479398564 0.64563188 0.9330341273
5  0.03893649 0.0000000 0.197410342 0.77582336 0.9256447486
6  0.74879539 0.1894739 0.719355838 0.56364684 0.7340943010
7  0.67727683 0.2712866 0.007884739 0.23370340 0.3330719834
8  0.17126433 0.0000000 0.375489965 0.08998052 0.5150633298
9  0.26108796 0.6932048 0.514407708 0.08561206 0.7439746463
10 0.51441293 0.2405447 0.001570554 0.30521837 0.6191592400

It is instructive to look at what mapply() returns in this case (shown here for the original data, before the replacement):

> mapply(">=", dat, thresh)
         X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
 [1,]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [2,]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
 [3,] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [4,]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
 [5,]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [6,]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
 [7,]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [8,] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [9,]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[10,]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

and it is those logical values that are used to select the observations that meet the threshold. You can use a different binary operator from the one I used; see ?">" for the various options. When writing the mapply() call, think of it in terms of the left-hand side and right-hand side of the binary operator, such that an mapply() call would give:

mapply(">", lhs, rhs)

where we might write

lhs > rhs
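
As a minimal sketch of swapping the operator, the following zeroes the values that fall below their column's threshold instead. The toy data frame and the cutoff vector are made up for illustration and are not part of the original example.

```r
# Toy data: two columns, one threshold per column (illustrative values only).
toy <- data.frame(a = c(1, 5, 9), b = c(2, 4, 6))
cutoff <- c(4, 5)

# mapply() pairs column a with 4 and column b with 5,
# and returns a logical matrix with one column per data frame column.
below <- mapply("<", toy, cutoff)

# Set every value that is below its column's threshold to 0.
toy[below] <- 0
toy
```

Here column a is compared as a < 4 and column b as b < 5, which is the lhs < rhs reading described above.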

Update: As @DWin has answered the comment about two thresholds I will update my Answer to match.

thresh1 <- seq(0.05, 0.5, length.out = 10)
thresh2 <- seq(0.55, 0.95, length.out = 10)
set.seed(42)
dat <- data.frame(matrix(runif(100), ncol = 10))

l1 <- mapply(">", dat, thresh1)
l2 <- mapply("<", dat, thresh2)

We can see which elements match both constraints:

> l1 & l2
         X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
 [1,] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
 [2,] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
 [3,]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
 [4,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
 [5,] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
 [6,]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
 [7,] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [8,]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [9,] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE
[10,] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

and the same construct can be used to set the elements that match both constraints to 0:

dat[l1 & l2] <- 0
dat

> dat
          X1        X2         X3          X4         X5         X6        X7          X8
1  0.9148060 0.0000000 0.90403139 0.737595618 0.00000000 0.00000000 0.0000000 0.042988796
2  0.9370754 0.7191123 0.13871017 0.811055141 0.00000000 0.00000000 0.9828172 0.140479094
3  0.0000000 0.9346722 0.98889173 0.000000000 0.03743103 0.00000000 0.0000000 0.216385415
4  0.8304476 0.0000000 0.94666823 0.685169729 0.97353991 0.78469278 0.0000000 0.000000000
5  0.6417455 0.0000000 0.08243756 0.003948339 0.00000000 0.03893649 0.8496897 0.197410342
6  0.0000000 0.9400145 0.00000000 0.832916080 0.95757660 0.00000000 0.1894739 0.000000000
7  0.7365883 0.9782264 0.00000000 0.007334147 0.88775491 0.00000000 0.2712866 0.007884739
8  0.0000000 0.0000000 0.90573813 0.000000000 0.00000000 0.17126433 0.8281585 0.375489965
9  0.6569923 0.0000000 0.00000000 0.906601408 0.97096661 0.26108796 0.0000000 0.000000000
10 0.7050648 0.0000000 0.83600426 0.000000000 0.00000000 0.00000000 0.2405447 0.001570554
           X9          X10
1  0.00000000 0.0000000000
2  0.15790521 0.0002388966
3  0.35902831 0.2085699569
4  0.00000000 0.0000000000
5  0.00000000 0.0000000000
6  0.00000000 0.0000000000
7  0.23370340 0.3330719834
8  0.08998052 0.0000000000
9  0.08561206 0.0000000000
10 0.30521837 0.0000000000

Answer 2:

I like Gavin's answer better than mine, but here's a slightly different application of mapply using his data:

mapply(function(x,tt) ifelse(x >= tt, 0, x), dat, thresh)
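
One thing worth noting about this form: it returns a new object instead of modifying dat in place, and mapply() simplifies the per-column results to a matrix. A small sketch, using a made-up toy data frame and cutoff vector (not the data above), of converting the result back with as.data.frame():

```r
# Toy data: illustrative values only.
toy <- data.frame(a = c(1, 5, 9), b = c(2, 4, 6))
cutoff <- c(4, 5)

# Each column is compared with its own threshold; values at or above it become 0.
res <- mapply(function(x, tt) ifelse(x >= tt, 0, x), toy, cutoff)

is.matrix(res)                  # the simplified result is a matrix, not a data frame
toy_new <- as.data.frame(res)   # convert back if a data frame is needed
toy_new
```

The original toy is left untouched, so the result has to be assigned somewhere to be kept.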

In light of your second comment, my construction might be more generalizable than Gavin's.

Two threshold vectors:

mapply(function(x, lt, ht) ifelse(x <= lt | x >= ht , 0, x), dat, lothresh, hithresh)
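
The call above does not define lothresh and hithresh. A runnable sketch follows, assuming they are per-column lower and upper thresholds; the values are borrowed from thresh1 and thresh2 in the first answer purely for illustration. Note that this call zeroes the values outside the band, which is the opposite of what dat[l1 & l2] <- 0 does in the first answer.

```r
set.seed(42)
dat <- data.frame(matrix(runif(100), ncol = 10))

# Assumed threshold vectors; the answer itself does not say what they contain.
lothresh <- seq(0.05, 0.5, length.out = 10)
hithresh <- seq(0.55, 0.95, length.out = 10)

# Zero every value at or below the low threshold, or at or above the high one.
out <- mapply(function(x, lt, ht) ifelse(x <= lt | x >= ht, 0, x),
              dat, lothresh, hithresh)

# Cross-check against the logical-matrix indexing used in the first answer.
outside <- mapply("<=", dat, lothresh) | mapply(">=", dat, hithresh)
chk <- dat
chk[outside] <- 0
all.equal(unname(out), unname(as.matrix(chk)))
```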

Answer 3:

I'm not sure how this will work with data frames, but the following worked with matrices.
You can get a logical (boolean) representation of df under the given condition and then use it to index df when setting the values. Alternatively, you can get a vector with the indices of the matching elements and use that as the index vector to set the values. Hope that helps.
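
The code this answer refers to is not present in the thread as captured here. The following is only a sketch of the two ideas it describes, not the answerer's original code: a logical matrix used as an index, and a vector of positions obtained with which(). The matrix m and the cutoff vector are made-up illustrative values, and sweep() is my choice for the column-wise comparison.

```r
# Toy matrix: illustrative values only.
m <- matrix(c(1, 5, 9, 2, 4, 6), ncol = 2)
cutoff <- c(4, 5)

# sweep() compares each column j with cutoff[j], giving a logical matrix.
hit <- sweep(m, 2, cutoff, FUN = ">=")

# Variant 1: index with the logical matrix directly.
m1 <- m
m1[hit] <- 0

# Variant 2: index with the linear (column-major) positions of the TRUE cells.
idx <- which(hit)
m2 <- m
m2[idx] <- 0

identical(m1, m2)
```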

Source: https://stackoverflow.com/questions/10899823/r-subscript-of-dataframe-with-condition-values-by-a-vector

Tags

dataframe

subscript