Question
This seems like it should be easy, but it has kept me busy for a while.
I have a dataframe (df) with n columns and a vector with the same number (n) of values.
The values in the vector are thresholds for the observations in the columns of the dataframe. So the question is: how do I tell R to use a different threshold for each column?
I want to keep all the observations in the dataframe that fulfill the threshold for their column (above or below, it doesn't matter for the example). Observations that do not fulfill the threshold criterion should be set to 0.
I don't want a subset of the dataframe.
Can anyone help? Thanks a lot in advance.
Answer 1:
Given some example data and thresholds
set.seed(42)
dat <- data.frame(matrix(runif(100), ncol = 10))
## thresholds
thresh <- seq(0.5, 0.95, length.out = 10)
thresh
we can use the mapply() function to work out which observations in each column are greater than or equal to that column's threshold. Using those indices, we can replace the corresponding values with 0 via:
dat[mapply(">=", dat, thresh)] <- 0
Here is the call in action:
> dat
X1 X2 X3 X4 X5
1 0.9148060 0.4577418 0.90403139 0.737595618 0.37955924
2 0.9370754 0.7191123 0.13871017 0.811055141 0.43577158
3 0.2861395 0.9346722 0.98889173 0.388108283 0.03743103
4 0.8304476 0.2554288 0.94666823 0.685169729 0.97353991
5 0.6417455 0.4622928 0.08243756 0.003948339 0.43175125
6 0.5190959 0.9400145 0.51421178 0.832916080 0.95757660
7 0.7365883 0.9782264 0.39020347 0.007334147 0.88775491
8 0.1346666 0.1174874 0.90573813 0.207658973 0.63997877
9 0.6569923 0.4749971 0.44696963 0.906601408 0.97096661
10 0.7050648 0.5603327 0.83600426 0.611778643 0.61883821
X6 X7 X8 X9 X10
1 0.33342721 0.6756073 0.042988796 0.58160400 0.6674265147
2 0.34674825 0.9828172 0.140479094 0.15790521 0.0002388966
3 0.39848541 0.7595443 0.216385415 0.35902831 0.2085699569
4 0.78469278 0.5664884 0.479398564 0.64563188 0.9330341273
5 0.03893649 0.8496897 0.197410342 0.77582336 0.9256447486
6 0.74879539 0.1894739 0.719355838 0.56364684 0.7340943010
7 0.67727683 0.2712866 0.007884739 0.23370340 0.3330719834
8 0.17126433 0.8281585 0.375489965 0.08998052 0.5150633298
9 0.26108796 0.6932048 0.514407708 0.08561206 0.7439746463
10 0.51441293 0.2405447 0.001570554 0.30521837 0.6191592400
> dat[mapply(">=", dat, thresh)] <- 0
> dat
X1 X2 X3 X4 X5
1 0.0000000 0.4577418 0.00000000 0.000000000 0.37955924
2 0.0000000 0.0000000 0.13871017 0.000000000 0.43577158
3 0.2861395 0.0000000 0.00000000 0.388108283 0.03743103
4 0.0000000 0.2554288 0.00000000 0.000000000 0.00000000
5 0.0000000 0.4622928 0.08243756 0.003948339 0.43175125
6 0.0000000 0.0000000 0.51421178 0.000000000 0.00000000
7 0.0000000 0.0000000 0.39020347 0.007334147 0.00000000
8 0.1346666 0.1174874 0.00000000 0.207658973 0.63997877
9 0.0000000 0.4749971 0.44696963 0.000000000 0.00000000
10 0.0000000 0.0000000 0.00000000 0.611778643 0.61883821
X6 X7 X8 X9 X10
1 0.33342721 0.6756073 0.042988796 0.58160400 0.6674265147
2 0.34674825 0.0000000 0.140479094 0.15790521 0.0002388966
3 0.39848541 0.7595443 0.216385415 0.35902831 0.2085699569
4 0.00000000 0.5664884 0.479398564 0.64563188 0.9330341273
5 0.03893649 0.0000000 0.197410342 0.77582336 0.9256447486
6 0.74879539 0.1894739 0.719355838 0.56364684 0.7340943010
7 0.67727683 0.2712866 0.007884739 0.23370340 0.3330719834
8 0.17126433 0.0000000 0.375489965 0.08998052 0.5150633298
9 0.26108796 0.6932048 0.514407708 0.08561206 0.7439746463
10 0.51441293 0.2405447 0.001570554 0.30521837 0.6191592400
It is instructive to notice what mapply() returns in this case:
> mapply(">=", dat, thresh)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
[3,] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[5,] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[9,] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[10,] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
and it is those logical values that are used to select the observations that meet the threshold. You can use a different binary operator from the one I used; see ?">" for the various options. When writing the mapply() call, think of it in terms of the left-hand side and right-hand side of the binary operator, such that an mapply() call would look like:
mapply(">", lhs, rhs)
where we might write
lhs > rhs
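To make the left-hand/right-hand pairing concrete, here is a minimal sketch on a small hand-made data frame. The names toy, cutoff and below are made up for illustration and are not part of the original answer. It uses "<" instead of ">=", so values below each column's threshold are zeroed.

```r
# Hypothetical toy data: two columns, each with its own threshold
toy <- data.frame(a = c(1, 5, 9), b = c(10, 50, 90))
cutoff <- c(4, 60)

# mapply() pairs column a with 4 and column b with 60,
# i.e. it evaluates a < 4 and b < 60 and binds the results into a matrix
below <- mapply("<", toy, cutoff)

# Zero the values that fall below their column's threshold
toy[below] <- 0
toy   # a becomes 0, 5, 9 and b becomes 0, 0, 90
```

Swapping "<" for ">", "<=" or ">=" changes which side of the threshold gets zeroed; the rest stays the same.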
Update: as @DWin has addressed the comment about two thresholds, I will update my answer to match.
thresh1 <- seq(0.05, 0.5, length.out = 10)
thresh2 <- seq(0.55, 0.95, length.out = 10)
set.seed(42)
dat <- data.frame(matrix(runif(100), ncol = 10))
l1 <- mapply(">", dat, thresh1)
l2 <- mapply("<", dat, thresh2)
We can see which elements match both constraints:
> l1 & l2
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
[1,] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
[2,] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
[3,] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[5,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
[6,] TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
[7,] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[8,] TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
[9,] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
[10,] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
and the same construct can be used to set the matching elements to 0:
dat[l1 & l2] <- 0
dat
> dat
X1 X2 X3 X4 X5 X6 X7 X8
1 0.9148060 0.0000000 0.90403139 0.737595618 0.00000000 0.00000000 0.0000000 0.042988796
2 0.9370754 0.7191123 0.13871017 0.811055141 0.00000000 0.00000000 0.9828172 0.140479094
3 0.0000000 0.9346722 0.98889173 0.000000000 0.03743103 0.00000000 0.0000000 0.216385415
4 0.8304476 0.0000000 0.94666823 0.685169729 0.97353991 0.78469278 0.0000000 0.000000000
5 0.6417455 0.0000000 0.08243756 0.003948339 0.00000000 0.03893649 0.8496897 0.197410342
6 0.0000000 0.9400145 0.00000000 0.832916080 0.95757660 0.00000000 0.1894739 0.000000000
7 0.7365883 0.9782264 0.00000000 0.007334147 0.88775491 0.00000000 0.2712866 0.007884739
8 0.0000000 0.0000000 0.90573813 0.000000000 0.00000000 0.17126433 0.8281585 0.375489965
9 0.6569923 0.0000000 0.00000000 0.906601408 0.97096661 0.26108796 0.0000000 0.000000000
10 0.7050648 0.0000000 0.83600426 0.000000000 0.00000000 0.00000000 0.2405447 0.001570554
X9 X10
1 0.00000000 0.0000000000
2 0.15790521 0.0002388966
3 0.35902831 0.2085699569
4 0.00000000 0.0000000000
5 0.00000000 0.0000000000
6 0.00000000 0.0000000000
7 0.23370340 0.3330719834
8 0.08998052 0.0000000000
9 0.08561206 0.0000000000
10 0.30521837 0.0000000000
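If the goal is the reverse (keep the values that lie between the two thresholds and zero everything else), negating the mask should do it. A minimal sketch, assuming made-up names toy, lower, upper and inside that are not from the original answer:

```r
# Hypothetical toy data with a lower and an upper bound per column
toy <- data.frame(a = c(1, 5, 9), b = c(10, 50, 90))
lower <- c(2, 20)
upper <- c(8, 80)

# TRUE where a value lies strictly inside its column's band
inside <- mapply(">", toy, lower) & mapply("<", toy, upper)

# Negate the mask to zero everything outside its column's band
toy[!inside] <- 0
toy   # a becomes 0, 5, 0 and b becomes 0, 50, 0
```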
Answer 2:
I like Gavin's answer better than mine, but here's a slightly different application of mapply using his data:
mapply(function(x,tt) ifelse(x >= tt, 0, x), dat, thresh)
In light of your second comment, my construction might be more generalizable than Gavin's. With two threshold vectors:
mapply(function(x, lt, ht) ifelse(x <= lt | x >= ht , 0, x), dat, lothresh, hithresh)
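Note that lothresh and hithresh are not defined in the answer, and that mapply() hands back a matrix here rather than a data frame. A minimal runnable sketch, assuming hand-picked threshold values and a toy data frame (toy, res and out are illustrative names only):

```r
# Hypothetical toy data and thresholds, one lower/upper pair per column
toy <- data.frame(a = c(1, 5, 9), b = c(10, 50, 90))
lothresh <- c(2, 20)
hithresh <- c(8, 80)

# Same call shape as above: zero values at or beyond either threshold
res <- mapply(function(x, lt, ht) ifelse(x <= lt | x >= ht, 0, x),
              toy, lothresh, hithresh)

# mapply() simplifies the equal-length column results to a matrix,
# so convert back if a data frame is needed
out <- as.data.frame(res)
out   # a is 0, 5, 0 and b is 0, 50, 0
```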
Answer 3:
Not sure how it's going to work with data frames, but the following worked with matrices:
You can get a boolean representation of df under the given condition and then use it to index df and set the values. Alternatively, you can get a vector with the indexes of the matching fields and use it as an index vector to set the values. Hope that helps.
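The code for this answer did not survive, so the following is only a guess at what the two approaches might look like on a matrix. It uses sweep() to build the per-column comparison, which is my assumption and not necessarily what the answerer used; m, cutoff and mask are made-up names.

```r
# Hypothetical toy matrix with one threshold per column
m <- matrix(c(1, 5, 9, 10, 50, 90), ncol = 2)
cutoff <- c(4, 60)

# sweep() compares each column (MARGIN = 2) with its own threshold
mask <- sweep(m, 2, cutoff, FUN = ">=")

# Option 1: index with the logical matrix
m1 <- m
m1[mask] <- 0

# Option 2: index with the positions of the matching cells
m2 <- m
m2[which(mask)] <- 0

identical(m1, m2)   # both options zero the same cells
```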
Source: https://stackoverflow.com/questions/10899823/r-subscript-of-dataframe-with-condition-values-by-a-vector