If values in a range of columns aren't present in another column, replace with NA

问题

I have a dataset that includes some non-referenced data that I would like to replace with NA. In the following example, if the data in columns rep1 to rep4 does not match one of the values in the ID column, I would like to replace the value with NA. In this case, the values of x, y, and z aren't listed in the ID column, so they should be replaced.

This is a somewhat similar question that I asked earlier here : If data present, replace with data from another column based on row ID

I think the solution will be similar to what was given in the previous question, but I don't know how to alter the second portion ~ value[match(., ID)] to return NA for values that aren't listed in the ID column.

df %>% mutate_at(vars(rep1:rep4), ~ value[match(., ID)])

ID  rep1  rep2  rep3  rep4  
a                           
b   a                       
c   a     b                 
d   a     b     c           
e   a     b     c     d     
f                           
g   x                       
h                           
i                           
j   y     z                 
k   z                       
l                           
m

The result should look like this:

ID  rep1  rep2  rep3  rep4  
a                           
b   a                       
c   a     b                 
d   a     b     c           
e   a     b     c     d     
f                           
g   NA                      
h                           
i                           
j   NA    NA                    
k   NA                      
l                           
m

Here is the data using dput()

structure(list(ID = structure(1:13, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m"), class = "factor"), 
    rep1 = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 4L, 
    5L, 1L, 1L), .Label = c("", "a", "x", "y", "z"), class = "factor"), 
    rep2 = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 
    1L, 1L, 1L), .Label = c("", "b", "z"), class = "factor"), 
    rep3 = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("", "c"), class = "factor"), rep4 = structure(c(1L, 
    1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
    "d"), class = "factor")), class = "data.frame", row.names = c(NA, -13L))

回答1:

An dplyralternative using replace()

df %>%
  mutate_at(vars(rep1:rep4), ~replace(., which(!(. %in% ID | . == "")), NA))

   ID rep1 rep2 rep3 rep4
1   a                    
2   b    a               
3   c    a    b          
4   d    a    b    c     
5   e    a    b    c    d
6   f                    
7   g <NA>               
8   h                    
9   i                    
10  j <NA> <NA>          
11  k <NA>               
12  l                    
13  m

回答2:

We can use sapply and replace values to NA if they are not present in ID or have blank value.

df[!(sapply(df, `%in%`, df$ID) | df == '')] <- NA
df

#   ID rep1 rep2 rep3 rep4
#1   a                    
#2   b    a               
#3   c    a    b          
#4   d    a    b    c     
#5   e    a    b    c    d
#6   f                    
#7   g <NA>               
#8   h                    
#9   i                    
#10  j <NA> <NA>          
#11  k <NA>               
#12  l                    
#13  m

来源：https://stackoverflow.com/questions/58335382/if-values-in-a-range-of-columns-arent-present-in-another-column-replace-with-n

标签

dplyr

mutate