How to recode dataframe values to keep only those that satisfy a certain set, replace others with “other”

Question

I'm looking for a concise solution, preferably using dplyr, to clean up a dataframe column: values that belong to a given set should be kept as they are, and all other values should be recoded as "other".

Example

I have a dataframe with names of animals. There are 4 legit animal names, but other rows contain gibberish rather than names. I want to clean the column up, to keep only the legit animal names: zebra, lion, cow, or cat.

Data

library(tidyverse)
library(stringi)

real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
                                 pattern = c('[a-z]', '[0-9]', '[A-Z]')))

df <- tibble(animals = sample(c(real_animals_names, gibberish)))

> df

## # A tibble: 100 x 1
##    animals   
##    <chr>     
##  1 zebra     
##  2 zebra     
##  3 rbzal0677O
##  4 lion      
##  5 cat       
##  6 cfsgt0504G
##  7 cat       
##  8 jhixe2566V
##  9 lion      
## 10 zebra     
## # ... with 90 more rows

One way to solve the problem -- which I find annoying and not concise

Using dplyr 1.0.2

df %>%
  mutate(across(animals, recode,
                "lion" = "lion",
                "zebra" = "zebra",
                "cow" = "cow",
                "cat" = "cat",
                .default = "other"))

This gets it done, but this code repeats each animal name twice, and I find it clunky. Is there a cleaner solution, preferably using dplyr?

EDIT GIVEN SUGGESTED ANSWERS BELOW

Since I do like the readability of dplyr::recode, but dislike having to repeat each animal name twice; and since the answers below utilize %in% – could I incorporate %in% in my own recode solution to make it simpler/more concise?

Answer 1:

A base solution:

keep_names <- c('lion', 'zebra', 'cow', 'cat')

within(df, animals[!animals %in% keep_names] <- "other")
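
The same idea can be wrapped in a small helper so that the allowed set and the replacement label are each written once. This is only a sketch in base R: the name keep_or_other and its arguments are made up for illustration and do not come from any package, and it assumes a character vector (a factor would need "other" added to its levels first).

```r
# Hypothetical helper (not from any package): keep values found in `keep`,
# replace everything else with `other`. Assumes `x` is a character vector.
keep_or_other <- function(x, keep, other = "other") {
  x[!x %in% keep] <- other
  x
}

keep_names <- c("lion", "zebra", "cow", "cat")
animals <- c("zebra", "rbzal0677O", "lion", "cfsgt0504G", "cat")

keep_or_other(animals, keep_names)
# [1] "zebra" "other" "lion"  "other" "cat"
```

Because it takes and returns a plain vector, a helper like this could also be called inside mutate() on the animals column.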

A dplyr option with replace():

library(tidyverse)

df %>%
  mutate(animals = replace(animals, !animals %in% keep_names, "other"))

With recode(), you can build a named character vector from keep_names and unquote-splice it into the call with !!!, so each name is written only once.

df %>%
  mutate(animals = recode(animals, !!!set_names(keep_names), .default = "other"))

Note: set_names(keep_names) is equivalent to setNames(keep_names, keep_names).
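
To see what is being spliced, it can help to inspect the named vector on its own. The sketch below uses only base R's setNames(); the variable name lookup is an arbitrary choice for illustration.

```r
keep_names <- c("lion", "zebra", "cow", "cat")

# Each element is named after itself, so splicing it with !!! supplies
# recode() with the arguments "lion" = "lion", "zebra" = "zebra", and so on.
lookup <- setNames(keep_names, keep_names)

names(lookup)
# [1] "lion"  "zebra" "cow"   "cat"
unname(lookup)
# [1] "lion"  "zebra" "cow"   "cat"
```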

Answer 2:

You could keep the animals that you need as they are and turn the rest into "Others":

library(dplyr)

keep_names <- c('lion', 'zebra', 'cow', 'cat')

df %>% mutate(animals = ifelse(animals %in% keep_names, animals, 'Others'))
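
One detail worth checking with any %in%-based approach is how missing values are treated: NA %in% keep_names is FALSE, so an NA is relabelled along with the gibberish. The sketch below shows this on a small made-up vector in base R, together with one possible way to leave NA untouched. The example data is an assumption; the question's data contains no missing values.

```r
keep_names <- c("lion", "zebra", "cow", "cat")
animals <- c("zebra", "rbzal0677O", NA, "cat")  # made-up data with one NA

# NA is not in keep_names, so it is relabelled like any other non-match.
relabelled <- ifelse(animals %in% keep_names, animals, "Others")
relabelled
# [1] "zebra"  "Others" "Others" "cat"

# One way to keep NA as NA: treat missing values as "keep as is".
na_kept <- ifelse(is.na(animals) | animals %in% keep_names, animals, "Others")
na_kept
# [1] "zebra"  "Others" NA       "cat"
```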

Answer 3:

I know you asked preferably for a dplyr solution, but here is a data.table solution (note that I changed the tibble() call to data.table()):

library(stringi)
library(data.table)

real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
                                 pattern = c('[a-z]', '[0-9]', '[A-Z]')))

df <- data.table(animals = sample(c(real_animals_names, gibberish)))

keep_names <- c("lion", "zebra", "cow", "cat")
df[!animals %in% keep_names, animals := "other"]

Source: https://stackoverflow.com/questions/63916316/how-to-recode-dataframe-values-to-keep-only-those-that-satisfy-a-certain-set-re

Tags

dplyr

recode