Question
I'm looking for a concise solution, preferably using dplyr, to clean up values in a dataframe column: values that match a certain set should be kept as they are, and all other values should be recoded as "other".
Example
I have a dataframe with names of animals. There are 4 legitimate animal names, but other rows contain gibberish rather than names. I want to clean up the column so that it keeps only the legitimate animal names: zebra, lion, cow, or cat.
Data
library(tidyverse)
library(stringi)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
pattern = c('[a-z]', '[0-9]', '[A-Z]')))
df <- tibble(animals = sample(c(real_animals_names, gibberish)))
> df
## # A tibble: 100 x 1
## animals
## <chr>
## 1 zebra
## 2 zebra
## 3 rbzal0677O
## 4 lion
## 5 cat
## 6 cfsgt0504G
## 7 cat
## 8 jhixe2566V
## 9 lion
## 10 zebra
## # ... with 90 more rows
One way to solve the problem -- which I find annoying and not concise
Using dplyr 1.0.2
df %>%
  mutate(across(animals, recode,
                "lion" = "lion",
                "zebra" = "zebra",
                "cow" = "cow",
                "cat" = "cat",
                .default = "other"))
This gets it done, but the code repeats each animal name twice, which I find clunky. Is there a cleaner solution, preferably using dplyr?
EDIT GIVEN SUGGESTED ANSWERS BELOW
Since I like the readability of dplyr::recode but dislike having to repeat each animal name twice, and since the answers below use %in%: could I incorporate %in% into my own recode solution to make it simpler and more concise?
Answer 1:
A base R solution:
keep_names <- c('lion', 'zebra', 'cow', 'cat')
within(df, animals[!animals %in% keep_names] <- "other")
A dplyr option with replace():
library(tidyverse)
df %>%
mutate(animals = replace(animals, !animals %in% keep_names, "other"))
With recode(), you can use a named character vector and unquote-splice it with !!!.
df %>%
mutate(animals = recode(animals, !!!set_names(keep_names), .default = "other"))
Note: set_names(keep_names) is equivalent to setNames(keep_names, keep_names).
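To see why the splice removes the repetition, here is a minimal sketch on a small made-up sample (the toy tibble is an assumption for illustration, not the question's data). It checks that the spliced named vector yields the same result as spelling out each name = value pair by hand:

```r
library(dplyr)
library(purrr)  # provides set_names(); also loaded by library(tidyverse)

keep_names <- c("lion", "zebra", "cow", "cat")

# A named vector whose names equal its values, e.g. c(lion = "lion", ...)
lookup <- set_names(keep_names)

# Hypothetical toy data: three real names and two gibberish strings
toy <- tibble(animals = c("zebra", "abcde1234X", "cat", "lion", "qwert0000Z"))

# !!! splices each element of lookup into recode() as a name = value argument
spliced <- toy %>%
  mutate(animals = recode(animals, !!!lookup, .default = "other"))

# The same call with every pair written out by hand
explicit <- toy %>%
  mutate(animals = recode(animals,
                          lion = "lion", zebra = "zebra",
                          cow = "cow", cat = "cat",
                          .default = "other"))

identical(spliced$animals, explicit$animals)
```

Because the names are generated from keep_names, adding an animal to the set only requires changing that one vector.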
Answer 2:
You could keep the animals you need as they are and turn the rest into "Others":
library(dplyr)
keep_names <- c('lion', 'zebra', 'cow', 'cat')
df %>% mutate(animals = ifelse(animals %in% keep_names, animals, 'Others'))
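As a quick sanity check, the sketch below compares the replace() and ifelse() forms on a plain character vector in base R. The sample values, including the NA, are made up for illustration, and the label "other" is used in both calls so the results are comparable. One detail worth knowing: NA %in% keep_names is FALSE, so both forms turn missing values into the fallback label too.

```r
keep_names <- c("lion", "zebra", "cow", "cat")

# Hypothetical sample, including a missing value
animals <- c("zebra", "rbzal0677O", NA, "cat")

# Overwrite every element that is not in the allowed set
via_replace <- replace(animals, !animals %in% keep_names, "other")

# Keep elements in the allowed set, substitute the label otherwise
via_ifelse <- ifelse(animals %in% keep_names, animals, "other")

identical(via_replace, via_ifelse)
```

If missing values should stay NA, they would need to be handled separately, for example by adding is.na(animals) to the keep condition.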
Answer 3:
I know you asked for a dplyr solution if possible, but here is a data.table solution (note that I changed the tibble() call to data.table()):
library(stringi)
library(data.table)
real_animals_names <- sample(c("zebra", "cow", "lion", "cat"), size = 50, replace = TRUE)
gibberish <- do.call(paste0, Map(stri_rand_strings, n = 50, length=c(5, 4, 1),
pattern = c('[a-z]', '[0-9]', '[A-Z]')))
df <- data.table(animals = sample(c(real_animals_names, gibberish)))
keep_names <- c("lion", "zebra", "cow", "cat")
df[!animals %in% keep_names, animals := "other"]
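Note that := updates the data.table by reference, so the result does not need to be assigned back to df. A minimal sketch on a small made-up table (the name dt and the sample values are assumptions for illustration):

```r
library(data.table)

keep_names <- c("lion", "zebra", "cow", "cat")

# Hypothetical toy table
dt <- data.table(animals = c("zebra", "cfsgt0504G", "cow", "jhixe2566V"))

# Rows selected by the first argument are overwritten in place
dt[!animals %in% keep_names, animals := "other"]

dt$animals
```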
Source: https://stackoverflow.com/questions/63916316/how-to-recode-dataframe-values-to-keep-only-those-that-satisfy-a-certain-set-re