Question
I'm looking to replace entire string entries within data based on partial matches, using functions in the stringr package.
The only method I've tried is replacing exact matches with str_replace_all(), but this becomes tedious and unwieldy when there are dozens of variations to correct. In my reprex below, I replace variants of "Spaniard" and "Colombian" by direct specification. However, I would love to perform those replacements based on a condition such as "Spa" or "Col" appearing in the word.
library(tidyverse)
library(stringr)
data <- c(
  "Spanish",
  "SPANIARD",
  "Spainiard",
  "Colombian",
  "Columbian",
  "Ecuador",
  "Equador",
  "Ecuadorian",
  "VENEZUELAN"
)
str_replace_all(
  data,
  c(
    "Spanish" = "Spaniard",
    "SPANIARD" = "Spaniard",
    "Spainiard" = "Spaniard",
    "Columbian" = "Colombian"
  )
)
#> [1] "Spaniard" "Spaniard" "Spaniard" "Colombian" "Colombian"
#> [6] "Ecuador" "Equador" "Ecuadorian" "VENEZUELAN"
Created on 2019-05-21 by the reprex package (v0.2.1)
So str_replace_all() works as advertised, but I'm looking for a way to streamline this process in the tidyverse. Any help is much appreciated.
Answer 1:
I prefer to use a distance measure (e.g., Jaro-Winkler distance or some other string distance), but those have their drawbacks. Be wary of what you could be changing with partial matching: before doing it, it is wise to see what the possible values are. That said, you can do what you outlined in the tidyverse using case_when() with startsWith() or grepl():
tibble(data = data) %>%
  mutate(
    v1 = tolower(data),
    new_name = case_when(
      startsWith(v1, "spa") ~ "Spaniard",
      startsWith(v1, "col") ~ "Colombian",
      startsWith(v1, "eq") | startsWith(v1, "ec") ~ "Equadorian",
      startsWith(v1, "ven") ~ "Venezuelan",
      TRUE ~ as.character(data)))
# A tibble: 9 x 3
  data       v1         new_name
  <chr>      <chr>      <chr>
1 Spanish    spanish    Spaniard
2 SPANIARD   spaniard   Spaniard
3 Spainiard  spainiard  Spaniard
4 Colombian  colombian  Colombian
5 Columbian  columbian  Colombian
6 Ecuador    ecuador    Equadorian
7 Equador    equador    Equadorian
8 Ecuadorian ecuadorian Equadorian
9 VENEZUELAN venezuelan Venezuelan
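The grepl() route mentioned above is not shown, so here is a minimal sketch of the same idea written with stringr's str_detect(), which is closer to what the question asked for. It assumes that a case-insensitive "spa" or "col" prefix is enough to identify the two groups in your data. The helper starts_with_ci() and the column name clean are made up for this example.

```r
library(dplyr)
library(stringr)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

# Hypothetical helper: case-insensitive match anchored at the start of the string
starts_with_ci <- function(x, prefix) {
  str_detect(x, regex(paste0("^", prefix), ignore_case = TRUE))
}

result <- tibble(data = data) %>%
  mutate(
    clean = case_when(
      starts_with_ci(data, "spa") ~ "Spaniard",
      starts_with_ci(data, "col") ~ "Colombian",
      TRUE ~ data  # entries matching neither prefix are left unchanged
    )
  )

result
```

Dropping the "^" anchor turns this into an "anywhere in the word" match, which is more permissive and therefore more likely to catch entries you did not intend to change.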
To see the possibilities you can do this (or several other things):
tibble(data = data) %>%
  arrange(data) %>%
  count(tolower(data))
Answer 2:
One option is to use a string-distance method for the partial matching:
vals <- c("Spaniard", "Equador", "Colombian", "Venezuelan")
library(stringdist)
vals[amatch(tolower(data), tolower(vals),maxDist=5)]
#[1] "Spaniard" "Spaniard" "Spaniard" "Colombian" "Colombian"
#[6] "Equador" "Equador" "Equador" "Venezuelan"
It can also be used in a tidyverse workflow:
library(tidyverse)
tibble(v1 = data) %>%
mutate(v1 = vals[amatch(tolower(v1), tolower(vals), maxDist = 5)])
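Since maxDist = 5 is fairly permissive for words this short, it may be worth looking at the actual distances before settling on a threshold. The sketch below assumes the same data and vals as above and uses "osa", which is amatch()'s default method. The object name d is only for illustration.

```r
library(stringdist)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")
vals <- c("Spaniard", "Equador", "Colombian", "Venezuelan")

# Rows are the raw entries, columns are the candidate labels
d <- stringdistmatrix(tolower(data), tolower(vals), method = "osa")
dimnames(d) <- list(data, vals)
d

# Distance from each entry to its closest candidate label
apply(d, 1, min)
```

If the closest distances are all small, a tighter maxDist reduces the risk of an unrelated entry being pulled into one of the labels.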
Source: https://stackoverflow.com/questions/56240930/tidyverse-replacing-entire-strings-based-on-partial-matches