Question
I have a whole list of misspellings and I would like to change them all in one go. Is there an easy way to do so without writing a massive ifelse statement?
vegas <- c("North Las Vegas","N Las Vegas", "LAS VEGAS", "Las vegas","N. Las Vegas", "las vegas", "Las Vegas", "Las Vegas ", "South Las Vegas", "La Vegas", "Las Vegas, NV", "LasVegas",
"110 Las Vegas", "C Las Vegas", "Henderson and Las vegas",
"las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada",
"Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass",
"Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas", "NORTH LAS VEGAS", "North Las Vegas ", "Vegas")
data <- structure(list(city = c("Las Vegas", "Henderson", "North Las Vegas",
"Boulder City", "N Las Vegas", "Paradise", "LAS VEGAS", "Nellis AFB",
"Las vegas", "Blue Diamond", "N. Las Vegas", "Summerlin", "Spring Valley",
"HENDERSON", "las vegas", "Enterprise", "Las Vegas", "Clark",
"Las Vegas ", "Nellis Air Force Base", "South Las Vegas", "henderson",
"Nellis Afb", "La Vegas", "Las Vegas, NV", "LasVegas", "Summerlin South",
"110 Las Vegas", "Black Rock City", "boulder city", "C Las Vegas",
"Centennial Hills", "Central Henderson", "Citibank", "City Center",
"Decatur", "Green Valley", "Henderson (Green Valley)", "Henderson and Las vegas",
"Henderston", "Hendserson", "Hnederson", "Lake Las Vegas", "Lake Mead",
"las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada",
"Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass",
"Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas",
"Nellis", "NELLIS AFB", "Nevada", "NORTH LAS VEGAS", "North Las Vegas ",
"Pahrump", "Seven Hills", "Sunrise", "Sunrise Manor", "Vegas",
"W Henderson", "W Spring Valley", "Whitney"), count = c(29361L,
4892L, 1547L, 269L, 26L, 24L, 19L, 16L, 14L, 12L, 12L, 11L, 9L,
8L, 8L, 7L, 5L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -69L), class = c("tbl_df",
"tbl", "data.frame"))
So the goal is to correct the spelling in each misspelled row to "Las Vegas".
Answer 1:
Below is a solution very similar to the proposed mgsub approach, but using only base R functions (you might also want to add "Lake Las Vegas" to your list):
vegas <- c("North Las Vegas","N Las Vegas", "LAS VEGAS", "Las vegas","N. Las Vegas", "las vegas", "Las Vegas", "Las Vegas ", "South Las Vegas", "La Vegas", "Las Vegas, NV", "LasVegas",
"110 Las Vegas", "C Las Vegas", "Henderson and Las vegas",
"las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada",
"Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass",
"Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas", "NORTH LAS VEGAS", "North Las Vegas ", "Vegas")
data <- structure(list(city = c("Las Vegas", "Henderson", "North Las Vegas",
"Boulder City", "N Las Vegas", "Paradise", "LAS VEGAS", "Nellis AFB",
"Las vegas", "Blue Diamond", "N. Las Vegas", "Summerlin", "Spring Valley",
"HENDERSON", "las vegas", "Enterprise", "Las Vegas", "Clark",
"Las Vegas ", "Nellis Air Force Base", "South Las Vegas", "henderson",
"Nellis Afb", "La Vegas", "Las Vegas, NV", "LasVegas", "Summerlin South",
"110 Las Vegas", "Black Rock City", "boulder city", "C Las Vegas",
"Centennial Hills", "Central Henderson", "Citibank", "City Center",
"Decatur", "Green Valley", "Henderson (Green Valley)", "Henderson and Las vegas",
"Henderston", "Hendserson", "Hnederson", "Lake Las Vegas", "Lake Mead",
"las Vegas", "Las Vegas & Henderson", "Las Vegas East", "Las Vegas Nevada",
"Las Vegas NV", "Las Vegas Valley", "Las Vegas,", "Las Vegass",
"Las Vergas", "Los Vegas", "N E Las Vegas", "N W Las Vegas",
"Nellis", "NELLIS AFB", "Nevada", "NORTH LAS VEGAS", "North Las Vegas ",
"Pahrump", "Seven Hills", "Sunrise", "Sunrise Manor", "Vegas",
"W Henderson", "W Spring Valley", "Whitney"), count = c(29361L,
4892L, 1547L, 269L, 26L, 24L, 19L, 16L, 14L, 12L, 12L, 11L, 9L,
8L, 8L, 7L, 5L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, -69L), class = c("tbl_df",
"tbl", "data.frame"))
## function that takes a list of c(pattern, replacement) pairs and applies
## each substitution in turn to `string`; extra arguments are passed to gsub()
multisub <- function(replacement.list, string, ...) {
  mygsub <- function(l, x) gsub(pattern = l[1], replacement = l[2], x, ...)
  Reduce(mygsub, replacement.list, init = string, right = TRUE)
}
## make sure the matches correspond to entire string by adding delimiters
vegas <- paste0("^", vegas, "$")
## generate replacement list
mylist <- unlist(apply(cbind(vegas, rep("Las Vegas", length(vegas))), 1, list), recursive = FALSE)
## perform multiple replacement
data$city_replaced <- multisub(mylist, data$city)
data
#> city count city_replaced
#> 1 Las Vegas 29361 Las Vegas
#> 2 Henderson 4892 Henderson
#> 3 North Las Vegas 1547 Las Vegas
#> 4 Boulder City 269 Boulder City
#> 5 N Las Vegas 26 Las Vegas
#> 6 Paradise 24 Paradise
#> 7 LAS VEGAS 19 Las Vegas
#> 8 Nellis AFB 16 Nellis AFB
#> 9 Las vegas 14 Las Vegas
#> 10 Blue Diamond 12 Blue Diamond
#> 11 N. Las Vegas 12 Las Vegas
#> 12 Summerlin 11 Summerlin
#> 13 Spring Valley 9 Spring Valley
#> 14 HENDERSON 8 HENDERSON
#> 15 las vegas 8 Las Vegas
#> 16 Enterprise 7 Enterprise
#> 17 Las Vegas 5 Las Vegas
#> 18 Clark 4 Clark
#> 19 Las Vegas 4 Las Vegas
#> 20 Nellis Air Force Base 4 Nellis Air Force Base
#> 21 South Las Vegas 4 Las Vegas
#> 22 henderson 3 henderson
#> 23 Nellis Afb 3 Nellis Afb
#> 24 La Vegas 2 Las Vegas
#> 25 Las Vegas, NV 2 Las Vegas
#> 26 LasVegas 2 Las Vegas
#> 27 Summerlin South 2 Summerlin South
#> 28 110 Las Vegas 1 Las Vegas
#> 29 Black Rock City 1 Black Rock City
#> 30 boulder city 1 boulder city
#> 31 C Las Vegas 1 Las Vegas
#> 32 Centennial Hills 1 Centennial Hills
#> 33 Central Henderson 1 Central Henderson
#> 34 Citibank 1 Citibank
#> 35 City Center 1 City Center
#> 36 Decatur 1 Decatur
#> 37 Green Valley 1 Green Valley
#> 38 Henderson (Green Valley) 1 Henderson (Green Valley)
#> 39 Henderson and Las vegas 1 Las Vegas
#> 40 Henderston 1 Henderston
#> 41 Hendserson 1 Hendserson
#> 42 Hnederson 1 Hnederson
#> 43 Lake Las Vegas 1 Lake Las Vegas
#> 44 Lake Mead 1 Lake Mead
#> 45 las Vegas 1 Las Vegas
#> 46 Las Vegas & Henderson 1 Las Vegas
#> 47 Las Vegas East 1 Las Vegas
#> 48 Las Vegas Nevada 1 Las Vegas
#> 49 Las Vegas NV 1 Las Vegas
#> 50 Las Vegas Valley 1 Las Vegas
#> 51 Las Vegas, 1 Las Vegas
#> 52 Las Vegass 1 Las Vegas
#> 53 Las Vergas 1 Las Vegas
#> 54 Los Vegas 1 Las Vegas
#> 55 N E Las Vegas 1 Las Vegas
#> 56 N W Las Vegas 1 Las Vegas
#> 57 Nellis 1 Nellis
#> 58 NELLIS AFB 1 NELLIS AFB
#> 59 Nevada 1 Nevada
#> 60 NORTH LAS VEGAS 1 Las Vegas
#> 61 North Las Vegas 1 Las Vegas
#> 62 Pahrump 1 Pahrump
#> 63 Seven Hills 1 Seven Hills
#> 64 Sunrise 1 Sunrise
#> 65 Sunrise Manor 1 Sunrise Manor
#> 66 Vegas 1 Las Vegas
#> 67 W Henderson 1 W Henderson
#> 68 W Spring Valley 1 W Spring Valley
#> 69 Whitney 1 Whitney
Created on 2020-03-10 by the reprex package (v0.3.0)
Edit:
With the above approach you can append multiple replacement lists and apply them all at once. It also allows partial matching, although we have explicitly turned it off here using vegas <- paste0("^", vegas, "$").
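To make the effect of the "^" and "$" anchors concrete, here is a minimal sketch using two of the city strings from the question. The variable names partial and anchored are made up for illustration.

```r
## Two city strings taken from the question's data
x <- c("Vegas", "Lake Las Vegas")

## Without anchors, the pattern also matches inside a longer string
partial <- gsub("Vegas", "Las Vegas", x)

## With anchors, only a string that is exactly "Vegas" is replaced
anchored <- gsub("^Vegas$", "Las Vegas", x)

print(partial)   # "Las Vegas" "Lake Las Las Vegas"
print(anchored)  # "Las Vegas" "Lake Las Vegas"
```

The unanchored call mangles "Lake Las Vegas", which is why whole-string matching is the safer default for this kind of cleanup.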
If you have just one city and a list of alternative spellings, you could also simply match them up and replace them (using your original data data.frame and vegas vector, i.e. without the "^" and "$" delimiters):
data$city[data$city %in% vegas] <- "Las Vegas"
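If there is more than one target city, the same exact-matching idea can be extended with a named lookup vector. The following is only a sketch: the lookup vector and the small city sample are hypothetical and would need to be filled in with your own spellings.

```r
## Hypothetical lookup: names are the misspellings, values the canonical spelling
lookup <- c("LAS VEGAS" = "Las Vegas",
            "Los Vegas" = "Las Vegas",
            "HENDERSON" = "Henderson",
            "Hnederson" = "Henderson")

## Small sample of values (all of them appear in the question's data)
city <- c("LAS VEGAS", "Hnederson", "Paradise", "Los Vegas")

## match() returns the position in the lookup, or NA if there is no exact match
idx <- match(city, names(lookup))

## keep the original value wherever no replacement is defined
city_fixed <- ifelse(is.na(idx), city, unname(lookup[idx]))

print(city_fixed)  # "Las Vegas" "Henderson" "Paradise" "Las Vegas"
```

Like the %in% version, this only catches spellings that are listed explicitly; anything not in the lookup is left untouched.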
Answer 2:
I don't fully understand your example, but you can check for close matches (such as minor misspellings) using the Levenshtein distance. See here for examples in R: https://www.r-bloggers.com/natural-language-processing-in-r-edit-distance/
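As a rough sketch of that idea, base R's adist() (from the utils package) computes a generalized Levenshtein distance. The sample vector and the cutoff of 2 edits below are assumptions chosen for illustration, not values from the question; a suitable cutoff depends on your data.

```r
## Sample of spellings taken from the question's data
cities <- c("Las Vegas", "Las Vegass", "Las Vergas", "Los Vegas",
            "La Vegas", "Henderson", "Hnederson", "Lake Mead")

## Edit distance of every entry to the canonical spelling, ignoring case
d <- adist(cities, "Las Vegas", ignore.case = TRUE)[, 1]

## Assumed cutoff: treat anything within 2 edits as a misspelling of "Las Vegas"
max_dist <- 2
cities_fixed <- ifelse(d <= max_dist, "Las Vegas", cities)

print(cities_fixed)
```

A distance cutoff only catches small typos. Variants such as "North Las Vegas" or "Las Vegas, NV" are many edits away and would still need an explicit list.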
Source: https://stackoverflow.com/questions/60610601/how-to-correct-list-of-mispellings-at-once-in-r