R - Merging two data files based on partial matching of inconsistent full name formats

Posted by 落爺英雄遲暮 on 2019-12-07 01:58:29

For a first pass, I would suggest a two-stage process.

First, clean your strings. Normalize the casing, strip out extra spaces, strip out any unwanted characters. A function I use for a fairly aggressive cleaning is below:

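stringCleaning <- function(x) {
  # Drop everything except letters, digits, and whitespace.
  # (The range A-z would also let through [ \ ] ^ _ and `, so use A-Za-z.)
  x <- gsub("[^[:space:]A-Za-z0-9]", "", x)
  # Collapse runs of whitespace into a single space
  x <- gsub("\\s+", " ", x)
  # Lowercase, then trim leading and trailing whitespace
  stringr::str_trim(tolower(x))
}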

This removes every character that is not a letter, digit, or whitespace, collapses repeated whitespace into a single space, converts the string to lowercase, and trims whitespace from both ends.
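To see what the cleaning does to messy input, here is a small sketch in base R. `clean_name` is an illustrative name of my own that mirrors the function above but uses `trimws()` instead of stringr, and the sample strings are made up.

```r
# Illustrative base-R equivalent of the cleaning step (clean_name is a made-up name)
clean_name <- function(x) {
  x <- gsub("[^[:space:]A-Za-z0-9]", "", x)  # keep letters, digits, whitespace
  x <- gsub("\\s+", " ", x)                   # collapse whitespace runs
  trimws(tolower(x))                          # lowercase and trim both ends
}

samples <- c("  Jena   STARZ ", "O'Brien, Pat", "gina-starz")
clean_name(samples)
# "jena starz" "obrien pat" "ginastarz"
```

Note that punctuation is deleted rather than replaced by a space, so "gina-starz" becomes "ginastarz". That is only one edit away from "gina starz", so the distance step can still pair them up.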

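Second, use Levenshtein (or edit) distances to find your closest matches. The stringdist package includes a simple distance calculator to help you. Note that stringdist() defaults to method = "osa" (optimal string alignment, which also counts a swap of two adjacent characters as a single edit); pass method = "lv" if you want strict Levenshtein distance.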

# Distance from the target to each candidate
stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother'))
# Smallest distance among the candidates
min(stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother')))
# Position of the closest candidate
which.min(stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother')))
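The choice of distance method matters for names with swapped letters. A minimal sketch, assuming the stringdist package is installed; the example strings are made up:

```r
library(stringdist)

# "osa" (the package default) counts an adjacent swap as one edit
stringdist("jena", "jean", method = "osa")  # 1
# "lv" (plain Levenshtein) needs two substitutions for the same swap
stringdist("jena", "jean", method = "lv")   # 2
# "jw" (Jaro, or Jaro-Winkler with p > 0) is scaled to [0, 1] instead of counting edits
stringdist("jena", "jean", method = "jw")
```

For typed names, where transposed letters are a common mistake, the default is probably the friendlier choice.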

You can use this function to find the most appropriate match in another dataframe.

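# stringsAsFactors = FALSE keeps the names as character vectors on R < 4.0
df1 <- data.frame(name = c("Jena Stars", "Gina Starz"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("gina starz", "Jena starz  "), stringsAsFactors = FALSE)

# Normalize both name columns before comparing
df1$clean <- stringCleaning(df1$name)
df2$clean <- stringCleaning(df2$name)

# For each cleaned name in df1, pick the df2 name with the smallest distance
df1$check <- df2$name[sapply(df1$clean, function(x) {
  which.min(stringdist::stringdist(x, df2$clean))
})]
df1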
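One caveat: which.min() always returns some row, even when nothing in the other data frame is really close. A hedged sketch of one way to handle that is stringdist::amatch() with a maxDist cutoff, then pulling in the matched row's columns. The cutoff of 2, the id and score columns, and the data are all assumptions for illustration, so tune them to your own files.

```r
library(stringdist)

# Made-up data; the id and score columns are hypothetical
left  <- data.frame(id = 1:3,
                    clean = c("jena stars", "gina starz", "bob smith"),
                    stringsAsFactors = FALSE)
right <- data.frame(clean = c("gina starz", "jena starz"),
                    score = c(10, 20),
                    stringsAsFactors = FALSE)

# Index of the closest right-hand name within 2 edits, or NA if none is that close
idx <- amatch(left$clean, right$clean, method = "osa", maxDist = 2)

# Attach the matched row's columns; unmatched rows get NA
left$matched <- right$clean[idx]
left$score   <- right$score[idx]
left
```

Here "bob smith" has no counterpart within 2 edits, so it ends up with NA instead of being forced onto an unrelated name.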

A small example, but I hope it's helpful.
