Question
I am trying to detect matches between an open text field (read: messy!) and a vector of names. I created a silly fruit example that highlights my main challenges.
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
entry = c("Apple",
"I love apples",
"appls",
"Bannanas",
"banana",
"An apple a day keeps..."))
df1$entry <- as.character(df1$entry)
df2 <- data.frame(fruit=c("apple",
"banana",
"pineapple"),
code=c(11, 12, 13))
df2$fruit <- as.character(df2$fruit)
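library(dplyr)
library(stringr)
# Note: str_detect() compares entry i with fruit i; when this was run, the three
# fruits were recycled over the six rows (recent stringr releases may reject the length mismatch)
df1 %>%
  mutate(match = str_detect(str_to_lower(entry),
                            str_to_lower(df2$fruit)))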
My approach grabs the low hanging fruit, if you will (exact matches for "Apple" and "banana").
# id entry match
#1 1 Apple TRUE
#2 2 I love apples FALSE
#3 3 appls FALSE
#4 4 Bannanas FALSE
#5 5 banana TRUE
#6 6 An apple a day keeps... FALSE
The unmatched cases have different challenges:
- The target fruit in cases 2 and 6 is embedded in a larger string.
- The target fruit in cases 3 and 4 requires a fuzzy match.
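As a rough illustration of both cases, base R's agrepl() does approximate substring matching, so it tolerates surrounding words as well as small typos. The sketch below is only an example under stated assumptions: the vectors entries and fruits are copied from df1 and df2, and max.distance = 1 is an arbitrary tolerance that would need tuning on real data.

```r
# Sample vectors copied from the question
entries <- c("Apple", "I love apples", "appls", "Bannanas", "banana",
             "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")

# One column per fruit: TRUE where the fruit occurs, approximately,
# somewhere inside the entry (max.distance = 1 is an assumed tolerance)
hits <- sapply(fruits, function(f) {
  agrepl(f, entries, max.distance = 1, ignore.case = TRUE)
})
rownames(hits) <- entries
hits
```

A larger tolerance catches more misspellings but also raises the risk of false positives, especially for short names.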
The fuzzywuzzyR package is great and does a pretty good job (see the package page for details on installing the required Python modules).
library(fuzzywuzzyR)
choices <- df2$fruit
word <- df1$entry[3] # "appls"
init_proc <- FuzzUtils$new()
PROC <- init_proc$Full_process   # string pre-processor
PROC1 <- tolower                 # alternative processor (not used below)
init_scor <- FuzzMatcher$new()
SCOR <- init_scor$WRATIO         # scoring function
init <- FuzzExtract$new()
init$Extract(string = word,
             sequence_strings = choices,
             processor = PROC,
             scorer = SCOR)
This setup returns a score of 80 for "apple" (the highest).
Is there another approach to consider aside from fuzzywuzzyR? How would you tackle this problem?
Adding the fuzzywuzzyR output:
[[1]]
[[1]][[1]]
[1] "apple"
[[1]][[2]]
[1] 80
[[2]]
[[2]][[1]]
[1] "pineapple"
[[2]][[2]]
[1] 72
[[3]]
[[3]][[1]]
[1] "banana"
[[3]][[2]]
[1] 18
Answer 1:
I found this question referenced while answering another one today, so I thought I would answer the original.
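library(dplyr)
library(fuzzyjoin)
df1 %>%
  # fuzzy-join entries to fruits, storing the Jaro-Winkler distance in `dist`
  stringdist_left_join(df2, by = c(entry = "fruit"), ignore_case = TRUE,
                       method = "jw", distance_col = "dist") %>%
  group_by(entry) %>%
  # keep the closest fruit for each entry
  top_n(-1, dist) %>%
  select(-dist)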
Output is:
id entry fruit code
<dbl> <fct> <fct> <dbl>
1 1.00 Apple apple 11.0
2 2.00 I love apples pineapple 13.0
3 3.00 appls apple 11.0
4 4.00 Bannanas banana 12.0
5 5.00 banana banana 12.0
6 6.00 An apple a day keeps... apple 11.0
Sample data:
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
entry = c("Apple", "I love apples", "appls", "Bannanas", "banana", "An apple a day keeps..."))
df2 <- data.frame(fruit=c("apple", "banana", "pineapple"), code=c(11, 12, 13))
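In the output above, row 2 ("I love apples") is paired with pineapple rather than apple, presumably because the Jaro-Winkler distance is computed on the whole entry. One possible refinement, sketched here with base R only, is to compare each word of an entry to the candidate names with adist() (Levenshtein distance) and keep the closest one. Note that best_fruit() is a hypothetical helper introduced for illustration, not part of fuzzyjoin, and the word-splitting rule is an assumption.

```r
# Sample vectors copied from df1$entry and df2$fruit
entries <- c("Apple", "I love apples", "appls", "Bannanas", "banana",
             "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")

# Hypothetical helper: pick the candidate with the smallest Levenshtein
# distance to any single word of the entry
best_fruit <- function(entry, candidates) {
  # assumed tokenisation: lower-case, split on anything that is not a-z
  words <- strsplit(tolower(entry), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  d <- adist(words, candidates)            # words x candidates matrix
  candidates[which.min(apply(d, 2, min))]  # closest candidate overall
}

matched <- vapply(entries, best_fruit, character(1),
                  candidates = fruits, USE.NAMES = FALSE)
data.frame(entry = entries, fruit = matched)
```

On this sample, the word-level comparison pairs "I love apples" with apple. It has no distance cut-off, so every entry gets some fruit even when nothing is close; a threshold on the minimum distance would be needed to leave poor matches unassigned.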
Source: https://stackoverflow.com/questions/47271685/fuzzy-matching-in-r