Fuzzy match row in one column with same row in next column

问题

I would like to find information in one column based on the other column. So I have some words in one column and complete sentences in another. I would like to know whether it finds the words in those sentences. But sometimes the words are not the same so I cannot use the SQL like function. Thus I think fuzzy matching + some sort of 'like' function would be helpful as the data looks like this:

Names                    Sentences
Airplanes Sarl           Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.                Cola Inc. is 50% share of PopsiCo which is part of LaLo.

The data has about 2,000 rows which need a logic to find whether Airplanes Sarl is indeed in the sentence or not, and it also goes for Kidco Ltd. which is in the sentence as 'Kidco.Ltd'.

To simplify matters, I do not need it to search for ALL sentences in the column, it only needs to look for the word Kidco Ltd. and search for it in the same row of the dataframe.

I have already tried it in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)

But I got a lot of unicode /ascii errors so I gave up and would like to try in R. Any suggestions on how to go about this in R? I have seen answers on Stackoverflow that would fuzzy match all sentences in the column, which is different from what I want. Any suggestions?

回答1:

Maybe try tokenization + phonetic matching:

library(RecordLinkage)
library(quanteda)
df <- read.table(header=T, sep=";", text="
Names                    ;Sentences
Airplanes Sarl           ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               ;Airplanes-Sàrl is part of Airplanes-Group Sarl. 
Kidco Ltd.               ;100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.                ;Cola Inc. is 50% share of PopsiCo which is part of LaLo.
Popsi Co.                ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.")
f <- soundex
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co"
tokens <- lapply(tokens, f)
mapply(is.element, soundex(df$Names), tokens)
 # A614  K324  K324  P122  P122 
 # TRUE FALSE  TRUE  TRUE  TRUE

回答2:

Here's a solution using the method I suggested in the comments, in this example it works well:

library("stringdist")

df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.",
                             "Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.",
                             "Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.",
                             "some company","It is a truth universally acknowledged...",
                             "Hello world",list(NULL)),
                     ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE)

null_elements <- which(sapply(df$Sentences,is.null))
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors
df$dist <- mapply(stringdist,df$Names,df$Sentences)
df$n2 <- nchar(df$Sentences)
df$n1 <- nchar(df$Names)
df$match_quality <- df$dist-(df$n2-df$n1)
cutoff <- 2
df$match <- df$match_quality <= cutoff
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff

# Names                                                Sentences dist n2 n1 match_quality match
# 1 Airplanes Sarl          Airplanes-Sàrl is part of Airplanes-Group Sarl.   33 47 14             0  TRUE
# 2     Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.   42 51 10             1  TRUE
# 3      Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4   some company                It is a truth universally acknowledged...   36 41 12             7 FALSE
# 5    Hello world                                                     NULL   11  0 11            22    NA

来源：https://stackoverflow.com/questions/44244948/fuzzy-match-row-in-one-column-with-same-row-in-next-column

标签

fuzzy-comparison