Question
I would like to check whether the value in one column appears in another column of the same row. One column holds company names and the other holds complete sentences, and I want to know whether each name occurs in its sentence. The spelling sometimes differs, so I cannot use SQL's LIKE operator. I think fuzzy matching combined with some sort of 'like' function would help, as the data looks like this:
Names           Sentences
Airplanes Sarl  Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.
Popsi Co.       Cola Inc. is 50% share of PopsiCo which is part of LaLo.
The data has about 2,000 rows, and I need some logic to decide whether Airplanes Sarl is indeed in its sentence or not. The same goes for Kidco Ltd., which appears in its sentence as 'Kidco.Ltd.'.
To simplify matters, it does not need to search ALL sentences in the column; it only needs to take the name (e.g. Kidco Ltd.) and search for it in the same row of the dataframe.
I have already tried it in Python with: df.apply(lambda s: fuzz.ratio(s['Names'], s['Sentences']), axis=1)
But I got a lot of Unicode/ASCII errors, so I gave up and would like to try in R. I have seen answers on Stack Overflow that fuzzy match against all sentences in the column, which is different from what I want. Any suggestions on how to go about this in R?
Answer 1:
Maybe try tokenization + phonetic matching:
library(RecordLinkage)
library(quanteda)
df <- read.table(header=TRUE, sep=";", text="
Names ;Sentences
Airplanes Sarl ;Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd. ;Airplanes-Sàrl is part of Airplanes-Group Sarl.
Kidco Ltd. ;100% ownership of Kidco.Ltd. is the mother company.
Popsi Co. ;Cola Inc. is 50% share of PopsiCo which is part of LaLo.
Popsi Co. ;Cola Inc. is 50% share of Popsi Co which is part of LaLo.")
f <- soundex
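# Note: tokenize() is from older quanteda releases; newer versions provide tokens() and tokens_ngrams() instead.
tokens <- tokenize(as.character(df$Sentences), ngrams = 1:2) # 2-grams to catch "Popsi Co"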
tokens <- lapply(tokens, f)
mapply(is.element, soundex(df$Names), tokens)
#  A614  K324  K324  P122  P122
#  TRUE FALSE  TRUE  TRUE  TRUE
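To make the n-gram step above more concrete, here is a minimal base-R sketch of what building 1- and 2-word n-grams from a sentence could look like. The helper name `word_ngrams` is hypothetical and it splits on whitespace only, which is an assumption for illustration and simpler than what quanteda's tokenizer does.

```r
# Hypothetical helper: all 1- and 2-word n-grams of a sentence, splitting on
# whitespace only (an assumption; quanteda's tokenizer is more sophisticated).
word_ngrams <- function(sentence) {
  words <- strsplit(trimws(sentence), "\\s+")[[1]]
  if (length(words) < 2) return(words)
  # Pair each word with its successor to form the 2-grams.
  bigrams <- paste(head(words, -1), tail(words, -1))
  c(words, bigrams)
}

grams <- word_ngrams("Cola Inc. is 50% share of Popsi Co which is part of LaLo.")
print(grams)

# The 2-gram "Popsi Co" is among the candidates, so a phonetic code computed
# on it can be compared with the code of the name "Popsi Co.".
print("Popsi Co" %in% grams)
```

Each candidate n-gram would then be passed through the phonetic function (soundex in the answer) and compared with the code of the name from the same row.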
Answer 2:
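Here's a solution using the method I suggested in the comments; it works well on this example. The idea is to compute the edit distance between the name and the whole sentence, then subtract the difference in length, since that many edits are needed just to account for the extra characters. What remains (match_quality) is small when the name is essentially contained in the sentence: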
library("stringdist")
df <- as.data.frame(matrix(c("Airplanes Sarl","Airplanes-Sàrl is part of Airplanes-Group Sarl.",
"Kidco Ltd.","100% ownership of Kidco.Ltd. is the mother company.",
"Popsi Co.","Cola Inc. is 50% share of PopsiCo which is part of LaLo.",
"some company","It is a truth universally acknowledged...",
"Hello world",list(NULL)),
ncol=2,byrow=TRUE,dimnames=list(NULL,c("Names","Sentences"))),stringsAsFactors=FALSE)
null_elements <- which(sapply(df$Sentences,is.null))
df$Sentences[null_elements] <- "" # replacing NULLs to avoid errors
df$dist <- mapply(stringdist,df$Names,df$Sentences)
df$n2 <- nchar(df$Sentences)
df$n1 <- nchar(df$Names)
df$match_quality <- df$dist-(df$n2-df$n1)
cutoff <- 2
df$match <- df$match_quality <= cutoff
df$Sentences[null_elements] <- list(NULL) # setting null elements back to initial value
df$match[null_elements] <- NA # optional, set to FALSE otherwise, as it will prevent some false positives if Names is shorter than cutoff
#            Names                                                Sentences dist n2 n1 match_quality match
# 1 Airplanes Sarl          Airplanes-Sàrl is part of Airplanes-Group Sarl.   33 47 14             0  TRUE
# 2     Kidco Ltd.      100% ownership of Kidco.Ltd. is the mother company.   42 51 10             1  TRUE
# 3      Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of LaLo.   48 56  9             1  TRUE
# 4   some company                It is a truth universally acknowledged...   36 41 12             7 FALSE
# 5    Hello world                                                     NULL   11  0 11            22    NA
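A possible variation on the same row-wise idea, sketched here with base R only: `adist()` with `partial = TRUE` measures the edit distance between the name and the best-matching substring of the sentence, so no length correction is needed. This is a different technique from the answer's, not a restatement of it; the helper `partial_dist` and the cutoff `max_edits` are hypothetical names chosen for illustration, and the accented character is written as an escape to sidestep source-encoding issues.

```r
# Row-wise approximate substring matching with base R (no extra packages).
# Assumption: a name "matches" when it can be turned into some substring of
# its sentence with at most `max_edits` single-character edits.
df <- data.frame(
  Names = c("Airplanes Sarl", "Kidco Ltd.", "Popsi Co.", "some company"),
  Sentences = c(
    "Airplanes-S\u00e0rl is part of Airplanes-Group Sarl.",
    "100% ownership of Kidco.Ltd. is the mother company.",
    "Cola Inc. is 50% share of PopsiCo which is part of LaLo.",
    "It is a truth universally acknowledged..."
  ),
  stringsAsFactors = FALSE
)

# adist() returns a matrix; for a single pair it is 1 x 1, so take [1, 1].
partial_dist <- function(name, sentence) {
  adist(name, sentence, partial = TRUE)[1, 1]
}

max_edits <- 2  # hypothetical cutoff, playing the role of `cutoff` above
df$dist  <- mapply(partial_dist, df$Names, df$Sentences, USE.NAMES = FALSE)
df$match <- df$dist <= max_edits
print(df[, c("Names", "dist", "match")])
```

With this cutoff the Kidco and Popsi rows are flagged as matches and the unrelated "some company" row is not. Because the matched characters must be contiguous in the sentence, this may be less prone to false positives on long sentences than comparing against the whole sentence, though the cutoff would still need tuning on the real 2,000 rows.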
Source: https://stackoverflow.com/questions/44244948/fuzzy-match-row-in-one-column-with-same-row-in-next-column