I'm trying to use “stringdist” to fuzzy match company names between two data frames, but it isn't working very well. What can be done?

Posted by 主宰稳场 on 2019-12-08 06:06:58

Question


I have a data frame with 5 million different company names. Many of them refer to the same company, spelled in different ways or misspelled. I use the company name "Amminex" as an example here and compute its string distance to each of the 5 million company names:

Companylist <- data.frame(Companies=c('AMMINEX'))

This is the big list of company names that I load:

Biglist <- data.frame(name=c(Biglist[,]))

I pair AMMINEX with each of the 5 million companies in one data frame:

Matches <- expand.grid(Companylist$Companies,Biglist$name.Companiesnames)

Change the column names:

names(Matches) <- c("Companies","CompaniesList")

I use stringdist with the cosine method:

library(stringdist)
Matches$dist <- stringdist(Matches$Companies,Matches$CompaniesList, method="cosine")
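
To make clear what this distance measures: with its default `q = 1`, the cosine method in stringdist compares counts of single characters, so the order of the letters plays no role. Below is a minimal base R sketch of that calculation. `char_cosine_dist` is a hypothetical helper written for illustration, not a stringdist function, and "XENIMMA" is a made-up string that is not taken from the real list.

```r
# Hypothetical helper (not part of stringdist): cosine distance between
# single-character count vectors, i.e. the q = 1 case.
char_cosine_dist <- function(a, b) {
  sa <- strsplit(a, "")[[1]]
  sb <- strsplit(b, "")[[1]]
  chars <- union(sa, sb)
  # Count every character of the combined alphabet in each string
  va <- as.numeric(table(factor(sa, levels = chars)))
  vb <- as.numeric(table(factor(sb, levels = chars)))
  1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

d_reversed <- char_cosine_dist("AMMINEX", "XENIMMA")    # same letters, reversed
d_suffix   <- char_cosine_dist("AMMINEX", "AMMINEX AS") # same name plus " AS"
print(c(reversed = d_reversed, suffix = d_suffix))
```

In this sketch the reversed string gets distance 0, while "AMMINEX AS" gets roughly 0.11. Any name built from the same letters in similar proportions can therefore rank above a genuine variant of the name.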

I remove all rows with a distance above 0.2 to get rid of bad matches:

Matches_trimmed <- Matches[!(Matches$dist>0.2),]

I sort by the distance column so that the best matches appear at the top:

Matches_trimmed <- Matches_trimmed[with(Matches_trimmed, order(dist)), ]

As you can see here, the results are not very satisfactory:

The first row is good, but then a bunch of bad matches appear, and only at the very bottom do I get the "AMMINEX AS" matches, which are good.

This doesn't really work for me. Is there any way I can improve this fuzzy matching, or maybe use a different method that gives better results? Maybe a method that looks at the order in which the letters appear in the strings?
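
As a point of comparison, an edit distance does depend on letter order. The sketch below uses base R's `adist` (Levenshtein distance) scaled by the length of the longer string. `norm_edit_dist` is a hypothetical helper, and the candidate strings are made up for illustration. stringdist itself also offers order-sensitive options such as `method = "lv"`, `method = "jw"`, or a larger `q` for the cosine method. Which of these suits 5 million company names is an open assumption, not something tested here.

```r
# Hypothetical helper: Levenshtein distance scaled by the longer string's length
norm_edit_dist <- function(a, b) {
  d <- as.vector(adist(a, b))  # adist() returns a matrix; flatten it
  d / pmax(nchar(a), nchar(b))
}

# Made-up candidates: a real variant, a reversed string, and an exact match
candidates <- c("AMMINEX AS", "XENIMMA", "AMMINEX")
scores <- norm_edit_dist("AMMINEX", candidates)
ranked <- candidates[order(scores)]
print(data.frame(candidate = candidates, score = scores))
```

Here the exact match ranks first, "AMMINEX AS" second, and the reversed string last. The scale differs from the cosine distance, so a cutoff such as 0.2 would have to be chosen again.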

Source: https://stackoverflow.com/questions/49586356/im-trying-to-use-the-stringdist-to-fuzzy-match-company-names-between-two-data
