stringdist

Fast Levenshtein distance in R?

泪湿孤枕 submitted on 2019-12-17 09:36:05
Question: Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.
Answer 1: levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.
Answer 2: stringdist in the stringdist package does it too, and is even faster than levenshteinDist under certain conditions (1).
Answer 3: You could try stringDist from Biostrings as well.
Source: https://stackoverflow
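As a rough illustration of calling a compiled Levenshtein implementation from R, the sketch below uses base R's adist() (from utils), which is not mentioned in the thread but needs no extra packages, and then repeats the computation with stringdist::stringdist() only if that package happens to be installed. The example strings are made up.

```r
# Hypothetical example strings (not from the original question).
a <- c("kitten", "flaw", "stringdist")
b <- c("sitting", "lawn", "stringDist")

# Base R: adist() computes generalized Levenshtein distances in compiled code.
# The result is a matrix with one row per element of `a`
# and one column per element of `b`.
d_all <- adist(a, b)

# The element-wise distances (a[i] vs b[i]) sit on the diagonal.
d_pair <- diag(d_all)
print(d_pair)

# Optional: the same element-wise distances with the stringdist package,
# run only when the package is available.
if (requireNamespace("stringdist", quietly = TRUE)) {
  d_sd <- stringdist::stringdist(a, b, method = "lv")
  print(d_sd)
}
```

For element-wise comparison of long vectors, a vectorized call such as stringdist::stringdist(a, b) avoids building the full matrix that adist(a, b) returns.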

Displaying corresponding values in data frame in R

一个人想着一个人 submitted on 2019-12-11 12:27:44
Question: Please check the code below. I have created a data frame using the three variables below; the variable "y123" computes the similarity between columns a2 and a1. "y123" gives me 16 values in total, where every a1 value gets compared with a2. When a particular "a1" value is compared with a particular "a2" value, I want the corresponding "a3" value next to "a2" to be displayed beside it. So the result should be a data frame with column y123 and a second column with corresponding

Find matching groups of strings in R

痞子三分冷 submitted on 2019-12-11 05:32:58
Question: I have a vector of about 8000 strings. Each element in the vector is a company name. My objective is to cluster these company names into groups, so that each cluster contains company names that are similar to each other (for example, ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL etc. will belong to the same group/cluster, as they are all Shell-based companies, i.e. they have the word 'Shell' in their names). When dealing with a vector of this size, it seems to be taking
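The mechanics of clustering names by string distance can be sketched as below. This is only an illustration under several assumptions: it uses base R's adist() for plain Levenshtein distance, average-linkage hclust(), and an arbitrary choice of two clusters; the BP names are invented for contrast. Whole-name edit distance will not necessarily group names by a shared word such as 'Shell', so a token-based or q-gram distance may suit that goal better.

```r
# Company names: the first three come from the question, the rest are made up.
companies <- c("ROYAL DUTCH SHELL", "SHELL USA", "BMCC SHELL",
               "BP AMERICA", "BP EXPLORATION")

# Pairwise Levenshtein distances between all names (symmetric matrix).
d <- adist(companies)
rownames(d) <- colnames(d) <- companies

# Hierarchical clustering on the distance matrix.
# The linkage method and k = 2 are assumptions for illustration only.
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, k = 2)

# Names grouped by cluster label.
print(split(companies, groups))
```

For 8000 names the full matrix has 64 million entries, so memory and run time grow quadratically; blocking the names first (for example by a shared token) is one common way to keep this manageable.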

I'm trying to use “stringdist” to fuzzy match company names between two data frames, but it's not working very well. What can be done?

主宰稳场 submitted on 2019-12-08 06:06:58
Question: I have a data frame with 5 million different company names; many of them refer to the same company spelled in different ways or with misspellings. I use the company name "Amminex" as an example here and then try to stringdist it against the 5 million company names:
    Companylist <- data.frame(Companies=c('AMMINEX'))
This is my big list of company names that I open:
    Biglist <- data.frame(name=c(Biglist[,]))
I put AMMINEX and the 5 million companies in one matrix:
    Matches <- expand.grid(Companylist
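One way to avoid materializing every pair with expand.grid is to compute the distance from the query name to the whole vector in a single vectorized call and then filter. The sketch below assumes a tiny, invented big_list and an arbitrary threshold of 2 edits; it uses base R's adist() with ignore.case = TRUE so that case differences do not count as edits.

```r
query <- "AMMINEX"

# Hypothetical stand-in for the 5 million company names.
big_list <- c("Amminex A/S", "AMMINEX", "Aminex", "Amazon", "Siemens")

# Distance from the query to every name, ignoring case.
# adist() returns a 1 x n matrix here; drop() turns it into a plain vector.
d <- drop(adist(query, big_list, ignore.case = TRUE))

matches <- data.frame(name = big_list, dist = d, stringsAsFactors = FALSE)
matches <- matches[order(matches$dist), ]

# Keep candidates within an assumed maximum of 2 edits.
close <- matches[matches$dist <= 2, ]
print(close)
```

The same pattern should carry over to stringdist::stringdist(query, big_list), which recycles the shorter argument. Note that a suffix such as " A/S" adds several edits, so stripping legal-form suffixes before matching, or using a different distance, may be needed.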

In R - fastest way pairwise comparing character strings on similarity

丶灬走出姿态 submitted on 2019-12-04 21:59:33
I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks? Say I have the following data.frame:
    df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"), v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))
I want to compare each pair of rows in df on their Jaro-Winkler similarity. With some help from others (see this post), I've been able to construct this code:
    #columns to compare
    testCols <- c("names", "v1", "v2")
    #compare pairs
    RowCompare= function(x){ comp
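A common way to remove per-pair overhead is to compute one distance matrix per column and then index it by row pairs, rather than calling a distance function once per pair. The sketch below is not the poster's code: it assumes the stringdist package is installed (the block is skipped otherwise), uses method = "jw" with p = 0.1 for Jaro-Winkler, and, because the original way of combining columns is cut off, simply averages the per-column similarities.

```r
df <- data.frame(
  names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
  v1    = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
  v2    = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
  stringsAsFactors = FALSE
)
testCols <- c("names", "v1", "v2")

have_stringdist <- requireNamespace("stringdist", quietly = TRUE)

if (have_stringdist) {
  # One Jaro-Winkler similarity matrix per column (1 - distance).
  sims <- lapply(testCols, function(col) {
    1 - stringdist::stringdistmatrix(df[[col]], df[[col]],
                                     method = "jw", p = 0.1)
  })

  # Assumption: combine the columns by a plain average.
  mean_sim <- Reduce(`+`, sims) / length(sims)

  # All unordered row pairs, looked up in the averaged matrix.
  pairs <- t(combn(nrow(df), 2))
  result <- data.frame(row1 = pairs[, 1], row2 = pairs[, 2],
                       similarity = mean_sim[pairs])
  print(result[order(-result$similarity), ])
}
```

Rows 4 and 5 are identical in all three columns, so they should come out on top with similarity 1.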

Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

ぐ巨炮叔叔 submitted on 2019-12-04 18:14:46
I was answering these two questions and got an adequate solution, but I had trouble passing arguments using fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case, I'm using a mix of multiple match_funs, including this customized match_fun_stringdist and also == and <= for exact and criteria matching. The error message I'm getting is:
    # Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...): object 'ignore_case' not found
    # Data:
    library(data.table, quietly = TRUE)
    Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR,
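The error message suggests that the extracted function still refers to variables such as ignore_case that only exist inside stringdist_join. One possible workaround, sketched here as an assumption rather than as the accepted fix, is a function factory that binds those settings in a closure, so nothing has to be passed through fuzzy_join at call time. The helper make_dist_match is hypothetical, and base R's adist() stands in for the stringdist call.

```r
# Hypothetical factory: returns a two-argument, vectorized match function
# with max_dist and ignore_case fixed inside the closure.
make_dist_match <- function(max_dist = 2, ignore_case = FALSE) {
  force(max_dist)
  force(ignore_case)
  function(v1, v2) {
    # Element-wise Levenshtein distance between v1[i] and v2[i].
    d <- mapply(
      function(a, b) adist(a, b, ignore.case = ignore_case)[1, 1],
      v1, v2, USE.NAMES = FALSE
    )
    d <= max_dist
  }
}

match_fun_stringdist <- make_dist_match(max_dist = 1, ignore_case = TRUE)

# Called the way the error message shows: two equal-length vectors in,
# one logical vector out.
res <- match_fun_stringdist(c("XYZ Road", "pear"), c("xyz road", "bean"))
print(res)
```

A function built this way could then sit next to `==` and `<=` in the list of match functions described in the question, since each of them takes exactly two vectors.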

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio

烂漫一生 submitted on 2019-12-03 09:11:08
I have a data.table dt with 3 columns:
    id
    name as string
    threshold as num
A sample is:
    dt <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6))
    nid | rname  | maxr
    n1  | apple  | 0.5
    n2  | pear   | 0.8
    n3  | banana | 0.7
    n4  | kiwi   | 0.6
I have a second table dt.ref with 2 columns:
    id
    name as string
A sample is:
    dt.ref <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"), cname = c("apple", "maple", "peer", "dear", "bonobo", "kiwis"))
    cid | cname
    c1  | apple
    c2  | maple
    c3  | peer
    c4  | dear
    c5  | bonobo
    c6  | kiwis
For each
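Since the question text is cut off, the following is only a guess at the intended computation, based on the title: for each name, find the reference name with the highest Levenshtein ratio and compare that ratio with the row's threshold. It assumes the ratio is defined as 1 - distance / max(nchar), uses plain data frames and base R's adist() instead of data.table, and breaks ties by taking the first reference row.

```r
dt <- data.frame(nid = c("n1", "n2", "n3", "n4"),
                 rname = c("apple", "pear", "banana", "kiwi"),
                 maxr = c(0.5, 0.8, 0.7, 0.6),
                 stringsAsFactors = FALSE)
dt_ref <- data.frame(cid = c("c1", "c2", "c3", "c4", "c5", "c6"),
                     cname = c("apple", "maple", "peer", "dear",
                               "bonobo", "kiwis"),
                     stringsAsFactors = FALSE)

# Levenshtein distance of every name against every reference name.
d <- adist(dt$rname, dt_ref$cname)

# Assumed ratio definition: 1 - distance / length of the longer string.
len <- outer(nchar(dt$rname), nchar(dt_ref$cname), pmax)
ratio <- 1 - d / len

# Column index of the best reference per row (first one on ties).
best <- max.col(ratio, ties.method = "first")

dt$best_cid <- dt_ref$cid[best]
dt$best_ratio <- ratio[cbind(seq_len(nrow(dt)), best)]

# Assumption: maxr is a minimum ratio the best match must reach.
dt$above_threshold <- dt$best_ratio >= dt$maxr
print(dt)
```

With data.table, the same matrix could be computed once and the result columns assigned with :=, but the distance and ratio steps would be unchanged.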

R String match for address using stringdist, stringdistmatrix

谁都会走 submitted on 2019-12-01 14:12:13
I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, an address can be written in different ways and in different cases, so it is quite annoying to see no match when there should have been one, and a match when there should not have been one. I did some research and found the stringdist package, which can be used for this. However, I am stuck and I feel I am not using it to its fullest capabilities and some
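Much of the mismatch from case and punctuation can be removed before any distance is computed. The sketch below shows one possible pipeline under stated assumptions: a hypothetical normalize() helper, invented addresses, base R's adist() for the distance, and a nearest-candidate lookup per address in the smaller set. A full 70K by 500K matrix would not fit in memory, so in practice the smaller set would have to be processed in chunks or blocked on something like a postcode.

```r
# Hypothetical helper: upper-case, replace punctuation with spaces,
# collapse runs of whitespace, and trim.
normalize <- function(x) {
  x <- toupper(x)
  x <- gsub("[^A-Z0-9 ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Invented sample data standing in for the two datasets.
small <- c("12 Main St.", "45 oak avenue")
large <- c("12 MAIN ST", "45 Oak Avenu", "99 Elm Road")

# Distances between normalized addresses: rows = small, columns = large.
d <- adist(normalize(small), normalize(large))

# Index of the closest large-set address for each small-set address.
best <- apply(d, 1, which.min)

result <- data.frame(address = small,
                     match = large[best],
                     dist = d[cbind(seq_along(small), best)],
                     stringsAsFactors = FALSE)
print(result)
```

A maximum acceptable distance still has to be chosen to decide which nearest candidates count as real matches; the stringdist package's amatch() takes such a limit through its maxDist argument.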

R String match for address using stringdist, stringdistmatrix

ε祈祈猫儿з submitted on 2019-12-01 12:14:53
Question: I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, an address can be written in different ways and in different cases, so it is quite annoying to see no match when there should have been one, and a match when there should not have been one. I did some research and found the package stringdist

Fast Levenshtein distance in R?

最后都变了- submitted on 2019-11-27 07:49:25
Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.
George Dontas
levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.
stringdist in the stringdist package does it too, and is even faster than levenshteinDist under certain conditions (1).
Aaron Statham
You could try stringDist from Biostrings as well.
Source: https://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r