stringdist | 易学教程

Fast Levenshtein distance in R?

Question: Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.

Answer 1: levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.

Answer 2: stringdist in the stringdist package does it too, and is even faster than levenshteinDist under certain conditions (1).

Answer 3: You could try stringDist from Biostrings as well.

Source: https://stackoverflow
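As a minimal sketch of the stringdist suggestion in the answers above, the snippet below computes element-wise Levenshtein distances with `stringdist()` and cross-checks them against base R's `adist()`. The example words are made up for illustration, and no timing claims are implied.

```r
library(stringdist)

# Illustrative inputs (not from the original question)
a <- c("kitten", "flaw", "gumbo")
b <- c("sitting", "lawn", "gambol")

# Element-wise Levenshtein distance, computed in compiled code
d_sd <- stringdist(a, b, method = "lv")

# Base R cross-check: adist() returns the full matrix, so take its diagonal
d_base <- diag(adist(a, b))

print(d_sd)
print(d_base)
```

For all-pairs comparisons, `stringdistmatrix()` from the same package returns the full distance matrix instead of an element-wise vector.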

Displaying corresponding values in data frame in R

Question: Please check the code below. I have created a data frame using three variables; the variable "y123" computes the similarity between columns a2 and a1. The variable "y123" gives me 16 values in total, where every a1 value gets compared with every a2 value. What I need is that when a particular "a1" value is compared with a particular "a2" value, the corresponding "a3" value next to "a2" is displayed alongside it. So the result should be a data frame with column y123 and a second column with corresponding
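The question's own code is not shown in this excerpt, so the following is only a sketch of one way to keep each a3 value next to the a2 value it belongs to. The contents of `a1`, `a2`, and `a3` are hypothetical, and the use of `stringsim()` with the Levenshtein method is an assumption about how "y123" is computed.

```r
library(stringdist)

# Hypothetical data: 4 values each, so 4 x 4 = 16 comparisons
a1 <- c("apple", "banana", "cherry", "grape")
a2 <- c("appel", "bananna", "chery", "grap")
a3 <- c("red", "yellow", "dark red", "purple")  # a3[j] belongs to a2[j]

# All index pairs (i over a1, j over a2)
pairs <- expand.grid(i = seq_along(a1), j = seq_along(a2))

# Index a3 with the same j as a2 so the two always stay aligned
result <- data.frame(
  a1   = a1[pairs$i],
  a2   = a2[pairs$j],
  a3   = a3[pairs$j],
  y123 = stringsim(a1[pairs$i], a2[pairs$j], method = "lv")
)

print(result)
```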

Find matching groups of strings in R

Question: I have a vector of about 8000 strings. Each element in the vector is a company name. My objective is to cluster these company names into groups, so that each cluster contains company names that are similar to each other (for example, ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL, etc. will belong to the same group/cluster, as they are all Shell-based companies, i.e. they have the word 'Shell' in their names). When dealing with a vector of this size, it seems to be taking
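One common way to approach this kind of grouping is hierarchical clustering on a string distance matrix. The sketch below assumes Jaro-Winkler distance, average linkage, and a cut height of 0.3; all three are illustrative choices rather than anything stated in the question, and the extra company names are invented.

```r
library(stringdist)

# Small stand-in for the 8000-name vector
companies <- c("ROYAL DUTCH SHELL", "SHELL USA", "BMCC SHELL",
               "EXXON MOBIL", "EXXONMOBIL CORP", "MOBIL OIL")

# With a single argument, stringdistmatrix() returns a 'dist' object
d <- stringdistmatrix(companies, method = "jw")

# Agglomerative clustering, then cut the tree at an assumed height
hc <- hclust(d, method = "average")
groups <- cutree(hc, h = 0.3)

print(split(companies, groups))
```

The cut height controls how aggressively names are merged and would need tuning on real data. A token-based method such as `method = "cosine"` with q-grams may suit names that share a word in different positions.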

I'm trying to use the “stringdist” to fuzzy match company names between two data frames, but it's not working very good, what can be done?

Question: I have a data frame with 5 million different company names, many of which refer to the same company spelled in different ways or with misspellings. I use the company name "Amminex" as an example here and then try to stringdist it against the 5 million company names:

    Companylist <- data.frame(Companies=c('AMMINEX'))

This is my big list of company names that I open:

    Biglist <- data.frame(name=c(Biglist[,]))

I put AMMINEX and the 5 million companies in one matrix:

    Matches <- expand.grid(Companylist
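Because `stringdist()` recycles its shorter argument, a single target name can be compared against a whole vector without first building a grid of pairs. The sketch below illustrates that idea. The `big` vector is a made-up stand-in for the 5 million names, and the Jaro-Winkler method, upper-casing, and the 0.15 cutoff are assumptions.

```r
library(stringdist)

target <- "AMMINEX"

# Hypothetical stand-in for the large list of company names
big <- c("Amminex A/S", "AMMINEX", "Aminex", "Ammonia Inc", "Siemens AG")

# Compare the one target against every name; upper-case first so that
# case differences do not count as edits
d <- stringdist(target, toupper(big), method = "jw", p = 0.1)

# Rank candidates by distance and keep those under an assumed cutoff
candidates <- data.frame(name = big, dist = d)
candidates <- candidates[order(candidates$dist), ]
close_matches <- candidates[candidates$dist <= 0.15, ]

print(close_matches)
```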

In R - fastest way pairwise comparing character strings on similarity

I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks? Say I have the following data.frame:

    df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                     v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                     v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))

I want to compare each pair of rows in df on their Jaro-Winkler similarity. With some help from others (see this post), I've been able to construct this code:

    #columns to compare
    testCols <- c("names", "v1", "v2")
    #compare pairs
    RowCompare= function(x){ comp
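The `RowCompare` function is cut off in this excerpt, so the following is not a reconstruction of it. It is a sketch of a vectorized alternative: one pairwise Jaro-Winkler matrix per column via `stringdistmatrix()`, combined across columns. Averaging the per-column similarities and using `p = 0.1` are assumptions made for illustration.

```r
library(stringdist)

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 v1 = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                 v2 = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"),
                 stringsAsFactors = FALSE)

testCols <- c("names", "v1", "v2")

# One pairwise similarity matrix per column: 1 - Jaro-Winkler distance
sims <- lapply(testCols, function(col) {
  1 - as.matrix(stringdistmatrix(df[[col]], method = "jw", p = 0.1))
})
names(sims) <- testCols

# Assumed way of combining columns: the unweighted mean similarity
mean_sim <- Reduce(`+`, sims) / length(sims)

print(round(mean_sim, 3))
```

Each call computes all row pairs for a column in compiled code, which avoids calling an R function once per pair.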

Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

I was answering these two questions and got an adequate solution, but I had trouble passing arguments through fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case, I'm using a mix of multiple match_funs, including this customized match_fun_stringdist and also == and <= for exact and criteria matching. The error message I'm getting is:

    # Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...): object 'ignore_case' not found

    # Data:
    library(data.table, quietly = TRUE)
    Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR,
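One possible way around a "object 'ignore_case' not found" error is to bind such settings inside a closure, so the match function only ever needs its two column arguments. The sketch below shows that pattern using stringdist alone. `make_stringdist_match` is a hypothetical helper, not part of fuzzyjoin, and the question's own `match_fun_stringdist` is not shown in this excerpt.

```r
library(stringdist)

# Hypothetical factory: returns a two-argument, vectorized match function
# with max_dist, method, and ignore_case already bound
make_stringdist_match <- function(max_dist = 2, method = "osa",
                                  ignore_case = FALSE) {
  function(v1, v2) {
    if (ignore_case) {
      v1 <- tolower(v1)
      v2 <- tolower(v2)
    }
    # TRUE where the pair is within the allowed distance
    stringdist(v1, v2, method = method) <= max_dist
  }
}

match_addr <- make_stringdist_match(max_dist = 1, ignore_case = TRUE)
res <- match_addr(c("XYZ", "pqr"), c("xyz", "abc"))
print(res)
```

A function built this way could then be placed in the list given to `match_fun` next to `==` and `<=`, assuming `fuzzy_join` accepts one two-argument function per column pair, as its documentation describes.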

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio

R String match for address using stringdist, stringdistmatrix

I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, addresses can be written in different ways and in different cases, so it is quite annoying to see no match when there should have been one, and a match when there should not have been one. I did some research and found the stringdist package, which can be used for this. However, I am stuck, and I feel I am not using it to its fullest capabilities and some
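A minimal sketch of approximate address lookup with `amatch()` from stringdist follows. It normalizes case, punctuation, and whitespace first, then returns for each address in the smaller set the index of the closest address in the larger set. The sample addresses, the `normalize` helper, and the `maxDist = 2` threshold are all assumptions for illustration.

```r
library(stringdist)

# Hypothetical helper: lower-case, replace punctuation, collapse whitespace
normalize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))
  gsub("\\s+", " ", trimws(x))
}

# Made-up stand-ins for the 70K and 500K address sets
small <- c("12 Main St.", "45 Oak Avenue", "9 Unknown Rd")
large <- c("45 OAK AVENUE", "12 main st", "300 Pine Street", "7 Elm Blvd")

# Index into 'large' of the closest match within maxDist, or NA if none
idx <- amatch(normalize(small), normalize(large), method = "osa", maxDist = 2)

matched <- data.frame(small = small, matched = large[idx])
print(matched)
```

The distance threshold trades false matches against missed matches and would need tuning on real addresses.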

Fast Levenshtein distance in R?

Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.

George Dontas: levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.

And stringdist in the stringdist package does it too, even faster than levenshteinDist under certain conditions (1).

Aaron Statham: You could try stringDist from Biostrings as well.

Source: https://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r