stringdist

Calculating similarity between two vectors/Strings in R

Submitted by 旧巷老猫 on 2020-01-25 06:50:12

Question: A similar question may have been asked in this forum, but I feel my requirement is peculiar. I have a data frame df1 with a variable "WrittenTerms" of 40,000 observations, and another data frame df2 with a variable "SuggestedTerms" of 17,000 observations. I need to calculate the similarity between "WrittenTerms" and "SuggestedTerms".

df1$WrittenTerms:

head pain
lung cancer
abdminal pain

df2$SuggestedTerms:

cardio attack
breast cancer
abdomen pain
head ache
lung cancer

I need to get
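A common way to attack this is to build the full distance matrix between the two vectors and take the best match per row. A minimal sketch using base R's `adist()` (generalized Levenshtein); `stringdist::stringdistmatrix()` is the package equivalent. The sample vectors are taken from the question:

```r
# Match each written term to its closest suggested term by edit distance.
written   <- c("head pain", "lung cancer", "abdminal pain")
suggested <- c("cardio attack", "breast cancer", "abdomen pain",
               "head ache", "lung cancer")

# 3 x 5 matrix of Levenshtein distances (rows = written, cols = suggested)
d <- adist(written, suggested)

# For each written term, pick the suggested term at the smallest distance
best <- suggested[apply(d, 1, which.min)]
data.frame(written, best)
# best is c("head ache", "lung cancer", "abdomen pain")
```

At the question's scale (40,000 x 17,000) the full matrix is very large, so in practice one would process the rows in chunks, or use `stringdist::amatch()`, which returns only the index of the closest match within a maximum distance.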

Finding similar rows (not duplicates) in a dataframe in R

Submitted by 落花浮王杯 on 2020-01-24 18:51:06

Question: I have a dataset of >800k rows (example):

id    | fieldA      | fieldB              | codeA  | codeB
120   | Similar one | addrs example1      | 929292 | 0006
3490  | Similar oh  | addrs example3      | 929292 | 0006
2012  | CLOSE CAA   | addrs example10232  | kkda9a | 0039
9058  | CLASE CAC   | addrs example01232  | kkda9a | 0039
9058  | NON DONE    | addrs example010193 | kkda9a | 0039
48848 | OOO AD ADDD | addrs example18238  | uyMMnn | 8303

The field id is a unique id; both codeA and codeB must be the same, but fieldA and fieldB need a Levenshtein distance or a similar function.
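One sketch of this approach, using base R only: group rows by the exact-match columns, then compare fieldA within each group by Levenshtein distance. The cutoff of 3 is an assumption for illustration, and the data is abridged from the question:

```r
# Flag near-duplicate rows that share codeA and codeB but whose fieldA
# strings are within a small Levenshtein distance of each other.
df <- data.frame(
  id     = c(120, 3490, 2012, 9058),
  fieldA = c("Similar one", "Similar oh", "CLOSE CAA", "CLASE CAC"),
  codeA  = c("929292", "929292", "kkda9a", "kkda9a"),
  codeB  = c("0006", "0006", "0039", "0039"),
  stringsAsFactors = FALSE
)

# Within each (codeA, codeB) group, compare every pair of fieldA values.
pairs <- do.call(rbind, lapply(split(df, list(df$codeA, df$codeB), drop = TRUE),
  function(g) {
    if (nrow(g) < 2) return(NULL)
    cmb <- t(combn(nrow(g), 2))          # all row pairs within the group
    data.frame(id1  = g$id[cmb[, 1]],
               id2  = g$id[cmb[, 2]],
               dist = adist(g$fieldA)[cmb])
  }))
pairs[pairs$dist <= 3, ]  # pairs close enough to count as "similar"
```

Grouping first is the key design choice: it turns one enormous 800k x 800k comparison into many small within-group comparisons.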

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio

Submitted by ╄→尐↘猪︶ㄣ on 2020-01-01 03:59:11

Question: I have a data.table dt with 3 columns: id, name (as string), and threshold (as num). A sample is:

dt <- data.table(nid = c("n1", "n2", "n3", "n4"),
                 rname = c("apple", "pear", "banana", "kiwi"),
                 maxr = c(0.5, 0.8, 0.7, 0.6))

nid | rname  | maxr
n1  | apple  | 0.5
n2  | pear   | 0.8
n3  | banana | 0.7
n4  | kiwi   | 0.6

I have a second table dt.ref with 2 columns: id and name (as string). A sample is:

dt.ref <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"),
                     cname = c("apple", "maple", "peer", "dear", "bonobo
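A sketch of the merge-by-maximum-ratio idea, using base R data frames for brevity (the data.table version is analogous) and the Levenshtein ratio defined as 1 minus the distance divided by the longer string's length. The reference names are abridged because the question's sample is cut off:

```r
# For each rname, find the cname with the highest Levenshtein ratio and
# keep it only if that ratio reaches the row's own threshold maxr.
dt  <- data.frame(nid  = c("n1", "n2", "n3", "n4"),
                  rname = c("apple", "pear", "banana", "kiwi"),
                  maxr  = c(0.5, 0.8, 0.7, 0.6))
ref <- data.frame(cid   = c("c1", "c2", "c3", "c4"),
                  cname = c("apple", "maple", "peer", "dear"))

d   <- adist(dt$rname, ref$cname)                               # distances
sim <- 1 - d / outer(nchar(dt$rname), nchar(ref$cname), pmax)   # ratios

best <- apply(sim, 1, which.max)                 # column of the best ratio
dt$match <- ifelse(sim[cbind(seq_len(nrow(dt)), best)] >= dt$maxr,
                   ref$cname[best], NA)
dt
# "apple" matches "apple" (ratio 1); "pear"'s best ratio 0.75 < 0.8, so NA
```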

efficient programming in R

Submitted by 故事扮演 on 2019-12-25 02:18:43

Question: I have data like:

author_id | paper_id | confirmed | author_name1     | author_affiliation1  | author_name
826       | 25733    | 1         | Emanuele Buratti | Genetic engineering  | Emanuele Buratti
826       | 25733    | 1         | Emanuele Buratti | International center | Emanuele Buratti
826       | 47276    | 1         | Emanuele Buratti |                      | Emanuele Buratti
826       | 77012    | 1         | Emanuele Buratti |                      | Emanuele Buratti
826       | 77012    | 1         | Emanuele Buratti |                      | Emanuele Buratti
826       | 79468    | 1         | Emanuele Buratti |                      | Emanuele Buratti

author_affiliation
Genetic enginereing
The International Centre for Genetic Engineering
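One piece of such cleaning can be sketched with edit distance: map each observed affiliation spelling onto the closest entry in a canonical list, so a typo like "Genetic enginereing" collapses onto one value. The canonical list here is an assumption for illustration, not part of the question:

```r
# Collapse near-duplicate affiliation spellings onto a canonical list.
affiliations <- c("Genetic engineering", "Genetic enginereing",
                  "The International Centre for Genetic Engineering")
canonical <- c("Genetic engineering",
               "The International Centre for Genetic Engineering")

# Case-insensitive Levenshtein distance to every canonical entry
d <- adist(tolower(affiliations), tolower(canonical))
cleaned <- canonical[apply(d, 1, which.min)]
# cleaned collapses the misspelling onto "Genetic engineering"
```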

String fuzzy matching in dataframe

Submitted by 末鹿安然 on 2019-12-24 23:41:09

Question: I have a dataframe containing the title of an article and the associated URL link. My problem is that the URL link is not necessarily in the row of the corresponding title, for example:

title                             | urls
Who will be the next president?   | https://website/5-ways-to-make-a-cocktail.com
5 ways to make a cocktail         | https://website/who-will-be-the-next-president.com
2 millions raised by this startup | https://website/how-did-you-find-your-house.com
How did you find your house       | https://website/2-millions
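A hedged sketch of one way to re-pair them: normalize each title into a URL-style slug, strip the URL down to its path, and match by edit distance. The slug rule and URL shapes below are simplified from the question's examples (the truncated fourth URL is left out):

```r
# Re-pair each title with its most similar URL.
titles <- c("Who will be the next president?",
            "5 ways to make a cocktail",
            "How did you find your house")
urls <- c("https://website/5-ways-to-make-a-cocktail.com",
          "https://website/who-will-be-the-next-president.com",
          "https://website/how-did-you-find-your-house.com")

# Turn a title into a lowercase, hyphen-separated slug
slug <- function(x) gsub("[^a-z0-9]+", "-", tolower(x))
# Strip the scheme/host prefix and the trailing ".com" from each URL
path <- sub("^https://[^/]+/", "", sub("\\.com$", "", urls))

d <- adist(slug(titles), path)
matched <- urls[apply(d, 1, which.min)]
data.frame(titles, matched)
```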

machine learning algorithm for spelling check

Submitted by 不想你离开。 on 2019-12-24 14:19:35

Question: I have a list of medicine names (regular_list) and a list of new names (new_list). I want to check whether the names in new_list are already present in regular_list or not. The issue is that the names in new_list could have some typos, and I want those names to be considered a match to the regular list. I know that using stringdist is a solution to the problem, but I need a machine learning algorithm.

Answer 1: As it was already mentioned here, machine learning to overcome typo errors ,
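For reference, the non-ML baseline the question mentions can be sketched in a few lines: a new name counts as "already present" when its edit distance to some regular name is within a typo tolerance. The names and tolerance below are made up for illustration:

```r
# Typo-tolerant membership test via Levenshtein distance.
regular_list <- c("paracetamol", "ibuprofen", "amoxicillin")
new_list     <- c("paracetamoll", "ibuprofenn", "aspirin")

d <- adist(new_list, regular_list)   # distances to every regular name
tolerance <- 2                       # allow up to 2 character edits
is_known <- apply(d, 1, min) <= tolerance
# is_known: TRUE TRUE FALSE ("aspirin" is genuinely new)

# stringdist::amatch(new_list, regular_list, maxDist = 2) performs the
# same lookup and returns the index of the matched regular name.
```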

In R - fastest way pairwise comparing character strings on similarity

Submitted by 青春壹個敷衍的年華 on 2019-12-22 01:34:40

Question: I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks? Say I have the following data.frame:

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"),
                 v1 = c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"),
                 v2 = c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))

I want to compare each pair of rows in df on their Jaro-Winkler similarity. With some help from others (see this post), I've been able to construct this code
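The usual speed-up, sketched below under the assumption that the stringdist package is available: compute the distance matrix once over the unique names (vectorized in C inside the package) instead of looping over all row pairs in R.

```r
# Pairwise Jaro-Winkler distances over unique names, not all row pairs.
library(stringdist)

df <- data.frame(names = c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"))

u <- unique(df$names)                        # 4 unique names, not 5 rows
m <- stringdistmatrix(u, u, method = "jw")   # one vectorized call
dimnames(m) <- list(u, u)

m["A ADAM", "A APPLE"]   # any pair is now a constant-time lookup
```

With many duplicated names (as in real name data), deduplicating first shrinks the quadratic cost substantially; identical rows share one matrix entry.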

Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

Submitted by 有些话、适合烂在心里 on 2019-12-22 01:15:55

Question: I was answering these two questions and got an adequate solution, but I had trouble passing arguments using fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case, I'm using a mix of multiple match_funs, including the customized match_fun_stringdist and also == and <= for exact and criteria matching. The error message I'm getting is:

# Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...): object 'ignore_case' not found

# Data:
library(data.table,
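One workaround for this class of error, sketched here under the assumption that the fuzzyjoin and stringdist packages are available (the data and distance cutoff are invented for illustration): bake the extra arguments into the match_fun with a closure, so nothing has to travel through fuzzy_join's `...` where it can get lost.

```r
library(fuzzyjoin)
library(stringdist)

# A closure captures max_dist and method, so fuzzy_join only ever sees a
# plain two-argument match function.
make_match_fun <- function(max_dist, method = "lv") {
  function(x, y) stringdist(tolower(x), tolower(y), method = method) <= max_dist
}

a <- data.frame(name  = c("apple", "banana"))
b <- data.frame(fruit = c("Aple", "bananna", "cherry"))

res <- fuzzy_join(a, b, by = c("name" = "fruit"),
                  match_fun = make_match_fun(max_dist = 2),
                  mode = "inner")
res  # apple ~ Aple, banana ~ bananna
```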

Jaccard similarity in stringdist package to match words in character string

Submitted by 折月煮酒 on 2019-12-21 05:43:07

Question: I would like to use the Jaccard similarity in the stringdist function to determine the similarity of bags of words. From what I can tell, using Jaccard only matches by letters within a character string.

c <- c('cat', 'dog', 'person')
d <- c('cat', 'dog', 'ufo')
stringdist(c, d, method = 'jaccard', q = 2)
[1] 0 0 1

So we see here that it calculates the similarity of 'cat' and 'cat', 'dog' and 'dog', and 'person' and 'ufo'. I also tried converting the words into one long text string. The following
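Note that stringdist's method = 'jaccard' works on letter q-grams and returns a distance (1 minus the similarity), which is why identical words give 0. For a bag-of-words comparison, one sketch is to compute Jaccard over the word sets directly (vectors renamed from the question's c/d to avoid masking base::c):

```r
# Jaccard similarity over word sets: |intersection| / |union|.
jaccard_words <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

c1 <- c("cat", "dog", "person")
d1 <- c("cat", "dog", "ufo")
jaccard_words(c1, d1)  # 2 shared words / 4 total words = 0.5
```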

Calculating string similarity as a percentage

Submitted by 早过忘川 on 2019-12-20 03:07:37

Question: The given function uses the "stringdist" package in R and tells the minimum number of changes needed to turn one string into another. I wish to find out how similar one string is to another, as a percentage. Please help me, and thanks.

stringdist("abc", "abcd", method = "lv")

Answer 1: You can use the RecordLinkage package and the function levenshteinSim, i.e.

# This gives the similarity
RecordLinkage::levenshteinSim('abc', 'abcd')
# [1] 0.75

# so to get the distance just subtract from 1,
1 - RecordLinkage:
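The same 0-1 value can be derived without an extra package: the similarity is 1 minus the Levenshtein distance normalized by the longer string's length. A base-R sketch (stringdist itself also offers stringsim("abc", "abcd", method = "lv") for the 0-1 form):

```r
# Levenshtein similarity as a percentage: 1 - distance / longer length.
similarity_pct <- function(a, b) {
  drop(100 * (1 - adist(a, b) / pmax(nchar(a), nchar(b))))
}

similarity_pct("abc", "abcd")  # 75, matching levenshteinSim's 0.75
```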