stringdist

How to use custom SQL function in dbplyr?

て烟熏妆下的殇ゞ submitted on 2021-02-07 03:51:30
Question: I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package. But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R. There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL) but I guess I'm not sure how best to get that SQL code to work with
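One possible approach, sketched here, relies on dbplyr writing function names it does not recognise into the generated SQL unchanged. The sketch assumes the Jaro-Winkler SQL code has already been installed in the database as a user-defined function. The function name JARO_WINKLER, the columns name_a and name_b, and the 0.9 threshold are all hypothetical. lazy_frame() is used so the generated SQL can be inspected without a live connection.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table: lazy_frame() simulates a remote table, no database needed.
pairs <- lazy_frame(name_a = "martha", name_b = "marhta")

# JARO_WINKLER is not an R function. dbplyr has no translation for it, so the
# call is passed through to SQL as written (assumed to exist in the database).
query <- pairs %>%
  filter(JARO_WINKLER(name_a, name_b) > 0.9)

# Render the SQL that would be sent to the database.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

Against a real connection, the same filter() would be placed before collect(), so that only rows passing the threshold are pulled into R.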

R Function to identify non-matching rows

房东的猫 submitted on 2021-02-04 21:20:46
Question: I am trying to compare two data.frames: "V1" represents my CRM and "V2" represents leads that I would like to send out. V1 has roughly 8k elements and V2 has roughly 25k elements. I need to compare every row in V2 to every row in V1 and discard every instance where a V2 element exists in V1. I would then like to return into the Leads column only the elements that do not appear, either exactly or loosely, in V1. The goal is to send out a lead (V2) that does not exist in the CRM (V1). I've made some good
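A minimal sketch of the "neither exactly nor loosely in V1" filter, using amatch() from the stringdist package. The two vectors are made-up stand-ins for the CRM and lead columns, and the edit-distance tolerance of 2 is an assumption that would need tuning on real data.

```r
library(stringdist)

# Hypothetical example data; the real V1 (CRM) and V2 (leads) are not shown.
crm   <- c("Acme Corp", "Globex")
leads <- c("Acme Corp.", "Initech", "Globex")

# amatch() returns the index of the closest CRM entry within maxDist edits,
# or NA when nothing in the CRM is close enough.
idx <- amatch(leads, crm, method = "osa", maxDist = 2)

# Keep only the leads with no exact or loose match in the CRM.
new_leads <- leads[is.na(idx)]
print(new_leads)
```

Setting maxDist = 0 would reduce this to exact matching, so the same call covers both cases.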

Fuzzy merging in R - seeking help to improve my code

一个人想着一个人 submitted on 2021-01-20 19:53:24
Question: Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself which combines exact and fuzzy (by string distances) matching. The merging job I have to do is quite big (resulting in multiple string distance matrices with slightly fewer than one billion cells) and I had the impression that the fuzzy_join function is not written very efficiently (with regard to memory usage) and that the parallelization is implemented in a weird manner (the computation of the
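To illustrate the idea of combining exact and fuzzy matching, here is a small sketch. It is not the asker's function and not statar's fuzzy_join. It first joins on an exact key, so string distances are only computed for candidate pairs within the same block rather than for a full distance matrix. The data frames, the column names state and name, and the threshold of 1 are hypothetical.

```r
library(stringdist)

# Hypothetical inputs: 'state' must match exactly, 'name' may differ slightly.
left <- data.frame(state = c("CA", "CA", "NY"),
                   name  = c("Jon Smith", "Ann Lee", "Bob Ray"),
                   stringsAsFactors = FALSE)
right <- data.frame(state = c("CA", "NY", "NY"),
                    name  = c("John Smith", "Bob Rae", "Ann Lee"),
                    stringsAsFactors = FALSE)

# Step 1: exact match on the blocking key yields the candidate pairs.
cand <- merge(left, right, by = "state", suffixes = c("_l", "_r"))

# Step 2: pairwise (not matrix) string distance on the candidates only.
cand$dist <- stringdist(cand$name_l, cand$name_r, method = "osa")

# Step 3: keep pairs within the assumed tolerance.
matches <- cand[cand$dist <= 1, ]
print(matches)
```

Blocking on the exact key is what keeps memory use down: the number of distances computed is the sum of the block products, not the product of the two table sizes.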

stringdist_join results in NAs

烂漫一生 submitted on 2020-06-27 12:22:12
Question: I am experimenting with the stringdist package in order to make fuzzy joins, and I have run into a problem which I do not understand and cannot find an answer for. I want to join these two data tables with the "dl" method and it produces an NA, which I do not understand at all. Maybe one of you has an explanation for this. The code:
library(fuzzyjoin)
library(data.table)  # provides setnames()
test1 <- as.data.frame(test1 <- c("techniker"))
test2 <- as.data.frame(test2 <- c("technician"))
setnames(test2, 1, "label")
setnames(test1, 1, "label")
x <-
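One thing worth checking, offered as a guess because the join call itself is cut off above, is how far apart the two labels actually are under the "dl" method. The sketch below computes that distance directly with stringdist() and compares it with a threshold. The value max_dist = 2 is an assumption about the join's setting, not something the excerpt shows.

```r
library(stringdist)

# Damerau-Levenshtein distance between the two labels from the question.
d <- stringdist("techniker", "technician", method = "dl")
print(d)

# Assumed threshold: a pair farther apart than this would not be matched,
# and an unmatched row in a left-style join shows up as NA.
max_dist <- 2
print(d <= max_dist)
```

If the distance exceeds the threshold in use, raising the threshold in the join call would be one way to test whether that explains the NA.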

joining on inexact strings in R

≯℡__Kan透↙ submitted on 2020-03-05 04:01:50
Question: I am looking to join two tables. However, the data I am looking to join on does not match exactly. I am joining on NFL player names. The data sets are below. > dput(att75a) structure(list(rusher_player_name = c("A.Ekeler", "A.Jones", "A.Kamara", "A.Mattison", "A.Peterson", "B.Hill"), mean_epa = c(-0.110459963350783, 0.0334332018597805, -0.119488111742492, -0.155261835310445, -0.123485646124451, -0.0689611296359916), success_rate = c(0.357664233576642, 0.40495867768595, 0.401129943502825, 0