fuzzy-comparison

R - Merging two data files based on partial matching of inconsistent full name formats

落爺英雄遲暮 submitted on 2019-12-07 01:58:29
Here is my previous question, reposted in R format. I'm looking for a way to merge two data files based on partial matching of participants' full names, which are sometimes entered in different formats and sometimes misspelled. I know there are several functions for partial matching (e.g. agrep and pmatch) and for merging data files, but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) storing both original name formats in the merged data file; and d) retaining unique values even if they don't have a match. For example, I have the following
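A minimal sketch of requirements a) to d), written in Python rather than R and using the standard library's difflib similarity ratio in place of agrep. The helper names (name_key, fuzzy_outer_merge), the 0.85 threshold, and the assumption that names are written in "First [Middle] Last" order are all illustrative and not taken from the question.

```python
from difflib import SequenceMatcher


def name_key(full_name):
    """Lowercase the name, drop commas and periods, keep first and last token only."""
    tokens = full_name.lower().replace(",", " ").replace(".", " ").split()
    if not tokens:
        return ""
    return tokens[0] if len(tokens) == 1 else f"{tokens[0]} {tokens[-1]}"


def fuzzy_outer_merge(left, right, threshold=0.85):
    """Pair names whose keys are similar; keep unmatched names from both sides.

    Returns a list of (left_name, right_name) tuples holding the ORIGINAL
    spellings; None marks a side with no match (an outer join).
    """
    rows = []
    unused = list(right)
    for left_name in left:
        best, best_score = None, 0.0
        for right_name in unused:
            score = SequenceMatcher(
                None, name_key(left_name), name_key(right_name)
            ).ratio()
            if score > best_score:
                best, best_score = right_name, score
        if best is not None and best_score >= threshold:
            unused.remove(best)  # each right-hand name is used at most once
            rows.append((left_name, best))
        else:
            rows.append((left_name, None))
    rows.extend((None, right_name) for right_name in unused)
    return rows


merged = fuzzy_outer_merge(
    ["John A. Smith", "Mary Jones", "Peter Pan"],
    ["Jon Smith", "Mary Ann Jones", "Alice Wu"],
)
```

The greedy one-to-one pairing is a simplification; with many near-identical names a proper assignment step or manual review of borderline scores would likely be needed.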

Fuzzy match row in one column with same row in next column

你离开我真会死。 submitted on 2019-12-06 11:29:28
I would like to find information in one column based on another column. I have some words in one column and complete sentences in another, and I would like to know whether the words occur in those sentences. Sometimes the words are not written identically, so I cannot use the SQL LIKE operator. I therefore think fuzzy matching plus some sort of 'like' function would be helpful, as the data looks like this: Names Sentences Airplanes Sarl Airplanes-Sàrl is part of Airplanes-Group Sarl. Kidco Ltd. 100% ownership of Kidco.Ltd. is the mother company. Popsi Co. Cola Inc. is 50% share of PopsiCo which is part of

Fuzzy logic on big datasets using Python

烈酒焚心 submitted on 2019-12-06 07:14:08
Question: My team has been stuck running a fuzzy logic algorithm on two large datasets. The first (the subset) has about 180K rows and contains names, addresses, and emails for the people that we need to match in the second (the superset). The superset contains 2.5M records. Both have the same structure, and the data has already been cleaned, i.e. addresses parsed, names normalized, etc. The columns are: ContactID int, FullName varchar(150), Address varchar(100), Email varchar(100). The goal is to match values in a row of
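One common way to keep such a comparison tractable is blocking: only records that share a cheap key are compared fuzzily, so the 180K x 2.5M cross product is never built. The sketch below is an assumption-laden illustration, not the asker's approach: the blocking key (first letter of FullName), the 0.9 threshold, and the function names are hypothetical, and only the ContactID and FullName columns from the question are used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    """Hypothetical blocking key: the first letter of the full name."""
    return record["FullName"][:1].lower()


def match_with_blocking(subset, superset, threshold=0.9):
    """Map each subset ContactID to the best superset ContactID, or None."""
    blocks = defaultdict(list)
    for rec in superset:
        blocks[block_key(rec)].append(rec)

    matches = {}
    for rec in subset:
        best_id, best_score = None, threshold
        # Compare only against candidates that share the blocking key.
        for cand in blocks.get(block_key(rec), []):
            score = SequenceMatcher(
                None, rec["FullName"].lower(), cand["FullName"].lower()
            ).ratio()
            if score >= best_score:
                best_id, best_score = cand["ContactID"], score
        matches[rec["ContactID"]] = best_id
    return matches


subset = [
    {"ContactID": 1, "FullName": "Jane Doe"},
    {"ContactID": 2, "FullName": "Omar Khan"},
]
superset = [
    {"ContactID": 10, "FullName": "Jane Do"},
    {"ContactID": 11, "FullName": "John Roe"},
    {"ContactID": 12, "FullName": "Zed Khan"},
]
result = match_with_blocking(subset, superset)
```

A first-letter key misses records whose first character is itself mistyped, so in practice several keys (for example email domain or postcode) would probably be combined.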

Position of Approximate Substring Matches in R

佐手、 submitted on 2019-12-06 04:28:00
I'm using R for string processing. I have a data frame with a column of strings, say: df <- data.frame(textcol=c("In this substring would like to find the position of this substring", "I would also like to find the position of thes substring", "No match here","No mention of this substrangy thing")) matchPattern <- "this substring" I am searching for a function that (depending on a distance parameter of some sort, say Jaro-Winkler) would take my matchPattern, compare it to every row of the data frame's text column, and return the exact position of the match within the matched string, i.e. 36
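As a rough illustration of locating an approximate substring, the following Python sketch slides a window of the pattern's length over the text and keeps the best-scoring start position. It substitutes difflib's similarity ratio for Jaro-Winkler, and the function name approx_find and the 0.8 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def approx_find(text, pattern, threshold=0.8):
    """Return (1-based start, score) of the best-matching window, or None.

    The window has the same length as the pattern; ties keep the earliest
    position.
    """
    n = len(pattern)
    best_pos, best_score = None, 0.0
    for start in range(max(len(text) - n, 0) + 1):
        score = SequenceMatcher(None, text[start:start + n], pattern).ratio()
        if score > best_score:
            best_pos, best_score = start + 1, score
    if best_score >= threshold:
        return best_pos, best_score
    return None


pattern = "this substring"
exact = approx_find(
    "In this substring would like to find the position of this substring",
    pattern,
)
typo = approx_find(
    "I would also like to find the position of thes substring", pattern
)
missing = approx_find("No match here", pattern)
```

Because the window length is fixed, matches that involve inserted or deleted characters score lower than substitutions; a variable-width window or an edit-distance-based search would handle those better.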

How to normalize company names

荒凉一梦 submitted on 2019-12-06 01:55:48
We have user-generated names of employers that come in all variations. For example, people have typed in or imported: Google; Google, Inc.; Google Inc.; Google inc. To a database search, these look like different companies altogether. We've changed some things to map each employer to a "normalized" name, but with 70,000 in total, it becomes hard to do by hand. Does anyone have suggestions on how to normalize the existing entries, and also how to make sure we do it for all incoming names as well? There are two things you can do to help: When users are adding a company name, give them an
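A minimal sketch of one normalization step, assuming that variants differ only in case, punctuation, and a trailing legal suffix. The suffix list and the function name normalize_company are hypothetical and would need extending for real data.

```python
import re

# Hypothetical, deliberately short list of legal suffixes to strip.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "corp", "corporation", "co"}


def normalize_company(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


variants = ["Google", "Google, Inc.", "Google Inc.", "Google inc"]
normalized = {normalize_company(v) for v in variants}
```

All four variants from the question collapse to the same key here; names that differ by spelling rather than suffix would still need a fuzzy comparison on top of this.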

Lucene.net Fuzzy Phrase Search

跟風遠走 submitted on 2019-12-05 22:29:10
I have tried this myself for a considerable period and looked everywhere around the net, but have been unable to find ANY examples of fuzzy phrase searching via Lucene.NET 2.9.2 (C#). Is someone able to advise how to do this in detail and/or provide some example code? I would seriously appreciate any help, as I am totally stuck. I assume that you have Lucene running and created a search index with some fields in it. So let's assume further that: var fields = ... // a string[] of the field names you wish to search in var version = Version.LUCENE_29; // your Lucene version var

elasticsearch fuzzy matching max_expansions & min_similarity

懵懂的女人 submitted on 2019-12-05 10:39:54
Question: I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same names. I need to understand exactly how the fuzzy matching of Elasticsearch works and how it uses the two parameters mentioned in the title. As I understand it, min_similarity is a percentage by which the queried string matches the string in the database. I couldn't find an exact description of how this value is calculated. The max_expansions, as I understand it, is the Levenshtein distance by which

elasticsearch fuzzy matching max_expansions & min_similarity

泪湿孤枕 submitted on 2019-12-03 22:57:40
I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same names. I need to understand exactly how the fuzzy matching of Elasticsearch works and how it uses the two parameters mentioned in the title. As I understand it, min_similarity is a percentage by which the queried string matches the string in the database. I couldn't find an exact description of how this value is calculated. The max_expansions, as I understand it, is the Levenshtein distance by which a search should be executed. If this actually was Levenshtein distance it would have been the ideal
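For reference, Levenshtein distance itself (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another) can be sketched in a few lines of Python. This only illustrates the metric the question mentions; it makes no claim about how Elasticsearch evaluates min_similarity or max_expansions internally.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b using a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free if equal
                )
            )
        prev = curr
    return prev[-1]


distance = levenshtein("kitten", "sitting")
```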

Joining/matching data frames in R

僤鯓⒐⒋嵵緔 submitted on 2019-12-03 22:02:08
I have two data frames. The first one has two columns: x is water depth, y is temperature at each depth. The second one has two columns too: x is also water depth, but measured at different depths from those in the first table, and the second column z is salinity. I want to join the two tables by x, adding z to the first table. I have learned how to join tables using 'key' in tidyr, but that only works if the keys are identical. The x values in these two tables are not the same. What I want to do is to match each depth x in table 2 to a depth in table 1 that is within 10% of it (i.e. match 1.1 in table 2 x to 1.0
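The tolerance join described above can be sketched in plain Python as follows. The function name tolerance_join, the tuple layout, the sample values, and the choice to take the nearest depth when several fall within the tolerance are assumptions for illustration; the small epsilon is there because 1.1 - 1.0 is slightly larger than 0.1 in floating point.

```python
def tolerance_join(table1, table2, rel_tol=0.10, eps=1e-9):
    """Attach z from table2 to each (x, y) row of table1.

    table1 is a list of (x, y) tuples and table2 a list of (x, z) tuples.
    A table2 row qualifies when its x lies within rel_tol of the table1 x;
    the nearest qualifying row wins, and z is None when nothing qualifies.
    """
    joined = []
    for x1, y in table1:
        candidates = [
            (abs(x2 - x1), z)
            for x2, z in table2
            if abs(x2 - x1) <= rel_tol * abs(x1) + eps
        ]
        z = min(candidates, key=lambda c: c[0])[1] if candidates else None
        joined.append((x1, y, z))
    return joined


temperature = [(1.0, 20.5), (5.0, 18.0), (10.0, 15.2)]  # made-up (depth, temp)
salinity = [(1.1, 35.0), (5.2, 35.4), (12.0, 36.0)]     # made-up (depth, salinity)
joined = tolerance_join(temperature, salinity)
```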

SQL Fuzzy Join - MSSQL

梦想的初衷 submitted on 2019-12-03 20:57:25
I have two sets of data. Existing customers and potential customers. My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are inconsistent. EXISTING CUSTOMERS Customer / ID Ed's Barbershop / 1002 GroceryTown / 1003 Candy Place / 1004 Handy Man / 1005 POTENTIAL CUSTOMERS Customer Eds Barbershop Grocery Town Candy Place Handee Man Beauty Salon The Apple Farm Igloo Ice Cream Ride-a-Long Bikes I would like to write some type of select statement like below to reach my objective: SELECT a
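To show the matching logic outside SQL, here is a hedged Python sketch that uses the customer names from the question. It squashes each name to lowercase alphanumerics and then looks up the closest existing customer with difflib.get_close_matches; the helper names and the 0.8 cutoff are assumptions, and a T-SQL solution would need its own similarity function.

```python
from difflib import get_close_matches


def squash(name):
    """Lowercase the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_customers(existing, potential, cutoff=0.8):
    """Map each potential customer name to an existing customer ID, or None."""
    keys = {squash(name): cid for name, cid in existing.items()}
    result = {}
    for name in potential:
        hit = get_close_matches(squash(name), list(keys), n=1, cutoff=cutoff)
        result[name] = keys[hit[0]] if hit else None
    return result


existing = {
    "Ed's Barbershop": 1002,
    "GroceryTown": 1003,
    "Candy Place": 1004,
    "Handy Man": 1005,
}
potential = [
    "Eds Barbershop", "Grocery Town", "Candy Place", "Handee Man",
    "Beauty Salon", "The Apple Farm", "Igloo Ice Cream", "Ride-a-Long Bikes",
]
matches = match_customers(existing, potential)
```

With this cutoff the first four potential customers map to existing IDs and the rest map to None; the cutoff would need tuning against real data to balance false matches against misses.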