fuzzy-comparison

q-gram approximate matching optimisations

不问归期 submitted on 2019-12-03 11:04:27
Question: I have a table containing 3 million person records on which I want to perform fuzzy matching using q-grams (on surname, for instance). I have created a table of 2-grams linking to it, but search performance is poor at this data volume (around 5 minutes). I basically have two questions: (1) Can you suggest any ways to improve performance and avoid a table scan (i.e. having to count common q-grams between the search string and 3 million surnames)? (2) With q-grams, if A is similar to B and

q-gram approximate matching optimisations

99封情书 submitted on 2019-12-03 00:37:48
I have a table containing 3 million person records on which I want to perform fuzzy matching using q-grams (on surname, for instance). I have created a table of 2-grams linking to it, but search performance is poor at this data volume (around 5 minutes). I basically have two questions: (1) Can you suggest any ways to improve performance and avoid a table scan (i.e. having to count common q-grams between the search string and 3 million surnames)? (2) With q-grams, if A is similar to B and C is similar to B, does it imply C is similar to A? Kind regards, Peter. I've been looking into fuzzy
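Both questions in this excerpt can be illustrated with a small sketch. It is not the asker's schema or any answer's method: the padding character `#`, the helper names (`qgrams`, `build_index`, `candidates`, `dice`), the sample surnames, and the thresholds are all assumptions chosen for illustration. The idea is that an inverted index from each 2-gram to the rows containing it lets a search touch only rows that share at least one gram with the query, rather than scanning every surname.

```python
from collections import Counter, defaultdict


def qgrams(text, q=2):
    """Return the multiset of q-grams of text, padded so word edges count."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def build_index(surnames, q=2):
    """Map each q-gram to the set of row ids whose surname contains it."""
    index = defaultdict(set)
    for row_id, name in enumerate(surnames):
        for gram in qgrams(name, q):
            index[gram].add(row_id)
    return index


def candidates(query, index, min_shared, q=2):
    """Return row ids sharing at least min_shared distinct q-grams with query."""
    shared = Counter()
    for gram in qgrams(query, q):
        for row_id in index.get(gram, ()):
            shared[row_id] += 1
    return {row_id for row_id, n in shared.items() if n >= min_shared}


def dice(a, b, q=2):
    """Dice coefficient over q-gram multisets (1.0 means identical)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    overlap = sum((ga & gb).values())
    return 2.0 * overlap / (sum(ga.values()) + sum(gb.values()))


# Hypothetical sample data standing in for the 3 million surnames.
surnames = ["Smith", "Smyth", "Smithson", "Jones", "Johnson"]
index = build_index(surnames)
hits = candidates("Smith", index, min_shared=3)
print(sorted(surnames[i] for i in hits))

# Similarity is not transitive: with an (arbitrary) threshold of 0.4,
# Smith ~ Smithson and Smithson ~ Simpson, but Smith and Simpson are not similar.
print(dice("Smith", "Smithson"), dice("Smithson", "Simpson"), dice("Smith", "Simpson"))
```

In a database, the same idea would typically correspond to an index on the q-gram column so that the join starts from the query's grams; that mapping is an assumption here, since the excerpt does not show the table definitions.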

How do I fuzzy match items in a column of an array in python?

末鹿安然 submitted on 2019-12-02 13:20:02
Question: I have an array of team names from the NCAA, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of the name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to be able to take all variations of a team name by fuzzy matching them and rename all variants to one name. I'm working in Python 2.7 and I have a numpy array

How do I fuzzy match items in a column of an array in python?

佐手、 submitted on 2019-12-02 07:58:31
I have an array of team names from the NCAA, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of the name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to be able to take all variations of a team name by fuzzy matching them and rename all variants to one name. I'm working in Python 2.7 and I have a numpy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before. I have
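One possible way to rename variants to a single name is sketched below using only the standard library's `difflib`. This is an illustration under assumptions, not the asker's code: it assumes a list of canonical names is available (the excerpt does not say so), uses a plain Python list rather than a numpy array, targets Python 3, and the helper name `canonicalize`, the sample names, and the `cutoff` of 0.6 are hypothetical.

```python
import difflib


def canonicalize(names, canonical, cutoff=0.6):
    """Map each name to its closest canonical name, or leave it unchanged."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {c.lower(): c for c in canonical}
    result = []
    for name in names:
        match = difflib.get_close_matches(
            name.lower(), list(lowered), n=1, cutoff=cutoff
        )
        result.append(lowered[match[0]] if match else name)
    return result


# Hypothetical data: the canonical list is an assumption for this sketch.
canonical = ["Alabama Crimson Tide", "Ohio State Buckeyes"]
variants = ["Crimson Tide", "Alabama Crimson Tide", "Ohio St. Buckeyes", "Gonzaga"]
print(canonicalize(variants, canonical))
```

Names with no sufficiently close canonical match are left as they are, so the cutoff controls how aggressively variants are merged and would need tuning on real data.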

How can I use jaro-winkler to find the closest value in a table?

丶灬走出姿态 submitted on 2019-12-02 06:22:19
Question: I have an implementation of the Jaro-Winkler algorithm in my database. I did not write this function. The function compares two values and gives the probability of a match. So jaro(string1, string2, matchnoofchars) will return a result. Instead of comparing two strings, I want to send one string with a matchnoofchars and then get a result set with the probability higher than 95%. For example, the current function is able to return 97.62% for jaro("Philadelphia","Philadelphlaa",9). I wish to tweak

How can I use jaro-winkler to find the closest value in a table?

 ̄綄美尐妖づ submitted on 2019-12-02 00:26:48
I have an implementation of the Jaro-Winkler algorithm in my database. I did not write this function. The function compares two values and gives the probability of a match. So jaro(string1, string2, matchnoofchars) will return a result. Instead of comparing two strings, I want to send one string with a matchnoofchars and then get a result set with the probability higher than 95%. For example, the current function is able to return 97.62% for jaro("Philadelphia","Philadelphlaa",9). I wish to tweak this function so that I am able to find "Philadelphia" for an input of "Philadelphlaa". What kind of
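The "one string in, result set out" shape the question asks for can be sketched as a scan that scores every stored value and keeps those above a threshold. The code below is a textbook Jaro-Winkler written in Python for illustration; it is not the asker's database function, it does not take a matchnoofchars argument (it uses the conventional prefix cap of 4 and scale of 0.1), and so its scores need not agree with the 97.62% quoted above. The helper `closest` and the sample city list are hypothetical.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def closest(query, values, threshold=0.95):
    """Return (value, score) pairs scoring at least threshold, best first."""
    scored = [(value, jaro_winkler(query, value)) for value in values]
    return sorted(
        [pair for pair in scored if pair[1] >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical table contents.
cities = ["Philadelphia", "Pittsburgh", "Phoenix", "Philippi"]
print(closest("Philadelphlaa", cities))
```

In a database the equivalent would be a query that applies the existing function to each row and filters on the result; how to express that depends on the database in use, which the excerpt does not name.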

Python “regex” module: Fuzziness value

倾然丶 夕夏残阳落幕 submitted on 2019-12-01 04:11:02
Question: I'm using the "fuzzy match" functionality of the regex module. How can I get the "fuzziness value" of a match, which indicates how different the pattern is from the string, just like the "edit distance" in Levenshtein? I thought I could get the value from the Match object, but it's not there. The official docs say nothing about it either. e.g.: regex.match('(?:foo){e}','for') a.captures() tells me that the word "for" is matched, but I'd like to know the fuzziness value, which should be 1 in

How to group / compare similar news articles

蓝咒 submitted on 2019-11-30 04:00:06
In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and one from MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use. Thanks in advance for the help! This problem breaks down into a few subproblems from a machine learning

Find Match of two data frames and rewrite the answer as data frame

混江龙づ霸主 submitted on 2019-11-29 08:50:17
I have two data frames which have been cleaned and merged into a single CSV file. The data frames look like this: **Source Master** chang chun petrochemical CHANG CHUN GROUP chang chun plastics CHURCH AND DWIGHT CO INC church dwight CITRIX SYSTEMS ASIA PACIFIC P L citrix systems pacific CNH INDUSTRIAL N.V Now, from these, I have to take each source name, check it against every master name, find the relevant match, and print the output as another data frame. Only a few rows are shown above, but I am working with 20k such values. My output must look like this: **Source Master Result**

Quicker way to perform fuzzy string match in pandas

馋奶兔 submitted on 2019-11-29 08:49:28
Is there any way to speed up fuzzy string matching using fuzzywuzzy in pandas? I have a dataframe, extra_names, which has names that I want to fuzzy match against another dataframe, names_df. >> extra_names.head() not_matching 0 Vij Sales 1 Crom Electronics 2 REL Digital 3 Bajaj Elec 4 Reliance Digi >> len(extra_names) 6500 >> names_df.head() names types 0 Vijay Sales 1 1 Croma Electronics 1 2 Reliance Digital 2 3 Bajaj Electronics 2 4 Pai Electricals 2 >> len(names_df) 250 As of now, I'm running the logic using the following code, but it's taking forever to complete. choices =