fuzzy-search

How to do fuzzy string search without a heavy database?

Submitted by 有些话、适合烂在心里 on 2019-12-07 02:29:00
Question: I have a mapping of catalog numbers to product names:

35 cozy comforter
35 warm blanket
67 pillow

and need a search that finds misspelled, mixed-up names like "warm cmfrter". We have code using edit distance (difflib), but it probably won't scale to the 18,000 names. I achieved something similar with Lucene, but since PyLucene only wraps Java, that would complicate deployment to end users. SQLite doesn't usually have full-text search or scoring compiled in. The Xapian bindings are C++-like and have …
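For a vocabulary of around 18,000 names, the standard library can get surprisingly far before a search engine is needed. A minimal sketch using difflib.get_close_matches (the catalog below is reconstructed from the question; the 0.5 cutoff is an assumption to tune):

```python
import difflib

# Hypothetical catalog mapping reconstructed from the question.
catalog = {35: ["cozy comforter", "warm blanket"], 67: ["pillow"]}
names = [n for ns in catalog.values() for n in ns]

def fuzzy_lookup(query, names, n=3, cutoff=0.5):
    """Return the closest catalog names to a (possibly misspelled) query,
    best match first."""
    return difflib.get_close_matches(query, names, n=n, cutoff=cutoff)

print(fuzzy_lookup("warm cmfrter", names))
```

get_close_matches scores with SequenceMatcher.ratio, so "warm cmfrter" still lands near "cozy comforter"; for 18,000 names a linear scan like this is slow but workable, and the cutoff prunes most candidates cheaply via quick_ratio.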

Wildcard and Fuzzy query together in elastic search

Submitted by *爱你&永不变心* on 2019-12-06 15:53:50
Question: I am trying to design a query in which I can use a wildcard and a fuzzy query together. As I understand it, query_string is used for wildcard searches and multi_match can be used for fuzziness. I want a query that will search on these words: "elast" should return elastic and elasticsearch; "elasttc" should also return elastic and elasticsearch. Does Elasticsearch support wildcard and fuzzy queries together? Thanks in advance.

Answer 1: { "query": { "bool": { "should": [ { "match": { "title": …
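One way to get both behaviours in a single request is a bool query whose should clauses combine a wildcard clause with a fuzzy match clause. A sketch of the request body as a Python dict (the field name "title" and fuzziness "AUTO" are assumptions; adapt them to your mapping):

```python
# Hedged sketch: bool/should combining wildcard and fuzzy matching.
# "elast*" handles the prefix case; fuzziness handles "elasttc".
query = {
    "query": {
        "bool": {
            "should": [
                {"wildcard": {"title": {"value": "elast*"}}},
                {"match": {"title": {"query": "elasttc",
                                     "fuzziness": "AUTO"}}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```

The body can then be posted to the index's _search endpoint with any HTTP client; a document matching either clause is returned, and one matching both scores higher.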

A good SQL strategy for fuzzy matching possible duplicates using SQL Server 2005

Submitted by 走远了吗. on 2019-12-06 12:24:10
I want to find possible candidate duplicate records in a large database by matching on fields like COMPANYNAME and ADDRESSLINE1. Example: for a record with the COMPANYNAME "Acme, Inc.", I would like my query to spit out other records with these COMPANYNAME values as possible dups: "Acme Corporation", "Acme, Incorporated", "Acme". I know how to do the joins, correlated subqueries, etc. to do the mechanics of pulling the set of data I want, and I know that has been covered on here before. I am interested in hearing thoughts on the best way to do the fuzzy searching - should I use full-text …
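Whatever the T-SQL mechanics end up being (SOUNDEX/DIFFERENCE, a CLR function, or full-text), the core of company-name de-duplication is usually normalization first, similarity second. A hedged Python sketch of that idea (the suffix list and 0.8 threshold are assumptions):

```python
import re
import difflib

# Common legal suffixes to strip before comparing (assumed list).
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "llc", "ltd"}

def normalize(name):
    """Lowercase, drop punctuation, and strip legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

def is_candidate_dup(a, b, threshold=0.8):
    """Flag two company names as possible duplicates."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(is_candidate_dup("Acme, Inc.", "Acme Corporation"))  # prints True
```

In SQL Server the same shape works: persist the normalized name in a computed or staging column, then join records whose normalized forms match exactly or score highly under DIFFERENCE or a CLR similarity function.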

Search for similar words using an index

Submitted by 廉价感情. on 2019-12-06 02:55:16
I need to search over a DB table using some kind of fuzzy search, like the one from Oracle, and using indexes, since I do not want a table scan (there is a lot of data). I want to ignore case, language-specific characters (ñ, ß, ...) and special characters like _, (), -, etc. A search for "maria (cool)" should return "maria- COOL" and "María_Cool" as matches. Is that possible in Oracle in some way? As for case, I think it can be solved by creating the index directly on the lower-cased value and always searching lower-cased. But I do not know how to solve the special-characters issue. I thought about storing the data without …
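The accent and punctuation problem can be attacked by indexing a folded form of the text and folding the query the same way. A Python sketch of such a folding function (in Oracle the analogous move would be a function-based index over a similar expression, or an accent- and case-insensitive linguistic sort; the exact character classes here are an assumption):

```python
import re
import unicodedata

def fold(s):
    """Fold a string for accent-, case-, and punctuation-insensitive search."""
    # Decompose accented characters and drop the combining marks (í -> i, ñ -> n).
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # casefold() also maps ß -> ss; then squash punctuation runs to single spaces.
    return re.sub(r"[^a-z0-9]+", " ", s.casefold()).strip()

print(fold("María_Cool"))   # prints: maria cool
print(fold("maria- COOL"))  # prints: maria cool
```

Both the stored column and every incoming query pass through the same fold, so "maria (cool)", "maria- COOL", and "María_Cool" all collapse to the same indexed key.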

Lucene Fuzzy Search for customer names and partial address

Submitted by 若如初见. on 2019-12-05 23:58:12
Question: I went through all the existing question posts but couldn't find anything very relevant. I have a file with millions of records containing first name, last name, address1, address2, country code, and date of birth. I would like to check my list of customers against the above file on a daily basis (my customer list is also updated daily, and the file is updated daily). For first name and last name I would like a fuzzy match (maybe a Lucene FuzzyQuery / Levenshtein distance 90% match), and for the remaining …
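For the name columns, a 90%-style similarity check can be prototyped without Lucene using difflib; a Lucene FuzzyQuery would play the same role at scale. A sketch (the 0.85 default threshold is an assumption - note that a single-letter change in an 8-letter name already scores below 0.9 under SequenceMatcher):

```python
import difflib

def name_match(a, b, threshold=0.85):
    """Fuzzy comparison for first/last names, roughly analogous to a
    Lucene FuzzyQuery with a high minimum similarity."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(name_match("Jonathan", "Jonathon"))  # prints True
```

For millions of records, run exact matches on the structured fields (country code, date of birth) first to shrink the candidate set, and apply the fuzzy name comparison only within each small bucket.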

How can I find the best fuzzy string match?

Submitted by 丶灬走出姿态 on 2019-12-05 23:37:07
Python's new regex module supports fuzzy string matching. Sing praises aloud (now). Per the docs: the ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit of the next match that it finds, and the BESTMATCH flag makes fuzzy matching search for the best match instead of the next match. The ENHANCEMATCH flag is set using (?e), as in regex.search("(?e)(dog){e<=1}", "cat and dog")[1], which returns "dog", but there's nothing on actually setting the BESTMATCH flag. How's it done? Documentation of the BESTMATCH flag's functionality is partial (but improving). Poke-and-hope shows that BESTMATCH is set …
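As a testable stdlib stand-in for "best match instead of next match" semantics (the regex module itself is third-party), here is a sketch that scans candidate substrings and keeps the highest-scoring one rather than stopping at the first acceptable hit; the ±1 window around the pattern length is an assumption:

```python
import difflib

def best_fuzzy_match(pattern, text, min_ratio=0.6):
    """Return the substring of text that best matches pattern, or None.
    Mimics best-match (rather than first-match) fuzzy semantics."""
    best, best_score = None, min_ratio
    for size in (len(pattern) - 1, len(pattern), len(pattern) + 1):
        for i in range(len(text) - size + 1):
            cand = text[i:i + size]
            score = difflib.SequenceMatcher(None, pattern, cand).ratio()
            if score > best_score:
                best, best_score = cand, score
    return best

print(best_fuzzy_match("dog", "cat and dog"))  # prints: dog
```

The key difference from first-match semantics is that every candidate window is scored before anything is returned, which is exactly why best-match modes are slower than next-match modes.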

Lucene.net Fuzzy Phrase Search

Submitted by 跟風遠走 on 2019-12-05 22:29:10
I have tried this myself for a considerable period and looked everywhere around the net, but have been unable to find ANY examples of fuzzy phrase searching via Lucene.NET 2.9.2 (C#). Is someone able to advise how to do this in detail and/or provide some example code? I would seriously appreciate any help, as I am totally stuck. I assume that you have Lucene running and have created a search index with some fields in it. So let's assume further that:

var fields = ... // a string[] of the field names you wish to search in
var version = Version.LUCENE_29; // your Lucene version
var …
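Lucene proper can express this with a SpanNearQuery over fuzzy terms (SpanMultiTermQueryWrapper around a FuzzyQuery); whether that exact combination is exposed in Lucene.NET 2.9.2 I can't confirm. To make the intended semantics concrete, here is a language-neutral sketch in Python: every word of the phrase must fuzzily match the corresponding word of some window in the text (the 0.8 per-word ratio is an assumption):

```python
import difflib

def fuzzy_phrase_match(phrase, text, word_ratio=0.8):
    """Sketch of fuzzy *phrase* matching: each word of the phrase must
    fuzzily match the corresponding word of some window of the text."""
    p_words = phrase.lower().split()
    t_words = text.lower().split()
    for i in range(len(t_words) - len(p_words) + 1):
        window = t_words[i:i + len(p_words)]
        if all(difflib.SequenceMatcher(None, p, w).ratio() >= word_ratio
               for p, w in zip(p_words, window)):
            return True
    return False

print(fuzzy_phrase_match("quick brwn fox", "the quick brown fox jumps"))
```

The Lucene equivalent keeps the same two ingredients: per-term fuzziness (FuzzyQuery) plus a positional constraint tying the terms into a phrase (span queries with a small slop).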

Fast Dynamic Fuzzy search over 100k+ strings in C#

Submitted by 旧街凉风 on 2019-12-05 21:04:05
Question: Let's say they are pre-loaded stock symbols, typed into a text box. I am looking for code that I can copy, not a library to install. This was inspired by this question: Are there any Fuzzy Search or String Similarity Functions libraries written for C#? The Levenshtein distance algorithm seems to work well, but it takes time to compute. Are there any optimizations around the fact that the query will need to be re-run as the user types each extra letter? I am interested in showing at most the top …
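One optimization that exploits the typing pattern directly: keep, for every candidate, the last row of the Levenshtein dynamic-programming table, so each extra letter costs one new row per candidate instead of a full recomputation. A sketch (the class and method names are mine):

```python
class IncrementalFuzzy:
    """Sketch: as the user types, extend each candidate's Levenshtein DP
    row by one character instead of recomputing the whole table."""

    def __init__(self, candidates):
        self.candidates = candidates
        # rows[c] is the DP row for the query typed so far vs candidate c;
        # for the empty query that is just 0..len(c).
        self.rows = {c: list(range(len(c) + 1)) for c in candidates}
        self.query = ""

    def type_char(self, ch):
        """Append one typed character; O(len(c)) work per candidate."""
        self.query += ch
        q_len = len(self.query)
        for c, prev in self.rows.items():
            row = [q_len]
            for j, cc in enumerate(c, 1):
                cost = 0 if cc == ch else 1
                row.append(min(prev[j] + 1,          # deletion
                               row[j - 1] + 1,       # insertion
                               prev[j - 1] + cost))  # substitution
            self.rows[c] = row

    def top(self, n=3):
        """Candidates with the smallest current edit distance."""
        return sorted(self.candidates, key=lambda c: self.rows[c][-1])[:n]
```

Usage: construct once with the 100k symbols, call type_char for each keystroke, and read top() for the suggestion list; a backspace can be handled by keeping a stack of previous rows.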

find best subset from list of strings to match a given string

Submitted by 大憨熊 on 2019-12-05 17:59:38
I have a string s = "mouse" and a list of strings sub_strings = ["m", "o", "se", "e"]. I need to find the best and shortest matching subset of the sub_strings list that matches s. What is the best way to do this? The ideal result would be ["m", "o", "se"], since together they spell "mose". You can use a regular expression:

import re

def matches(s, sub_strings):
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
    return re.findall(pattern, s)

This is at least short and quick, but it will not necessarily find the best set …

elasticsearch fuzzy matching max_expansions & min_similarity

Submitted by 懵懂的女人 on 2019-12-05 10:39:54
Question: I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same names. I need to understand exactly how the fuzzy matching of Elasticsearch works and how it uses the two parameters mentioned in the title. As I understand it, min_similarity is a percentage by which the queried string matches the string in the database; I couldn't find an exact description of how this value is calculated. max_expansions, as I understand it, is the Levenshtein distance by which …
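For orientation, the distance underlying fuzziness is plain Levenshtein edit distance, and in older Lucene/Elasticsearch versions min_similarity was derived from it roughly as 1 - distance / term length (the exact denominator varied by version, so treat the formula below as a simplified assumption). max_expansions, by contrast, caps how many concrete index terms the fuzzy query is rewritten into; it is not an edit distance. A sketch:

```python
def levenshtein(a, b):
    """Edit distance (insert/delete/substitute) underlying fuzziness."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            row.append(min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + cost))
        prev = row
    return prev[-1]

def min_similarity(a, b):
    """Simplified legacy-style similarity: 1 - distance / longer length.
    (The denominator Lucene used varied by version; this is an assumption.)"""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(min_similarity("elastic", "elasttc"))  # one substitution out of 7 chars
```

So a min_similarity of 0.857 on 7-character terms corresponds to one allowed edit, while max_expansions only limits how many matching index terms (elastic, elasticity, ...) the rewritten query will actually score.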