fuzzy-search

elasticsearch fuzzy matching max_expansions & min_similarity

泪湿孤枕 submitted on 2019-12-03 22:57:40
I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same names. I need to understand exactly how fuzzy matching in Elasticsearch works and how it uses the two parameters mentioned in the title. As I understand it, min_similarity is a percentage by which the queried string matches the string in the database. I couldn't find an exact description of how this value is calculated. max_expansions, as I understand it, is the Levenshtein distance by which a search should be executed. If this actually was Levenshtein distance it would have been the ideal

SQL Fuzzy Join - MSSQL

梦想的初衷 submitted on 2019-12-03 20:57:25
I have two sets of data: existing customers and potential customers. My main objective is to figure out whether any of the potential customers are already existing customers. However, the naming conventions of customers across the data sets are inconsistent.

EXISTING CUSTOMERS
Customer / ID
Ed's Barbershop / 1002
GroceryTown / 1003
Candy Place / 1004
Handy Man / 1005

POTENTIAL CUSTOMERS
Customer
Eds Barbershop
Grocery Town
Candy Place
Handee Man
Beauty Salon
The Apple Farm
Igloo Ice Cream
Ride-a-Long Bikes

I would like to write some type of select statement like below to reach my objective: SELECT a

How to get Lucene Fuzzy Search result's matching terms?

走远了吗. submitted on 2019-12-03 20:10:23
How do you get the matching fuzzy term and its offset when using Lucene Fuzzy Search?

    IndexSearcher mem = ....(some standard code)
    QueryParser parser = new QueryParser(Version.LUCENE_30, CONTENT_FIELD, analyzer);
    // the ~ triggers the fuzzy search as per "Lucene In Action"
    TopDocs topDocs = mem.search(parser.parse("wuzzy~"), 1);

The fuzzy search works fine. If a document contains the term "fuzzy" or "luzzy", it is matched. How do I get which term matched and what its offsets are? I have made sure that all CONTENT_FIELDs are added with termVectorStored with positions and offsets.
user193116

Fuzzy text search in python

非 Y 不嫁゛ submitted on 2019-12-03 14:29:36
I am wondering whether there is any Python library that can conduct fuzzy text search. For example, I have three keywords: "letter", "stamp", and "mail". I would like a function that checks whether those three words occur within the same paragraph (or within a certain distance, such as one page). In addition, the words have to appear in that order. It is fine if other words appear between them. I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the proper function...

You can do this in Whoosh 2.7. It has fuzzy search by adding the
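To make the requirement in this question concrete, here is a minimal sketch that uses only the standard library, not Whoosh or fuzzywuzzy. It checks whether each keyword fuzzily matches some word of a paragraph, in the given order, with other words allowed in between. The function names and the 0.8 similarity threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def fuzzy_equal(word, keyword, threshold=0.8):
    """Return True if word is close enough to keyword (ratio lies in [0, 1])."""
    return SequenceMatcher(None, word.lower(), keyword.lower()).ratio() >= threshold


def contains_in_order(paragraph, keywords, threshold=0.8):
    """Check that every keyword fuzzily matches some word, in the given order."""
    words = re.findall(r"\w+", paragraph)
    position = 0
    for keyword in keywords:
        # Advance until a word matches the current keyword.
        while position < len(words) and not fuzzy_equal(words[position], keyword, threshold):
            position += 1
        if position == len(words):
            return False
        position += 1  # later keywords must appear after this match
    return True


text = "I wrote a leter, put a stamp on it and sent it by mail."
print(contains_in_order(text, ["letter", "stamp", "mail"]))
print(contains_in_order(text, ["mail", "stamp", "letter"]))
```

Taking the earliest match for each keyword is enough to decide whether an ordered match exists. Limiting the distance between keywords (for example, to one page) would need an extra check on the matched positions.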

Is it possible to perform T-SQL fuzzy lookup without SSIS?

喜你入骨 submitted on 2019-12-03 12:40:55
Question: SSIS 2005/2008 does fuzzy lookups and groupings. Is there a feature that does the same in T-SQL? Answer 1: Fuzzy lookup uses a q-gram approach, breaking strings up into tiny substrings and indexing them. You can then search input by breaking it up into equally sized strings. You can inspect the format of their index and write a CLR function to use the same style of index, but you might be talking about a fair chunk of work. It is actually quite interesting how they did it, very simple yet
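The q-gram idea described in the answer can be sketched in a few lines of Python. This is only an illustration of the general technique, not the index format SSIS actually uses; the "#" padding character, the function names, and the `min_shared` cutoff are assumptions.

```python
from collections import Counter, defaultdict


def qgrams(text, q=2):
    """Split text into overlapping substrings of length q, padded at both ends."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_index(names, q=2):
    """Map each q-gram to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for gram in set(qgrams(name, q)):
            index[gram].add(name)
    return index


def lookup(index, query, q=2, min_shared=3):
    """Return (name, shared_count) pairs ranked by distinct q-grams shared with query."""
    shared = Counter()
    for gram in set(qgrams(query, q)):
        for name in index.get(gram, ()):
            shared[name] += 1
    return [(name, count) for name, count in shared.most_common() if count >= min_shared]


index = build_index(["Handy Man", "Candy Place", "GroceryTown"])
print(lookup(index, "Handee Man"))
```

Only names that share at least one q-gram with the query are ever touched, which is what lets an index of this kind avoid comparing the query against every stored string.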

q-gram approximate matching optimisations

不问归期 submitted on 2019-12-03 11:04:27
Question: I have a table containing 3 million person records on which I want to perform fuzzy matching using q-grams (on surname, for instance). I have created a table of 2-grams linking to this, but search performance is not great on this data volume (around 5 minutes). I basically have two questions: (1) Can you suggest any ways to improve performance to avoid a table scan (i.e. having to count common q-grams between the search string and 3 million surnames)? (2) With q-grams, if A is similar to B and

Fast fuzzy/approximate search in dictionary of strings in Ruby

↘锁芯ラ submitted on 2019-12-03 06:55:14
I have a dictionary of 50K to 100K strings (each can be up to 50+ characters) and I am trying to find whether a given string is in the dictionary with some "edit" distance tolerance (Levenshtein, for example). I am fine with pre-computing any type of data structure before doing the search. My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would be fine with just getting a boolean that says whether a given string is in the dictionary or not, if there was a significantly faster algorithm to do so. For this, I first tried to compute all the Levenshtein
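For reference, a plain-Python baseline for the problem described here might look like the sketch below: a Levenshtein distance with an early exit once a distance bound cannot be met, and a linear scan that tightens the bound as better matches are found. It is written in Python rather than Ruby purely for illustration, and the function names and the default tolerance of 2 are assumptions. It is a baseline to compare against, not the significantly faster algorithm the question asks for.

```python
def levenshtein(a, b, max_distance=None):
    """Edit distance between a and b; returns None if it exceeds max_distance."""
    if max_distance is not None and abs(len(a) - len(b)) > max_distance:
        return None  # the length difference alone is a lower bound
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # The smallest value in a row is a lower bound on the final distance.
        if max_distance is not None and min(current) > max_distance:
            return None
        previous = current
    distance = previous[-1]
    if max_distance is not None and distance > max_distance:
        return None
    return distance


def closest(word, dictionary, max_distance=2):
    """Return (entry, distance) for the nearest entry within max_distance, or None."""
    best = None
    for entry in dictionary:
        # Only a strictly better match than the current best is interesting.
        limit = max_distance if best is None else best[1] - 1
        if limit < 0:
            break  # an exact match was found; nothing can beat it
        distance = levenshtein(word, entry, limit)
        if distance is not None:
            best = (entry, distance)
    return best


print(closest("fuzzi", ["buzz", "fuzzy", "fizzy"]))
```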

ElasticSearch's Fuzzy Query

♀尐吖头ヾ submitted on 2019-12-03 06:51:34
Question: I am brand new to ElasticSearch, and am currently exploring its features. One of them I am interested in is the Fuzzy Query, which I am testing and having trouble using. It is probably a dummy question, so I guess someone who has already used this feature will quickly find the answer, at least I hope. :) BTW I have the feeling that it might not be related only to ElasticSearch but maybe directly to Lucene. Let's start with a new index named "first index" in which I store an object "label"

how to do fuzzy search in big data

浪子不回头ぞ submitted on 2019-12-03 05:58:42
Question: I'm new to this area and I am wondering mostly what the state of the art is and where I can read about it. Let's assume that I just have a key/value store and I have some distance(key1,key2) defined somehow (not sure if it must be a metric, i.e. if the triangle inequality must always hold). What I want is mostly a search(key) function which returns all items with keys up to a certain distance from the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy
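One classic structure for this kind of range search is a BK-tree, sketched below. It is offered as one possible approach, not as the state of the art the question asks about, and it does assume that the distance is a true metric with integer values, because the pruning step relies on the triangle inequality. The class name and the integer example keys are illustrative only.

```python
class BKTree:
    """Index keys under an integer-valued metric so range searches can skip subtrees."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (key, {edge_distance: child_node})

    def add(self, key):
        if self.root is None:
            self.root = (key, {})
            return
        node = self.root
        while True:
            d = self.distance(key, node[0])
            if d == 0:
                return  # key already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (key, {})
                return
            node = child

    def search(self, key, limit):
        """Return all (distance, stored_key) pairs with distance <= limit."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node_key, children = stack.pop()
            d = self.distance(key, node_key)
            if d <= limit:
                results.append((d, node_key))
            # Triangle inequality: matches can only sit under edges in [d - limit, d + limit].
            for edge, child in children.items():
                if d - limit <= edge <= d + limit:
                    stack.append(child)
        return sorted(results)


# Toy metric for illustration: absolute difference between integer keys.
tree = BKTree(lambda a, b: abs(a - b))
for key in [50, 20, 80, 22, 55, 79]:
    tree.add(key)
print(tree.search(21, 2))
```

If the distance is not a metric, the pruning test can discard subtrees that contain valid matches, so a structure like this would no longer return complete results.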

What's the easiest site search application to implement that supports fuzzy searching?

ぐ巨炮叔叔 submitted on 2019-12-03 03:40:32
I have a site that needs to search through about 20-30k records, which are mostly movie and TV show names. The site runs PHP/MySQL with memcache. I'm looking to replace the FULLTEXT with soundex() searching that I currently have, which works... kind of, but isn't very good in many situations. Are there any decent search scripts out there that are simple to implement and will provide a decent search capability (across 3 columns in a table)?

ewemli's answer is in the right direction, but you should be combining FULLTEXT and soundex mapping, not replacing the fulltext, otherwise your LIKE queries are