fuzzy-search

What's the easiest site search application to implement that supports fuzzy searching?

扶醉桌前 submitted on 2020-01-01 01:12:33
Question: I have a site that needs to search through about 20-30k records, which are mostly movie and TV show names. The site runs PHP/MySQL with memcache. I'm looking to replace the FULLTEXT with soundex() searching that I currently have, which works... kind of, but isn't very good in many situations. Are there any decent search scripts out there that are simple to implement and will provide a decent searching capability (across 3 columns in a table)? Answer 1: ewemli's answer is in the right direction but you
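At 20-30k titles the whole catalog fits comfortably in memory, so before reaching for a dedicated engine it is worth noting that a plain in-application similarity ranking can already beat soundex(). A minimal sketch using Python's standard-library difflib (the titles and cutoff here are illustrative, not from the question):

```python
from difflib import SequenceMatcher

def fuzzy_rank(query, titles, cutoff=0.6):
    """Rank titles by similarity ratio to the query (1.0 = identical)."""
    scored = [(SequenceMatcher(None, query.lower(), t.lower()).ratio(), t)
              for t in titles]
    return [t for score, t in sorted(scored, reverse=True) if score >= cutoff]

titles = ["The Matrix", "The Matrix Reloaded", "Mad Max", "Madagascar"]
print(fuzzy_rank("matrx", titles))  # -> ['The Matrix']
```

The same idea transfers to PHP (e.g. its similar_text()/levenshtein() built-ins); the point is ranking by character-level similarity rather than by phonetic code.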

SolrNet: How can I perform a fuzzy search in SolrNet?

人走茶凉 submitted on 2019-12-24 09:35:00
Question: I am searching a "text" field in Solr and I'm looking for a way to match, for example, "anamal" with "animal". My schema for the "text" field looks like the following: <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" /> <filter class="solr.LowerCaseFilterFactory" />
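Independently of the analyzer chain, Lucene's query syntax (which Solr, and therefore SolrNet, passes through) supports a fuzzy operator: text:anamal~1 matches indexed terms within edit distance 1. "anamal" and "animal" differ by a single substitution, as a quick Levenshtein check confirms (a plain-Python sketch, not SolrNet code):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

print(levenshtein("anamal", "animal"))  # -> 1, so anamal~1 would match
```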

Fuzzy string matching in R

跟風遠走 submitted on 2019-12-24 03:54:25
Question: I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title') as well as on release date. I am providing a sample from both datasets below. dataset-1 itemid userid rating time title release_date 99991 1673 835 3 1998-03-27 mirage 1995 99992 1674 840 4 1998-03-29 mamma roma 1962 99993 1675 851 3 1998-01-08 sunchaser, the 1996 99994 1676 851 2 1997-10-01 war at home, the 1996 99995 1677 854 3 1997-12-22 sweet nothing 1995
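In R this kind of join is typically done with the stringdist or fuzzyjoin packages. The underlying idea, matching each title against the other table and keeping the best hit above a similarity cutoff, can be sketched with the Python standard library (left-hand titles adapted from the question's sample; the right-hand variants are made up for illustration):

```python
from difflib import get_close_matches

left = {"mirage": 1995, "mamma roma": 1962, "sweet nothing": 1995}
right = ["Mirage", "Mama Roma", "Sweet Nothings"]

def fuzzy_join(left, right, cutoff=0.75):
    """Map each left title to its closest right title above the cutoff."""
    lowered = {r.lower(): r for r in right}
    pairs = {}
    for title in left:
        hit = get_close_matches(title, lowered, n=1, cutoff=cutoff)
        if hit:
            pairs[title] = lowered[hit[0]]
    return pairs

print(fuzzy_join(left, right))
```

At 100K rows per side, an all-pairs comparison is quadratic; blocking on release date first (as the asker intends) keeps the candidate sets small.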

Classifying identical patterns in words using R

与世无争的帅哥 submitted on 2019-12-23 17:57:59
Question: I want to conduct a text mining analysis, but I am running into some trouble. Using dput(), I load a small part of my text. text<-structure(list(ID_C_REGCODES_CASH_VOUCHER = c(3941L, 3941L, 3941L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3945L, 3953L, 3953L, 3953L, 3953L, 3953L, 3953L, 3960L, 3960L, 3960L, 3960L, 3960L, 3960L, 3967L, 3967L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), GOODS_NAME =

Elasticsearch query using match_phrase_prefix and fuzziness at the same time?

為{幸葍}努か submitted on 2019-12-23 11:59:12
Question: I am new to Elasticsearch, so I am struggling a bit to find the optimal query for our data. Imagine I want to match the name "Handelsstandens Boldklub". Currently, I'm using the following query: { query: { bool: { should: [ { match: { name: { query: query, slop: 5, type: "phrase_prefix" } } }, { match: { name: { query: query, fuzziness: "AUTO", operator: "and" } } } ] } } } It currently lists the name if I search for "Hand", but if I search for "Handle" the name will no longer
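The likely reason for the described behavior: "Hand" is a literal prefix of "Handelsstandens", so the phrase-prefix clause fires, while "Handle" is neither a prefix nor (for a term this long) within the AUTO edit-distance budget of the whole indexed term, so both clauses miss; note also that phrase-prefix matching does not accept a fuzziness parameter, which is why the two clauses must stay separate. A sketch of the request body builder (JSON construction only; the deprecated match-with-type syntax from the question is written as a dedicated match_phrase_prefix clause, which is the current form):

```python
import json

def build_search_body(user_input):
    """Bool query: a prefix-phrase clause for as-you-type matching,
    plus a fuzzy match clause for typo tolerance."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase_prefix": {"name": {"query": user_input,
                                                      "slop": 5}}},
                    {"match": {"name": {"query": user_input,
                                        "fuzziness": "AUTO",
                                        "operator": "and"}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }

print(json.dumps(build_search_body("Handelsstandens Boldklub"), indent=2))
```

For true typo-tolerant autocomplete, indexing edge n-grams of the name field is the usual alternative, since it turns prefix-with-typos into ordinary term matching.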

A good SQL strategy for fuzzy matching possible duplicates using SQL Server 2005

房东的猫 submitted on 2019-12-22 17:56:21
Question: I want to find possible candidate duplicate records in a large database by matching on fields like COMPANYNAME and ADDRESSLINE1. Example: for a record with the COMPANYNAME "Acme, Inc.", I would like my query to spit out other records with these COMPANYNAME values as possible dupes: "Acme Corporation", "Acme, Incorporated", "Acme". I know how to do the joins, correlated subqueries, etc. to do the mechanics of pulling the set of data I want. And I know that has been covered on here before
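Besides SQL Server's built-in SOUNDEX() and DIFFERENCE() functions, a common first pass for exactly this example is normalization: strip punctuation and legal suffixes so variants collapse to one key, then run any expensive pairwise comparison only within each group. A sketch of the idea in Python (the suffix list is illustrative; in T-SQL the same normalization could live in a computed column that you index and join on):

```python
import re
from collections import defaultdict

SUFFIXES = {"inc", "incorporated", "corp", "corporation",
            "co", "company", "ltd", "llc"}

def normalize(name):
    """Lowercase, drop punctuation, and remove common legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

names = ["Acme, Inc.", "Acme Corporation", "Acme, Incorporated",
         "Acme", "Apex Ltd"]
groups = defaultdict(list)
for n in names:
    groups[normalize(n)].append(n)

# Groups with more than one member are duplicate candidates.
print({k: v for k, v in groups.items() if len(v) > 1})
```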

Search for similar words using an index

落花浮王杯 submitted on 2019-12-22 11:11:48
Question: I need to search over a DB table using some kind of fuzzy search like Oracle's, and using indexes, since I do not want a table scan (there is a lot of data). I want to ignore case, language-specific characters (ñ, ß, ...) and special characters like _, (), -, etc. A search for "maria (cool)" should return "maria- COOL" and "María_Cool" as matches. Is that possible in Oracle in some way? As for case, I think it can be solved by creating the index directly in lower case and always searching in lower
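On the Oracle side this is usually handled with a function-based index over a folded version of the column (or with accent-insensitive linguistic settings, e.g. an NLS_SORT value ending in _AI), so the index and the query apply the same normalization. What that folding needs to do for the question's examples can be sketched in Python:

```python
import re
import unicodedata

def fold(s):
    """Lowercase, strip accents, and collapse runs of non-alphanumerics,
    mirroring what a folded function-based index would store."""
    s = unicodedata.normalize("NFKD", s.lower())
    s = "".join(c for c in s if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", s).strip()

print(fold("María_Cool"))   # -> "maria cool"
print(fold("maria- COOL"))  # -> "maria cool"
```

With the same fold applied at query time, "maria (cool)" hits both stored variants via an ordinary indexed equality or prefix comparison.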

find best subset from list of strings to match a given string

霸气de小男生 submitted on 2019-12-22 08:43:56
Question: I have a string s = "mouse" and a list of strings sub_strings = ["m", "o", "se", "e"]. I need to find the best and shortest matching subset of the sub_strings list that matches s. What is the best way to do this? The ideal result would be ["m", "o", "se"], since together they spell "mose". Answer 1: You can use a regular expression:

import re

def matches(s, sub_strings):
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
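The answer is cut off above; it presumably finishes by running the alternation over s. A completed sketch (greedy longest-first, which handles this example but is not guaranteed optimal for every input):

```python
import re

def matches(s, sub_strings):
    # Longest-first so the alternation prefers "se" over "e".
    sub_strings = sorted(sub_strings, key=len, reverse=True)
    pattern = '|'.join(re.escape(substr) for substr in sub_strings)
    return re.findall(pattern, s)

print(matches("mouse", ["m", "o", "se", "e"]))  # -> ['m', 'o', 'se']
```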

Fuzzy sentence search algorithms

牧云@^-^@ submitted on 2019-12-21 20:15:14
Question: Suppose I have a set of about 10,000 phrases, averaging 7-20 words each, in which I want to find a given phrase. The phrase I am looking for could have some errors: for example, it could be missing one or two words, have some words misplaced, or contain some random extra words. For example, my database contains "As I was riding my red bike, I saw Christine", and I want it to match "As I was riding my blue bike, saw Christine", or "I was riding my bike, I saw Christine and Marion". What could be some good
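For a set this small (10,000 phrases), a bag-of-words similarity already tolerates missing, extra, and reordered words, and makes a reasonable baseline before more elaborate approaches (word n-gram indexes, edit distance over word sequences). A sketch using word-level Jaccard similarity (the second database phrase is made up for contrast):

```python
import re

def jaccard(a, b):
    """Word-level Jaccard similarity: tolerant of missing, extra,
    or reordered words, though blind to word order."""
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    return len(ta & tb) / len(ta | tb)

db = ["As I was riding my red bike, I saw Christine",
      "The weather was lovely on the day of the race"]
query = "As I was riding my blue bike, saw Christine"
best = max(db, key=lambda phrase: jaccard(phrase, query))
print(best)  # -> the red-bike phrase
```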

Using the Levenshtein function on each element in a tsvector?

守給你的承諾、 submitted on 2019-12-21 06:12:13
Question: I'm trying to create a fuzzy search using Postgres and have been using django-watson as a base search engine to work off of. I have a field called search_tsv that is a tsvector containing all the field values of the model that I want to search on. I want to use the Levenshtein function, which does exactly what I want on a text field. However, I don't really know how to run it on each individual element of the tsvector. Is there a way to do this? Answer 1: I would consider using the
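In Postgres, levenshtein() comes from the fuzzystrmatch extension and operates on plain text, so applying it per lexeme means expanding the tsvector into individual lexemes first (recent versions can convert a tsvector to an array and unnest it); pg_trgm's trigram similarity is often the more idiomatic fuzzy-search tool here. The per-element check itself looks like this (a Python sketch of the logic, not the Postgres implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def close_lexemes(query, lexemes, max_dist=2):
    """Keep lexemes within max_dist edits of the query word."""
    return [lex for lex in lexemes if levenshtein(query, lex) <= max_dist]

print(close_lexemes("serch", ["search", "engine", "watson"]))  # -> ['search']
```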