How to find best fuzzy match for a string in a large string database


This paper seems to describe exactly what you want.

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.

You didn't mention your database system, but for PostgreSQL you could use the following contrib module: pg_trgm - Trigram matching for PostgreSQL

The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.
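To make the idea concrete, here is a minimal Python sketch of trigram similarity: pad the string, split it into 3-character windows, and compare the two sets. This only illustrates the principle; pg_trgm's own padding and word-splitting rules differ slightly, so its reported similarity values will not match these exactly.

```python
def trigrams(text):
    """Split a lowercased, padded string into its 3-character windows."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Share of distinct trigrams the two strings have in common (Jaccard-style)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(trigram_similarity("hello", "hallo"))   # 0.333... with this simplified padding
```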

If your database supports it, you should use full-text search. Otherwise, you can use an indexer like Lucene or one of its various implementations.

Compute the SOUNDEX hash (which is built into many SQL database engines) and index by it.

SOUNDEX is a hash based on the sound of a word, so misspellings of the same word are likely to have the same SOUNDEX hash.

Then find the SOUNDEX hash of the search string, and match on it.
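For illustration, here is a simplified Python implementation of the classic American Soundex rules; database built-ins may differ in edge cases (for example in how they treat H and W), so treat this as a sketch rather than a drop-in replacement.

```python
def soundex(word):
    """Simplified American Soundex: keep the first letter, encode the rest as digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    first, digits = word[0], []
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "HW":            # H and W do not separate letters with the same code
            prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # both give R163
```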

rhinof

Since the amount of data is large, I would compute the phonetic hash when inserting a record, store it in an indexed column, and then constrain my SELECT queries (WHERE clause) to a range on that column.
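A minimal sketch of that approach, using SQLite from Python's standard library and reusing the soundex() function from the sketch above; the table and column names (names, phonetic_key) are made up for the example.

```python
import sqlite3

# Assumes the soundex() function defined in the previous sketch is in scope.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT, phonetic_key TEXT)")
conn.execute("CREATE INDEX idx_phonetic ON names (phonetic_key)")

# Compute the phonetic key once, at insert time.
for name in ["Robert", "Rupert", "Rubin", "Ashcraft"]:
    conn.execute("INSERT INTO names VALUES (?, ?)", (name, soundex(name)))

# Equality on the indexed column lets the database avoid scanning every row.
query = "Roberto"
rows = conn.execute(
    "SELECT name FROM names WHERE phonetic_key = ?", (soundex(query),)
).fetchall()
print(rows)   # e.g. [('Robert',), ('Rupert',)]
```

In a real system the key would usually be maintained by the database itself (for example via a trigger or a generated column) so it cannot drift out of sync with the name.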

A very extensive explanation of relevant algorithms is in the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.

https://en.wikipedia.org/wiki/Levenshtein_distance

The Levenshtein algorithm is implemented in some DBMSs

(e.g. PostgreSQL's fuzzystrmatch module: http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html)
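For reference, here is a standard dynamic-programming sketch of Levenshtein distance in Python; it computes the same quantity as the DBMS functions linked above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                      # iterate over the longer string
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + (ca != cb),    # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))   # 3
```

With the fuzzystrmatch extension installed, PostgreSQL exposes the same computation directly in SQL, e.g. levenshtein('kitten', 'sitting') returns 3.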
