问题
I have a mapping of catalog numbers to product names:
35 cozy comforter
35 warm blanket
67 pillow
and need a search that would find misspelled, mixed names like "warm cmfrter".
We have code using edit-distance (difflib), but it probably won't scale to the 18000 names.
I achieved something similar with Lucene, but as PyLucene only wraps Java that would complicate deployment to end-users.
SQLite doesn't usually have full-text or scoring compiled in.
The Xapian bindings are like C++ and have some learning curve.
Whoosh is not yet well-documented but includes an abusable spell-checker.
What else is there?
回答1:
Apparently the only way to make fuzzy comparisons fast is to do less of them ;)
Instead of writing another n-gram search or improving the one in Whoosh we now keep a word index, retrieve all entries that have at least one (correctly spelled) word in common with the query, and use difflib to rank those. Works well enough in this case.
回答2:
You will get too many false positives with the SOUNDEX implementation. There are only 26,000 (at most) possible SOUNDEX codes.
Though the Metaphone algorithim was designed for English surnames, it works pretty well for misspellings; I used it once in a branch locator and it was quite successful.
Add a field with the Metaphone translation and match against it if no exact match is found. You will still get false positives, but fewer that with other algorithims.
回答3:
Nucular has full text search, but it doesn't support misspelled matches out-of-the box. You could try adding an additional field to each entry which indexes the SOUNDEX translation of the terms and then search using the soundex translation of the user input. I really don't know how well this would work...
Take a look at advas http://advas.sourceforge.net/news.php which has a nice demo comparing various soundex-alike methods:
advas/examples Aaron$ python phonetic_algorithms.py
soundex metaphone nyiis caverphone
====================================================================================================
schmidt : S253 sxmtt sssnad SKMT111111
schmid : S253 sxmt sssnad SKMT111111
schmitt : S253 sxmt sssnatt SKMT111111
smith : S530 sm0h snatt SMT1111111
smythe : S530 smy0h snatt SMT1111111
schmied : S253 sxmt sssnaad SKMT111111
mayer : M600 myr naaar MA11111111
meier : M600 mr naaar MA11111111
....
I don't know if any of them is good for your unnamed language...
回答4:
The usual way of computing the edit-distance between two strings is a rather expensive algorithm (if I remember correctly, its time complexity is quadratic). Maybe if you used a different string similarity metric then your problem would go away.
One of my favorite methods for fuzzy string matching is trigrams matching. Comparing two strings using this method has a linear time complexity, which is much better than the mentioned edit-distance. You can find my Python implementation on Github. There's also a PostgreSQL contrib module exactly for that. It shouldn't be too difficult to adapt it to work with SQLite3.
回答5:
Sybase SQL Anywhere has a free Web edition/Developer Edition and comes with full text indexing/search and a FUZZY operator (and some implementation constraints).
To quote from the documentation:
Specifying 'FUZZY "500 main street"' is equivalent to
'500 OR mai OR ain OR str OR tre OR ree OR eet'.
An other approach would be to use scoring on full-text searches.
回答6:
sqlite3 supports python callback functions. Matthew Barnett's regex (http://code.google.com/p/mrab-regex-hg/) now supports approximate matching.
So, like this:
try:
import regex
except ImportError:
sys.stderr.write("Can't import mrab-regex; see http://pypi.python.org/pypi/regex\n")
sys.exit(1)
def _sqlite3_regex(expr, item):
return (not (not regex.search(expr, item)))
def main():
...
database = sqlite3.connect(dbfile)
database.create_function("regexp", 2, _sqlite3_regex)
pattern = '(?:%s){e<=%d}' % (queriedname, distance)
print [x for x in database.cursor().execute(
"SELECT * FROM products WHERE (productname regexp '%s')" % pattern)]
来源:https://stackoverflow.com/questions/834570/how-to-do-fuzzy-string-search-without-a-heavy-database