How do I approximate “Did you mean?” without using Google?

Backend · Unresolved

甜味超标 2021-01-31 10:13

I am aware of the duplicates of this question:

How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and

7 answers

[愿得一人] (original poster)

2021-01-31 10:34
Datasets/tools that might be useful:
- WordNet
- Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
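
As a rough sketch of that idea, the snippet below merges a dictionary word list with terms counted from a corpus. The function name `build_vocabulary`, the `min_count` threshold, and the simple tokenizer are assumptions made for illustration; in practice the word list would come from WordNet and the text from a corpus such as ukWaC.

```python
import re
from collections import Counter


def build_vocabulary(dictionary_words, corpus_text, min_count=2):
    """Merge dictionary words with frequent corpus terms (illustrative only).

    Returns a dict mapping term -> corpus frequency. Dictionary words that
    never occur in the corpus are kept with a count of 0.
    """
    # Naive tokenizer: lowercase alphabetic runs. A real corpus needs more care.
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    corpus_counts = Counter(tokens)

    vocab = {word.lower(): 0 for word in dictionary_words}
    for term, count in corpus_counts.items():
        # Keep corpus terms that are in the dictionary or frequent enough.
        if term in vocab or count >= min_count:
            vocab[term] = count
    return vocab


vocab = build_vocabulary(
    ["hash", "search"],
    "search search lsh lsh query",
)
```

The counts can later act as a tie-breaker when several candidate corrections are equally close to the query.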

As a first attempt you can use the Peter Norvig link mentioned before, but with a large dictionary this won't be a good solution.
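
For context, here is a minimal sketch in the style of that corrector, limited to edit distance 1. The names `edits1` and `correct` and the toy `vocab` dictionary are illustrative assumptions, not a copy of the linked code. Every lookup enumerates all strings one edit away from the input, and going to two edits repeats that for each candidate.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    ]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, vocab):
    """Pick the most frequent known term within one edit, else return `word`."""
    if word in vocab:
        return word
    candidates = [c for c in edits1(word) if c in vocab]
    return max(candidates, key=vocab.get) if candidates else word


# Toy vocabulary: term -> frequency (made-up numbers for illustration).
vocab = {"search": 5, "perch": 1}
suggestion = correct("serch", vocab)
```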

Instead, I suggest you use something like locality-sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms, plus strings of terms extracted from your data that you think people may search for; you'll have to choose a cut-off length for those strings. Alternatively, if you have data on what people actually search for, you could use that. For each string of terms, generate a vector (character bigrams or trigrams would probably do the trick) and store it in the LSH index.
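
The vector step might look like the sketch below, which turns a string into a bag of character trigrams and compares two such bags with cosine similarity. The padding characters `^` and `$` and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return a sparse count vector of character n-grams for `text`."""
    # Pad so that the first and last characters also appear in n-grams.
    padded = "^" * (n - 1) + text.lower() + "$" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


close = cosine(char_ngrams("search"), char_ngrams("serach"))
far = cosine(char_ngrams("search"), char_ngrams("banana"))
```

A transposition such as "serach" still shares its leading and trailing trigrams with "search", so it scores well above an unrelated word.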

Given any query, you can then run an approximate nearest-neighbour search over the LSH described by Charikar to find the closest neighbour among your set of possible matches.
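
A minimal sketch of such a lookup follows. Charikar's scheme hashes vectors with random hyperplanes; this sketch swaps in the common SimHash shortcut of deriving ±1 projections from a hash of each trigram, and uses banding to find candidates. `SimHashIndex`, `suggest`, the 64-bit signature, and the 16-band split are all assumptions for illustration and would need tuning on real query data.

```python
import hashlib
from collections import Counter, defaultdict

BITS = 64  # signature length (assumed; must be a multiple of 8 here)


def char_ngrams(text, n=3):
    """Sparse count vector of padded character n-grams."""
    padded = "^" * (n - 1) + text.lower() + "$" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def simhash(text, bits=BITS):
    """Signature whose bits are the signs of hash-derived +/-1 projections."""
    totals = [0] * bits
    for gram, count in char_ngrams(text).items():
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            totals[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


class SimHashIndex:
    """Buckets terms by bands of their signature for approximate lookup."""

    def __init__(self, terms, bands=16, bits=BITS):
        self.bits = bits
        self.bands = bands
        self.width = bits // bands  # narrow bands suit short strings
        self.signatures = {}
        self.buckets = defaultdict(set)
        for term in terms:
            sig = simhash(term, bits)
            self.signatures[term] = sig
            for key in self._band_keys(sig):
                self.buckets[key].add(term)

    def _band_keys(self, sig):
        mask = (1 << self.width) - 1
        return [(b, (sig >> (b * self.width)) & mask) for b in range(self.bands)]

    def suggest(self, query):
        """Return the indexed term with the closest signature, or None."""
        sig = simhash(query, self.bits)
        candidates = set()
        for key in self._band_keys(sig):
            candidates |= self.buckets.get(key, set())
        if not candidates:
            return None
        # Rank candidates by Hamming distance between signatures.
        return min(candidates, key=lambda t: bin(self.signatures[t] ^ sig).count("1"))


index = SimHashIndex(["search", "hashing", "neighbour", "corpus"])
exact = index.suggest("search")
fuzzy = index.suggest("serach")  # may be None if no band collides
```

Because the search is approximate, a misspelled query can miss every bucket; more bands raise recall at the cost of more candidates to rank.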

Note: links removed as I'm a new user - sorry.