How do I approximate “Did you mean?” without using Google?

甜味超标 2021-01-31 10:13

I am aware of the duplicates of this question:

  • How does the Google “Did you mean?” Algorithm work?
  • How do you implement a “Did you mean”?
  • ... and
7 answers
  •  [愿得一人]
    2021-01-31 10:34

    Datasets/tools that might be useful:

    • WordNet
    • Corpora such as the ukWaC corpus

    You can use WordNet as a simple dictionary of terms, and you can augment it with frequent terms extracted from a corpus.
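
    As a rough illustration of that idea, here is a minimal sketch that merges a base word list with terms that occur often enough in a corpus. The function name `build_vocabulary`, the `min_count` threshold, and the tiny inline word list and corpus are all made up for the example; in practice they would be replaced by WordNet lemmas and text from a corpus such as ukWaC.

```python
import re
from collections import Counter


def build_vocabulary(dictionary_words, corpus_text, min_count=2):
    """Merge a base dictionary with corpus terms seen at least min_count times.

    Returns a dict mapping each term to its corpus frequency
    (0 for dictionary words that never occur in the corpus).
    """
    counts = Counter(re.findall(r"[a-z]+", corpus_text.lower()))
    # Start from the dictionary, attaching corpus frequencies where known.
    vocab = {w.lower(): counts.get(w.lower(), 0) for w in dictionary_words}
    # Add corpus terms that are frequent enough, even if the dictionary lacks them.
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = count
    return vocab


# Hypothetical stand-ins for a real dictionary and a real corpus.
base = ["search", "engine", "query"]
corpus = "lsh search lsh query search lsh typo"
vocab = build_vocabulary(base, corpus)
```

    The frequencies can later be used to rank competing suggestions, so that a common term wins over a rare one at the same distance from the query.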

    You can use the Peter Norvig link mentioned before as a first attempt, but it is unlikely to scale well to a large dictionary.

    Instead, I suggest using something like locality-sensitive hashing (LSH). It is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms, and of strings of terms, extracted from your data that you think people may search for; you'll have to choose a cut-off length for the strings. Alternatively, if you have data on what people actually search for, you could use that. For each string of terms, generate a feature vector (character bigrams or trigrams would probably do the trick) and store it in the LSH index.

    Given any query, you can then run an approximate nearest-neighbour search, using the LSH scheme described by Charikar, to find the closest match in your set of candidates.
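
    To make this more concrete, here is a minimal sketch of one common realization of Charikar's random-hyperplane idea, often called simhash, applied to character trigrams. Several details are my own assumptions rather than part of the answer: the 64-bit fingerprint, the use of `blake2b` as the feature hash, the 8-band bucketing used to find candidates, and the names `char_ngrams`, `simhash`, and `SimhashIndex`. Short strings have few trigrams, so the fingerprints are noisy; treat it as a starting point, not a tuned implementation.

```python
import hashlib
from collections import defaultdict

BITS = 64   # fingerprint width (assumption)
BANDS = 8   # fingerprint is split into 8 bands of 8 bits (assumption)


def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def simhash(text):
    """Fingerprint in which every n-gram votes +1 or -1 on each bit."""
    votes = [0] * BITS
    for gram in char_ngrams(text):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set when the positive votes outnumber the negative ones.
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def bands(fingerprint):
    """Split a fingerprint into (band index, band value) bucket keys."""
    width = BITS // BANDS
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(BANDS)]


class SimhashIndex:
    def __init__(self):
        self.buckets = defaultdict(set)   # bucket key -> indexed terms
        self.fingerprints = {}            # term -> fingerprint

    def add(self, term):
        fp = simhash(term)
        self.fingerprints[term] = fp
        for key in bands(fp):
            self.buckets[key].add(term)

    def query(self, text, k=3):
        """Return up to k (Hamming distance, term) pairs, closest first."""
        fp = simhash(text)
        candidates = set()
        for key in bands(fp):
            # Only terms sharing at least one band with the query are compared.
            candidates |= self.buckets.get(key, set())
        scored = sorted(
            (bin(fp ^ self.fingerprints[t]).count("1"), t) for t in candidates
        )
        return scored[:k]


index = SimhashIndex()
for term in ["google", "goggles", "gopher", "spelling correction"]:
    index.add(term)
suggestions = index.query("spelling corection")  # may be empty if no band matches
```

    The lookup is approximate by design: a query that shares no band with any indexed term returns nothing, so the number of bands trades recall against the number of candidates that must be compared.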

    Note: links removed as I'm a new user - sorry.
