How do I approximate “Did you mean?” without using Google?

甜味超标 2021-01-31 10:13

I am aware of the duplicates of this question:

  • How does the Google “Did you mean?” Algorithm work?
  • How do you implement a “Did you mean”?
  • ... and
7 answers
  • 2021-01-31 10:34

    Datasets/tools that might be useful:

    • WordNet
    • Corpora such as the ukWaC corpus

    You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.

    You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.

    Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.

    Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
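
    As a rough illustration of the idea above, here is a minimal sketch of a Charikar-style fingerprint (SimHash) built from character trigrams. The term list, the use of MD5 as the per-trigram hash, and the function names are all assumptions made for the example. For brevity it compares fingerprints with a linear scan; a real LSH index would bucket the fingerprints so that only near neighbours are compared.

```python
import hashlib

def char_ngrams(text, n=3):
    """Return the character n-grams of a space-padded, lower-cased string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def simhash(text, bits=64):
    """Fingerprint: each bit is the sign of the summed +1/-1 votes of the n-gram hashes."""
    totals = [0] * bits
    for gram in char_ngrams(text):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for b in range(bits):
            totals[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if totals[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def suggest(query, index):
    """Return the indexed term whose fingerprint is closest to the query's (linear scan)."""
    q = simhash(query)
    return min(index, key=lambda term: hamming(q, index[term]))

# Hypothetical term list; in practice this would come from your own data or query logs.
terms = ["apple", "application", "banana", "reconnect"]
index = {t: simhash(t) for t in terms}

print(suggest("aple", index))
```

    Similar strings share most of their trigrams, so their fingerprints tend to differ in few bits. Very short strings have few trigrams, so the estimate is noisy for them.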

    Note: links removed as I'm a new user - sorry.

  • 2021-01-31 10:36

    From the horse's mouth: How to Write a Spelling Corrector

    The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
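
    A condensed sketch in the spirit of that article: word frequencies come from a corpus of mostly-correct text, and candidates are generated by one or two simple edits. The tiny `CORPUS` string is a stand-in assumption; you would load real text (for example a few books) instead.

```python
import re
from collections import Counter

# Stand-in corpus (assumption); replace with a large body of mostly-correct text.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the words that appear in the corpus."""
    return {w for w in words if w in WORDS}

def correct(word):
    """Prefer the word itself, then known words at distance 1, then distance 2."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    # Among the candidates, pick the one seen most often in the corpus.
    return max(candidates, key=lambda w: WORDS[w])

print(correct("teh"))
print(correct("dgo"))
```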

  • 2021-01-31 10:37

    @Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
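
    For reference, a minimal sketch of the classic American Soundex variant, plus one assumed way to use it for suggestions: index a word list by code and look up the query's code. The word list and the helper name `build_index` are hypothetical.

```python
import string
from collections import defaultdict

# Consonant groups of American Soundex; vowels, h, w and y have no code.
CODES = {}
for digit, group in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                     "4": "l", "5": "mn", "6": "r"}.items():
    for ch in group:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no ASCII letters)."""
    word = "".join(ch for ch in word.lower() if ch in string.ascii_lowercase)
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]

def build_index(words):
    """Group words by Soundex code."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index

# Hypothetical dictionary.
index = build_index(["receive", "robert", "apple"])
print(soundex("recieve"), index[soundex("recieve")])
```

    Words that share a code are only candidates; you would still rank them, for instance by edit distance or frequency.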


    Edit (2011-03-16):

    I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.

    You can find a Python implementation of this algorithm here, and more implementations on the same site here.

    Again, these algorithms won't be the same as what Google uses, but for English-language words they should get you very close. You can also check out the Wikipedia page for Phonetic Algorithms for a list of other similar algorithms.

  • 2021-01-31 10:39

    An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html

    In short, it is a trade-off between how much the query is modified (at the character or word level) and how much that modification increases coverage in the searched documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character, so it is obvious that you meant "apple".
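
    One way to make that trade-off concrete is to score each candidate by the log of its document count minus a penalty per edit. This is only a sketch: the scoring formula and the `EDIT_PENALTY` value are assumptions for illustration, not what the linked tutorial necessarily does. The document counts are the ones from the example above.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# Document counts per term, taken from the example in the text.
DOC_COUNTS = {"aple": 2_000_000, "apple": 60_000_000}

EDIT_PENALTY = 2.0  # assumed cost of one character edit, in log-count units

def score(query, candidate):
    """Higher is better: more matching documents, fewer edits to the query."""
    return math.log(DOC_COUNTS[candidate]) - EDIT_PENALTY * edit_distance(query, candidate)

def did_you_mean(query):
    """Return a better-scoring alternative, or None if the query is already best."""
    best = max(DOC_COUNTS, key=lambda c: score(query, c))
    return None if best == query else best

print(did_you_mean("aple"))
print(did_you_mean("apple"))
```

    Raising `EDIT_PENALTY` makes the corrector more conservative: a candidate then needs far more coverage to justify each changed character.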

  • 2021-01-31 10:48

    I think this can be done using a spellchecker along with N-grams.

    For a query such as "Trytoreconnectyou", we first check it against all 1-grams (all dictionary words) and find a closest match that is pretty terrible. So we try 2-grams (which can be built by removing the spaces from phrases of length 2), then 3-grams, and so on. When we try 4-grams, we find a phrase that is at distance 0 from our search term. Since we can't do better than that, we return that phrase as the suggestion.
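
    A small sketch of that search, under the assumption that the phrases come from word n-grams of some reference text. The `CORPUS` sentence and the `max_n` limit are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def word_ngrams(words, n):
    """All runs of n consecutive words."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def suggest(query, words, max_n=5):
    """Try 1-grams, 2-grams, ... with spaces removed; stop early on an exact match."""
    target = query.lower().replace(" ", "")
    best_dist, best_phrase = None, None
    for n in range(1, max_n + 1):
        for gram in word_ngrams(words, n):
            d = edit_distance(target, "".join(gram))
            if best_dist is None or d < best_dist:
                best_dist, best_phrase = d, " ".join(gram)
            if d == 0:
                return best_phrase  # cannot do better than distance 0
    return best_phrase

# Hypothetical reference text.
CORPUS = "we will try to reconnect you in a moment"
words = CORPUS.split()

print(suggest("Trytoreconnectyou", words))
```

    This compares the query against every n-gram, which is the inefficiency acknowledged in this answer; an index over the space-stripped phrases would avoid the full scan.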

    I know this is very inefficient, but Peter Norvig's post here clearly suggests that Google uses spelling correctors to generate its suggestions. Since Google has massive parallelization capabilities, it can accomplish this task very quickly.

  • 2021-01-31 10:48

    Take a look at this: How does the Google "Did you mean?" Algorithm work?
