Creating a “spell check” that checks against a database with a reasonable runtime

后端 未结 6 1675
Happy的楠姐
Happy的楠姐 2021-01-31 05:22

I\'m not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user

6条回答
  •  北恋
    北恋 (楼主)
    2021-01-31 05:48

    Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as psuedocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and you also get better locality of reference so have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to return all results that start with the provided string easily.

    The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.

提交回复
热议问题