How to implement autocomplete on a massive dataset

佛祖请我去吃肉 2021-01-29 21:54

I'm trying to implement something like Google Suggest on a website I am building and am curious how to go about doing it on a very large dataset. Sure if you've got 1000 items

7 answers
  • 2021-01-29 21:57

    Don't try to implement this yourself (unless you're just curious). Use something like Lucene or Endeca - it will save you time and hair.

  • 2021-01-29 21:58

    As I pointed out in "How to implement incremental search on a list", you should use a structure such as a trie or Patricia trie for searching patterns in large texts.
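For the plain prefix case, a trie lookup could look roughly like the sketch below. It is a minimal illustration, not a tuned implementation: the names `Trie`, `insert` and `complete` are made up for this example, matching is case-insensitive by assumption, and a Patricia trie would additionally merge single-child chains to save memory.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character to the child TrieNode
        self.is_end = False    # True if a stored string ends at this node


class Trie:
    """Illustrative prefix trie; names are assumptions for this sketch."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, text):
        node = self.root
        for ch in text.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored strings that start with `prefix`."""
        prefix = prefix.lower()
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_end:
                results.append(text)
            # Push children in reverse order so they are popped alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results
```

Walking down to the prefix node costs one step per typed character, regardless of how many strings are stored; only the collection of completions below that node depends on the data, which is why the `limit` cut-off matters on a large dataset.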

    And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.

    When I insert a new text into the trie, I insert it, then remove its first character and insert the remainder, then remove the next character and insert again, and so on until the whole text is consumed. You can then find every substring of every inserted text with a single search from the root. The resulting structure is a suffix trie; its compressed form is called a suffix tree, and there are a lot of optimizations available.

    And it is really incredibly fast. To find all texts that contain a given sequence of n characters, you have to inspect at most n nodes and search the list of children at each node. Depending on how the child collection is implemented (array, list, binary tree, skip list), you might be able to identify the required child node in as few as 5 search steps, assuming case-insensitive Latin letters only (a binary search over 26 letters). Interpolation search might be helpful for large alphabets and for nodes with many children, such as those usually found near the root.
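The insert-every-suffix idea can be sketched as follows. This is the naive, uncompressed variant and only meant to show the mechanics: the class name `SuffixTrie`, the dict-based nodes and the choice to store the set of matching texts on every node are assumptions of this example. Memory grows quadratically with text length, which is what the suffix tree optimizations address.

```python
class SuffixTrie:
    """Naive suffix trie: every suffix of every text is inserted from the root."""

    def __init__(self):
        self.root = self._new_node()

    @staticmethod
    def _new_node():
        # "children" maps a character to a child node;
        # "texts" holds every original text whose suffix passes through here.
        return {"children": {}, "texts": set()}

    def insert(self, text):
        key = text.lower()
        for start in range(len(key)):          # one pass per suffix
            node = self.root
            for ch in key[start:]:
                node = node["children"].setdefault(ch, self._new_node())
                node["texts"].add(text)

    def search(self, fragment):
        """Return all inserted texts that contain `fragment` anywhere."""
        node = self.root
        for ch in fragment.lower():
            node = node["children"].get(ch)
            if node is None:
                return set()
        return set(node["texts"])
```

A search touches one node per character of the fragment, so `search("ork")` inspects three nodes no matter how many texts were inserted. In this sketch an empty fragment returns an empty set, because nothing is stored on the root.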

  • 2021-01-29 22:01

    I've built AutoCompleteAPI for exactly this scenario.

    Sign up to get a private index, then upload your documents.

    Example upload using curl on document "New York":

    curl -X PUT -H "Content-Type: application/json" -H "Authorization: [YourSecretKey]" -d '{
    "key": "New York",
    "input": "New York"
    }' "http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]"
    

    After indexing all documents, to get autocomplete suggestions, use:

    http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]?prefix=new
    

    You can use any client autocomplete library to show these results to the user.

  • 2021-01-29 22:08

    I would use something along the lines of a trie, and have the value of each leaf node be a list of the possibilities that contain the word represented by the leaf node. You could sort them in order of likelihood, or dynamically sort/filter them based on other words the user has entered into the search box, etc. It will execute very quickly and in a reasonable amount of RAM.
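One way to read this suggestion is sketched below: the trie is keyed by individual words, and the node where a word ends holds the phrases containing that word together with a score. The names `WordIndex`, `add` and `suggest`, and the numeric `score` standing in for "likelihood", are assumptions of this example rather than anything prescribed above.

```python
class Node:
    def __init__(self):
        self.children = {}   # maps a character to the child Node
        self.phrases = []    # (score, phrase) pairs for the word ending here


class WordIndex:
    """Illustrative word-keyed trie with ranked phrase lists."""

    def __init__(self):
        self.root = Node()

    def add(self, phrase, score):
        # Index the phrase once under each distinct word it contains.
        for word in set(phrase.lower().split()):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, Node())
            node.phrases.append((score, phrase))

    def suggest(self, word_prefix, limit=5):
        node = self.root
        for ch in word_prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        # Gather phrases from every word that starts with the prefix,
        # keeping the best score seen for each phrase.
        best = {}
        stack = [node]
        while stack:
            current = stack.pop()
            for score, phrase in current.phrases:
                best[phrase] = max(score, best.get(phrase, score))
            stack.extend(current.children.values())
        # Highest score first; ties broken alphabetically.
        ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
        return [phrase for phrase, _ in ranked[:limit]]
```

Filtering by the other words already typed could be layered on top by intersecting the phrase sets returned for each word; that part is left out here.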

  • 2021-01-29 22:11

    You keep the items on the server side (perhaps in a DB, if the dataset is really large and complex) and you send AJAX calls from the client's browser that return the results as JSON or XML. You can do this in response to the user typing, or with a timer.
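A rough sketch of the server half, using only the Python standard library, might look like this. Everything specific here is an assumption for illustration: the in-memory `ITEMS` list stands in for the database, the `prefix` query parameter and the port are arbitrary, and the browser-side AJAX call and timer are not shown.

```python
import bisect
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical data; in practice this would come from a database or index.
ITEMS = sorted(["new jersey", "new orleans", "new york", "newark", "boston"])


def suggest(prefix, limit=10):
    """Return up to `limit` items starting with `prefix`, found by binary search."""
    prefix = prefix.lower()
    start = bisect.bisect_left(ITEMS, prefix)   # first item >= prefix
    matches = []
    for item in ITEMS[start:start + limit]:
        if not item.startswith(prefix):
            break                                # sorted, so no later match exists
        matches.append(item)
    return matches


class SuggestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read ?prefix=... from the request URL and answer with a JSON array.
        query = parse_qs(urlparse(self.path).query)
        prefix = query.get("prefix", [""])[0]
        body = json.dumps(suggest(prefix)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the server; call this explicitly, it blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), SuggestHandler).serve_forever()
```

Capping the response with `limit` keeps each reply small, which matters more than raw lookup speed when a request is fired on nearly every keystroke.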

  • 2021-01-29 22:13

    If you don't want a trie and you want stuff from the middle of the string, you generally want to run some sort of edit distance function (Levenshtein distance), which gives you a number indicating how well two strings match up. It's not a particularly efficient algorithm, since its running time grows with the product of the two string lengths, but that doesn't matter too much for things like words, as they're relatively short. If you're running comparisons on, say, 8000-character strings, it'll probably take a few seconds. Most languages have an implementation, or you can find code or pseudocode for it pretty easily on the internet.
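For reference, the standard dynamic-programming form of Levenshtein distance is short enough to write out. The helper `closest`, which ranks candidates by distance to the query, is a hypothetical addition for this example and simply scans every candidate, so it only suits small candidate lists.

```python
def levenshtein(a, b):
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def closest(query, candidates, limit=3):
    """Return the `limit` candidates with the smallest distance to `query`."""
    return sorted(
        candidates,
        key=lambda c: levenshtein(query.lower(), c.lower()),
    )[:limit]
```

For two strings of lengths m and n the loop fills m × n cells, which is where the cost for very long strings comes from.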
