What is a fast and efficient way to implement the server-side component for an autocomplete feature in an html input box?
I am writing a service to autocomplete user input as it is typed.
I ended up resolving this one via Lucene; the initial performance tests seem sufficient for our use case. A little hacking was necessary to make the prefix queries work, as I was running into the TooManyClauses exception when expanding queries such as "Jeff At*". I ended up wrapping my IndexReader with a FilterIndexReader, and set a hard cap on the number of terms returned on a prefix term call. Here's my code:
import java.io.IOException;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory directory = FSDirectory.getDirectory(indexDir);
IndexReader reader = IndexReader.open(directory);

// Wrap the reader so term enumeration stops just short of the
// BooleanQuery clause limit, avoiding TooManyClauses on broad prefixes.
FilterIndexReader filteredReader = new FilterIndexReader(reader) {
    @Override
    public TermEnum terms(Term t) throws IOException {
        final TermEnum origEnum = super.terms(t);
        return new TermEnum() {
            private int count = 0;

            @Override
            public boolean next() throws IOException {
                // Leave a little headroom below the max clause count.
                if (count++ < (BooleanQuery.getMaxClauseCount() - 10))
                    return origEnum.next();
                return false;
            }

            @Override
            public Term term() {
                return origEnum.term();
            }

            @Override
            public int docFreq() {
                return origEnum.docFreq();
            }

            @Override
            public void close() throws IOException {
                origEnum.close();
            }
        };
    }
};

IndexSearcher searcher = new IndexSearcher(filteredReader);
I've done this for small data sets using a ternary search tree. The DDJ code is not too difficult to convert to Java, but it assumes the entire data set will fit into memory. There are on-disk implementations of ternary search trees (here is one in Python), but of course they are going to be less performant. Since ternary search trees excel at partial matches, though, the performance might be suitable for your needs.
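The DDJ article's C code translates fairly directly; here is a minimal in-memory sketch of the idea (class and method names are mine, not from the article):

```java
import java.util.ArrayList;
import java.util.List;

// Ternary search tree: each node holds one character, with left/right
// subtrees for smaller/larger characters and a middle subtree for the
// next character of the same string (the Bentley/Sedgewick design).
class TernarySearchTree {
    private static class Node {
        char c;
        Node left, mid, right;
        boolean wordEnd;
        Node(char c) { this.c = c; }
    }

    private Node root;

    void insert(String s) {
        if (s == null || s.isEmpty()) return;
        root = insert(root, s, 0);
    }

    private Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) n = new Node(c);
        if (c < n.c) n.left = insert(n.left, s, i);
        else if (c > n.c) n.right = insert(n.right, s, i);
        else if (i < s.length() - 1) n.mid = insert(n.mid, s, i + 1);
        else n.wordEnd = true;
        return n;
    }

    // All stored strings starting with the given prefix, in sorted order.
    List<String> prefixMatch(String prefix) {
        List<String> out = new ArrayList<>();
        Node n = find(root, prefix, 0);
        if (n == null) return out;
        if (n.wordEnd) out.add(prefix);
        collect(n.mid, new StringBuilder(prefix), out);
        return out;
    }

    private Node find(Node n, String s, int i) {
        if (n == null) return null;
        char c = s.charAt(i);
        if (c < n.c) return find(n.left, s, i);
        if (c > n.c) return find(n.right, s, i);
        if (i == s.length() - 1) return n;
        return find(n.mid, s, i + 1);
    }

    private void collect(Node n, StringBuilder prefix, List<String> out) {
        if (n == null) return;
        collect(n.left, prefix, out);
        prefix.append(n.c);
        if (n.wordEnd) out.add(prefix.toString());
        collect(n.mid, prefix, out);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.right, prefix, out);
    }
}
```

Lookup cost is proportional to the prefix length plus the number of matches, which is why the structure suits autocomplete so well.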
I used a hashtable and mmap(), and a term list of 10,000,000+ records isn't a problem. See the demo here: http://olegh.ath.cx/autocomplete.html
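The approach scales because the OS pages the file in on demand. A rough Java sketch of the same idea, binary-searching a memory-mapped file of sorted fixed-width records (the record layout here is an assumption for illustration, not the demo's actual format):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Prefix lookup over an mmap'ed file of sorted, fixed-width,
// NUL-padded terms. The OS page cache does the heavy lifting.
class MmapPrefixIndex {
    static final int RECORD = 32;  // bytes per record (assumed layout)

    private final MappedByteBuffer buf;
    private final int count;

    MmapPrefixIndex(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            count = (int) (ch.size() / RECORD);
        }
    }

    private String termAt(int i) {
        byte[] b = new byte[RECORD];
        for (int j = 0; j < RECORD; j++)
            b[j] = buf.get(i * RECORD + j);  // absolute gets: no shared state
        int len = 0;
        while (len < RECORD && b[len] != 0) len++;
        return new String(b, 0, len, StandardCharsets.US_ASCII);
    }

    // Index of the first record whose term is >= prefix.
    private int lowerBound(String prefix) {
        int lo = 0, hi = count;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (termAt(mid).compareTo(prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    List<String> complete(String prefix, int max) {
        List<String> out = new ArrayList<>();
        for (int i = lowerBound(prefix); i < count && out.size() < max; i++) {
            String t = termAt(i);
            if (!t.startsWith(prefix)) break;
            out.add(t);
        }
        return out;
    }
}
```

Each lookup touches O(log n) pages for the binary search; hot pages stay cached, so repeated prefixes are effectively in-memory lookups.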
Are there possible solutions that will let me scale better?
Yes: Oracle. This is the kind of thing databases are built for. Just index the relevant columns. If you are running up against the wall with in-memory solutions, then the trade-off with disk seek time or network latency is probably moot, especially if you insert a caching layer in between.
Also, you may be able to decrease the number of hits if you tweak your client-side code a little, such as setting a minimum number of typed characters before a query is run, or adding a fraction-of-a-second delay after the user stops typing. If you are already using those, set them a bit higher.
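The reason indexing the column is enough: a B-tree index serves a prefix query (WHERE term LIKE 'jeff at%') as a cheap range scan, not a full scan. The same trick in miniature with a sorted set, purely to illustrate the range-scan idea (a sketch, not database code):

```java
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// A sorted structure stands in for a B-tree index on the column.
class IndexedColumn {
    private final NavigableSet<String> index = new TreeSet<>();

    void add(String term) { index.add(term); }

    // Equivalent of: SELECT term FROM t WHERE term LIKE 'prefix%'
    // -- a range scan from the prefix up to a string greater than
    // every possible match (prefix followed by the max char).
    SortedSet<String> like(String prefix) {
        return index.subSet(prefix, prefix + Character.MAX_VALUE);
    }
}
```

Note the pattern must be left-anchored (no leading wildcard) for the index to be usable, which autocomplete prefixes always are.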
Maybe I misunderstood your question, but couldn't you use a jQuery plugin to Ajax the info to your app?
I have used this one before:
Ajax Auto Suggest v2
If you can't physically load all the data into RAM then you're going to have to deal with having some of it on disk.
What DB are you using?
For example Oracle has an option where you can keep the entire table in memory, and perform your queries against that.
MySQL also claims to have some in-memory capabilities, but I don't know much about MySQL.
You can then do away with your java based cache, or you could use the cache for the most popular/recent searches.
Obviously, once you run out of RAM, some of the data will be on disk when you query for it, but depending on the load on the system this will only be an issue for the first keypress, not subsequent ones, since the row will be in memory after that.
If the disk seek is slowing you down, then you could investigate using SSD drives to speed up your reads.