I'll answer a bit based on how we chose to implement Lucene.Net here on Stack Overflow, and some lessons I learned along the way:
Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?
- It depends on your goals here. We had a severely under-utilized web tier (~10% CPU) and an overloaded database doing full-text searching (around 60% CPU; we wanted it lower). Loading the same index on each web tier machine let us utilize those machines and gave us a ton of redundancy: we can still lose 9 out of 10 web servers and keep the Stack Exchange network up if need be. There is a downside to this: it's very IO (read) intensive for us, and the web tier was not bought with this in mind (this is often the case at most companies). While it works fine, we'll still be upgrading our web tier to SSDs and implementing some other bits left out of the .Net port to compensate for this hardware deficiency (`NIOFSDirectory`, for example).
- The other downside is that we index all of our databases *n* times for the web tier, but luckily we're not starved for network bandwidth, and SQL Server caching the results makes this a very fast delta-indexing operation each time. With a large number of web servers, that cost alone may eliminate this option.
If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?
- You can query it on a file share either way; just make sure only one server is indexing at a time (the `write.lock` directory locking mechanism will ensure this and error when you try to open multiple `IndexWriter`s at once).
- Keep in mind my notes above: this is IO intensive when a lot of readers are flying around, so you need ample bandwidth to your store. Short of at least iSCSI or a fiber SAN, I'd be cautious of this approach for high-traffic use (hundreds of thousands of searches a day).
- Another consideration is how you update/alert your web servers (or whatever tier is querying it). When you finish an indexing pass, you'll need to re-open your `IndexReader`s to get the updated index with the new documents. We use a redis messaging channel to alert whoever cares that the index has updated...any messaging mechanism would work here.
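The reader-refresh step above can be sketched roughly as follows. This is a hedged illustration against the Lucene.Net 3.x API, assuming a single shared `IndexSearcher` field; how the "index updated" message arrives (Redis pub/sub in our case) is whatever your messaging layer provides, so it's shown as a plain method here:

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public class SearcherHolder
{
    private volatile IndexSearcher _searcher;

    public SearcherHolder(IndexReader reader)
    {
        _searcher = new IndexSearcher(reader);
    }

    public IndexSearcher Current { get { return _searcher; } }

    // Call this when your messaging channel says the index was updated.
    public void OnIndexUpdated()
    {
        // Reopen() returns a *new* reader only if the index actually changed;
        // otherwise it returns the same instance and there's nothing to do.
        var oldReader = _searcher.IndexReader;
        var newReader = oldReader.Reopen();
        if (newReader != oldReader)
        {
            var oldSearcher = _searcher;
            _searcher = new IndexSearcher(newReader);
            // In production you'd want reference counting here, since in-flight
            // queries may still be using the old reader when you dispose it.
            oldSearcher.Dispose();
            oldReader.Dispose();
        }
    }
}
```

The disposal caveat in the comment is the real gotcha: swap the searcher first, then tear down the old one only once nothing is still querying it.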
Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?
- Unfortunately there are none that I know of, but I can share with you how I approached this.
- When indexing a specific table (akin to a document in Lucene), we added a rowversion column to that table. When we index, we select based off the last rowversion (a `timestamp` datatype, pulled back as a `bigint`). I chose to store the last index date and last indexed rowversion on the file system via a simple .txt file for one reason: everything else in Lucene is stored there. This means if there's ever a large problem, you can just delete the folder containing the index, and the next indexing pass will recover with a fully up-to-date index; just add some code to handle nothing being there as meaning "index everything".
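Here's a rough sketch of that delta pass. Treat it as illustrative only: the table (`Posts`), column names, marker file name, and `MakeDoc` helper are all placeholders, not our actual schema or code:

```csharp
using System;
using System.Data.SqlClient;
using System.IO;
using Lucene.Net.Index;

// Read the last-indexed rowversion from the marker file beside the index;
// absence means "index everything" (e.g. after the index folder was deleted).
long lastVersion = 0;
string markerPath = Path.Combine(indexPath, "lastindex.txt");
if (File.Exists(markerPath))
    lastVersion = long.Parse(File.ReadAllText(markerPath));

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    @"SELECT Id, Title, Body, CAST(RowVersion AS bigint) AS Version
        FROM Posts
       WHERE CAST(RowVersion AS bigint) > @last", conn))
{
    cmd.Parameters.AddWithValue("@last", lastVersion);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // UpdateDocument is delete-then-add keyed on the Id term, so
            // changed rows replace their old Lucene documents.
            writer.UpdateDocument(
                new Term("Id", reader.GetInt32(0).ToString()),
                MakeDoc(reader)); // hypothetical: builds the Lucene Document
            lastVersion = Math.Max(lastVersion, reader.GetInt64(3));
        }
    }
}
writer.Commit();
// Persist the high-water mark only after a successful commit.
File.WriteAllText(markerPath, lastVersion.ToString());
```

Writing the marker file after `Commit()` matters: if the pass dies partway through, the next run re-selects the same rows and `UpdateDocument` makes the re-index idempotent.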
When I query the index, should I be looking to just pull back a bunch of record id's which I then go back to the DB for the actual record, or should I be aiming to pull everything I need for the search straight out of the index?
- This really depends on your data. For us, it's not really feasible to store everything in the index (nor is it recommended). What I suggest is storing the fields for your search results in the index, and by that I mean whatever you need to present a search result in a list, before the user clicks through to the full [insert type here].
- Another consideration is how often your data is changing. If a lot of fields you're *not* searching on change rapidly, you'll need to re-index those rows (documents) to keep the stored copies current, not only when a field you're searching on changes.
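The store-only-what-you-display idea maps directly onto Lucene's per-field store/index flags. A hedged sketch (field names are made up for illustration):

```csharp
using Lucene.Net.Documents;

var doc = new Document();

// The key: stored and indexed verbatim so it can be returned in results
// and used as the Term for UpdateDocument.
doc.Add(new Field("Id", id.ToString(),
    Field.Store.YES, Field.Index.NOT_ANALYZED));

// Shown in the result list AND searchable: stored + analyzed.
doc.Add(new Field("Title", title,
    Field.Store.YES, Field.Index.ANALYZED));

// Shown in the result list but never searched: stored, not indexed.
doc.Add(new Field("Excerpt", excerpt,
    Field.Store.YES, Field.Index.NO));

// Searched but too big to store: analyzed, not stored; the full body
// comes from the database once the user clicks through.
doc.Add(new Field("Body", body,
    Field.Store.NO, Field.Index.ANALYZED));
```

`Store.NO` on the large fields is what keeps the index small enough to replicate to every web server; you only pay storage for what a result listing actually renders.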
Is there value in trying to implement something like Solr in this flavour of environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.
- Sure there is: it's the centralized search you're talking about (though with a high number of searches you may again hit a limit with a VM setup, so keep an eye on this). We didn't do this because it introduced a lot of (we feel) unwarranted complexity in our technology stack and build process, but for a larger number of web servers it makes much more sense.
- What does it buy you? Performance, mainly, and dedicated indexing server(s). Instead of *n* servers crawling a network share (and competing for IO), they can hit a single server that only deals with requests and results over the network, not crawling the index, which is a lot more data going back and forth...that all stays local on the Solr server(s). Also, you're not hitting your SQL server as much, since fewer servers are indexing.
- What it doesn't buy you is as much redundancy, but it's up to you how important that is. If you can operate fine with degraded search or without it, simply have your app handle that. If you can't, then a backup Solr server (or more) may also be a valid solution...though it's possibly another software stack to maintain.