Are there any links or resources with performance benchmarks for Lucene/Solr on large datasets? I'm interested in datasets in the range of 500GB to 5TB.
Thanks
Lucene committer Mike McCandless runs benchmarks on a regular basis to track performance improvements and regressions. They are run against Wikipedia exports, which might be somewhat smaller than what you are looking for.
That said, performance depends less on the raw input size than on the number of documents and unique terms. If you already have data similar to what you will need to index, I would recommend checking out Mike's test tool, adapting it to your needs, and running it with your own dataset and hardware to find out what kind of performance numbers you can expect.
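Before setting up a full benchmark, it can help to get a rough idea of those two numbers for a sample of your data. Here is a minimal sketch in Python, not part of Mike's tooling: the `corpus_stats` helper is a hypothetical name, and the regex tokenizer is only a crude stand-in for whatever Lucene analyzer you would actually use, so treat the unique-term count as an approximation.

```python
import re
from typing import Dict, Iterable

# Assumption: a naive word tokenizer; a real Lucene analyzer will
# split and normalize text differently, so counts are approximate.
TOKEN_RE = re.compile(r"\w+")


def corpus_stats(docs: Iterable[str]) -> Dict[str, int]:
    """Count documents, total tokens, and unique terms in a sample."""
    doc_count = 0
    total_tokens = 0
    unique_terms = set()
    for text in docs:
        doc_count += 1
        tokens = TOKEN_RE.findall(text.lower())
        total_tokens += len(tokens)
        unique_terms.update(tokens)
    return {
        "documents": doc_count,
        "tokens": total_tokens,
        "unique_terms": len(unique_terms),
    }


if __name__ == "__main__":
    # Hypothetical sample; replace with an iterator over your own documents.
    sample = ["Lucene indexes text", "Solr builds on Lucene"]
    print(corpus_stats(sample))
```

Running this on a representative sample and comparing the counts with those of a Wikipedia export should give you a sense of how far the published numbers are from your use case. Note that the set of unique terms is held in memory, so for very large samples you may need to work on a subset.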