I am a newbie in Lucene, and I'm having some problems creating simple code to query a collection of text files.
I tried this example, but it is incompatible
Lucene is quite a big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
Whatever you are going to do in Lucene - indexing or searching - you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (reduce each word to its base form) your input text. It also throws out the most frequent words (stop words) such as "a" and "the". You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, use this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
If you are going to index texts in different languages and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.
You need to store your index somewhere. There are two main options: an in-memory index, which is easy to try out, and a disk index, which is the most widely used.
Use either of the following two lines:
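To make the analyzer's job more concrete, here is a toy sketch in plain Java of the three steps described above: tokenizing, dropping stop words, and stemming. This is not Lucene code and not how SnowballAnalyzer is implemented; the class name ToyAnalyzer, the tiny stop-word list, and the "strip a trailing s" rule are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration of what an analyzer does; NOT part of Lucene.
final class ToyAnalyzer {
    // Hypothetical, deliberately tiny stop-word list for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "and", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Tokenize: lowercase, then split on anything that is not a letter or digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // skip empty pieces and stop words
            }
            tokens.add(stem(word));
        }
        return tokens;
    }

    // Crude stand-in for stemming: strip a trailing "s" from longer words.
    // Real stemmers (such as Snowball) apply far more elaborate rules.
    static String stem(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The cats and a dog")); // prints [cat, dog]
    }
}
```

The point is only that the text you index and the text you search for both pass through the same kind of pipeline, so "cats" in a document and "cat" in a query end up as the same term.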
Directory directory = new RAMDirectory(); // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index")); // disk index storage
When you want to add, update or delete a document, you need an IndexWriter:
IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));
Any document (a text file in your case) is a set of fields. To create a document that will hold information about your file, use this:
Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED)); // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc); // writing new document to the index
The Field constructor takes the field's name, its text, and at least two more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will be able to get all your text back from the index; otherwise only index information about it will be stored.
The second parameter shows whether Lucene must index this field or not. Use Field.Index.ANALYZED for any field you are going to search on.
Normally, you use both parameters as shown above.
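The nameOfYourFile and contentsOfYourFile placeholders above have to come from somewhere. As a minimal sketch, assuming Java 7 or later, UTF-8 encoded files, and files small enough to fit in memory, you could fill them in with plain JDK calls. The class and method names (TextFileReader, titleOf, contentsOf) are hypothetical helpers, not part of Lucene.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for turning a text file into "title" and "content" strings.
final class TextFileReader {
    // Returns the bare file name, suitable for the "title" field.
    static String titleOf(Path file) {
        return file.getFileName().toString();
    }

    // Reads the whole file into a String; assumes UTF-8 and a file that fits in memory.
    static String contentsOf(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temporary file so the example runs anywhere.
        Path file = Files.createTempFile("lucene-demo", ".txt");
        Files.write(file, "hello lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(titleOf(file) + " -> " + contentsOf(file));
        Files.delete(file);
    }
}
```

You would then pass the two returned strings to the Field constructors in place of the placeholders.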
Don't forget to close your IndexWriter after the job is done:
writer.close();
Searching is a bit tricky. You will need several classes: Query and QueryParser to make a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store results (it is passed to IndexSearcher as a parameter), and ScoreDoc to iterate through the results. The next snippet shows how these fit together:
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// hits[i].doc is Lucene's internal document number. Note that this number may change after documents are deleted
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc); // getting actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}
Note the second argument to the QueryParser constructor - it is the default field, i.e. the field that will be searched if no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the field "title" of all docs, but if your query is just "term", it will search in the default field, in this case "content". For more info see Lucene Query Syntax.
QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.
The last thing you must know is the first parameter of TopScoreDocCollector.create. It is just a number that represents how many results you want to collect. For example, if it is equal to 100, Lucene will collect only the first 100 results (by score) and drop the rest. This is just an optimization - you collect the best results, and if you're not satisfied with them, you repeat the search with a larger number.
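If the "collect only the best N" idea is unclear, the following plain-Java sketch shows one common way to do it with a bounded min-heap. It is only an illustration of the idea, not Lucene's actual collector code; the names TopNSketch, Hit and topN are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustration of "keep only the n best-scoring hits"; NOT Lucene's implementation.
final class TopNSketch {
    // Hypothetical stand-in for a search hit: a document number and its score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Returns at most n hits with the highest scores, best first.
    static List<Hit> topN(List<Hit> allHits, int n) {
        Comparator<Hit> byScore = new Comparator<Hit>() {
            public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
        };
        // Min-heap: the weakest of the kept hits sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<Hit>(Math.max(1, n), byScore);
        for (Hit hit : allHits) {
            heap.offer(hit);
            if (heap.size() > n) {
                heap.poll(); // drop the current lowest score
            }
        }
        List<Hit> best = new ArrayList<Hit>(heap);
        Collections.sort(best, Collections.reverseOrder(byScore));
        return best;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit(0, 0.2f));
        hits.add(new Hit(1, 0.9f));
        hits.add(new Hit(2, 0.5f));
        for (Hit h : topN(hits, 2)) {
            System.out.println(h.doc + " " + h.score); // doc 1 first, then doc 2
        }
    }
}
```

Memory use stays proportional to n rather than to the total number of matches, which is why asking for fewer results is cheaper.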
Finally, don't forget to close the searcher and the directory so that you don't leak system resources:
searcher.close();
directory.close();
EDIT: Also see the IndexFiles demo class in the Lucene 3.0 sources.