How do I index and search text files in Lucene 3.0.2?

时光说笑 2020-12-25 09:31

I am a newbie in Lucene, and I'm having some problems creating simple code to query a collection of text files.

I tried this example, but it is incompatible with Lucene 3.0.2.

3 Answers
  • 2020-12-25 10:00
    package org.test;
    
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    
    
    import org.apache.lucene.queryParser.*;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopScoreDocCollector;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    
    public class LuceneSimple {
    
     private static void addDoc(IndexWriter w, String value) throws IOException {
      Document doc = new Document();
      doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
      w.addDocument(doc);
     }
    
    
    
     public static void main(String[] args) throws CorruptIndexException, LockObtainFailedException, IOException, ParseException {
    
         File dir = new File("F:/tmp/dir");
    
      StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    
      Directory index = new RAMDirectory();
      //Directory index = FSDirectory.open(new File("lucDirHello") );
    
    
      IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    
      w.setRAMBufferSizeMB(200);
    
      System.out.println(index.getClass() + " RamBuff:" + w.getRAMBufferSizeMB() );
    
      addDoc(w, "Lucene in Action");
         addDoc(w, "Lucene for Dummies");
         addDoc(w, "Managing Gigabytes");
         addDoc(w, "The Art of Computer Science");
         addDoc(w, "Computer Science ! what is that ?");
    
    
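         // Index every line of at least 140 characters from each file in dir
         long N = 0L;  // number of lines indexed from the files
    
         File[] files = dir.listFiles();  // null if dir does not exist or is not a directory
         if( files == null ) files = new File[0];
         for( File f : files ){
          if( !f.isFile() ) continue;  // skip subdirectories
          BufferedReader br = new BufferedReader( new FileReader(f) );
          try {
           String line = null;
           while( ( line = br.readLine() ) != null ){
            if( line.length() < 140 ) continue;  // skip short lines
            addDoc(w, line);
            ++N;
           }
          } finally {
           br.close();  // release the file handle even if indexing fails
          }
         }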
    
         w.close();
    
         // 2. query
         String querystr = "Computer";
    
         Query q = new QueryParser( Version.LUCENE_30, "title", analyzer ).parse(querystr);
    
    
         //search
         int hitsPerPage = 10;
    
         IndexSearcher searcher = new IndexSearcher(index, true);
    
         TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    
         searcher.search(q, collector);
    
         ScoreDoc[] hits = collector.topDocs().scoreDocs;
    
         System.out.println("Found " + hits.length + " hits.");
         for(int i=0;i<hits.length;++i) {
           int docId = hits[i].doc;
           Document d = searcher.doc(docId);
           System.out.println((i + 1) + ". " + d.get("title"));
         }
    
    
         searcher.close();
    
     }
    
    }
    
  • 2020-12-25 10:10

    I suggest you look into Solr (http://lucene.apache.org/solr/) rather than working with the Lucene API directly.

  • 2020-12-25 10:11

    Lucene is quite a big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
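
    As a minimal sketch of the "read a text file into memory" step, assuming the files are UTF-8 encoded (the class and method names here are hypothetical, and this is plain Java rather than part of Lucene):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TextFileReader {
    // Reads the whole file into one String, decoding it as UTF-8 (an assumption).
    static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, read it back, then clean up.
        Path tmp = Files.createTempFile("lucene-demo", ".txt");
        Files.write(tmp, "Lucene in Action".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp));
        Files.delete(tmp);
    }
}
```

    The returned String is the kind of value you would pass as the content of a field. For very large files, reading line by line may be a better fit.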

    Whatever you are going to do in Lucene, indexing or searching, you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (reduce each word to its base form) your input text. It also throws out the most frequent words like "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
    To create an instance of SnowballAnalyzer for English, use this:

    Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
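
    To make the three analyzer steps concrete, here is a toy sketch in plain Java. It is not how Lucene or Snowball work internally: the class name, the tiny stop-word list and the crude suffix stripping are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ToyAnalyzer {
    // Tiny illustrative stop-word list; real analyzers ship much longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "is", "in", "for"));

    // Crude suffix stripping that stands in for a real stemmer.
    static String stem(String word) {
        if (word.length() > 4 && word.endsWith("ing")) return word.substring(0, word.length() - 3);
        if (word.length() > 3 && word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    // Lowercases, splits on non-alphanumeric characters, drops stop words, stems the rest.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) continue;
            tokens.add(stem(raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Computer Science"));
        System.out.println(analyze("Managing Gigabytes"));
    }
}
```

    Because the same transformation is applied to both documents and queries, a query for "gigabyte" can match a document containing "Gigabytes".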
    

    If you are going to index texts in different languages and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.

    You need to store your index somewhere. There are two major options: an in-memory index, which is easy to try out, and a disk index, which is the most widespread one.
    Use either of the following two lines:

    Directory directory = new RAMDirectory();   // RAM index storage
    Directory directory = FSDirectory.open(new File("/path/to/index"));  // disk index storage
    

    When you want to add, update or delete a document, you need an IndexWriter:

    IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));  // true = create a new index, replacing any existing one
    

    Any document (a text file in your case) is a set of fields. To create a document that will hold information about your file, use this:

    Document doc = new Document();
    String title = nameOfYourFile;
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // adding title field
    String content = contentsOfYourFile;
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
    writer.addDocument(doc);  // writing new document to the index
    

    The Field constructor takes the field's name, its text and at least two more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will be able to get all your text back from the index; otherwise only index information about it will be stored.
    The second parameter shows whether Lucene must index this field or not. Use Field.Index.ANALYZED for any field you are going to search on.
    Normally, you use both parameters as shown above.
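
    The difference between storing and indexing a field can be illustrated with a toy inverted index in plain Java. This is only a sketch of the idea, not Lucene's data structures; the class and its methods are hypothetical, and the `store` flag merely plays the role of Field.Store.YES versus Field.Store.NO.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class ToyIndex {
    // "Indexed" side: term -> ids of the documents containing it (an inverted index).
    private final Map<String, SortedSet<Integer>> postings = new HashMap<String, SortedSet<Integer>>();
    // "Stored" side: document id -> original text, kept only when requested.
    private final Map<Integer, String> stored = new HashMap<Integer, String>();
    private int nextId = 0;

    // Adds a document and returns its id; the text is always indexed, but stored only if asked.
    int add(String text, boolean store) {
        int id = nextId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) continue;
            SortedSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(id);
        }
        if (store) stored.put(id, text);
        return id;
    }

    // Returns the ids of documents containing the term, in ascending order.
    List<Integer> search(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    // Returns the original text, or null if the document was indexed without storing.
    String storedText(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int kept = index.add("Lucene in Action", true);        // searchable and retrievable
        int dropped = index.add("Lucene for Dummies", false);  // searchable only
        System.out.println(index.search("lucene"));
        System.out.println(index.storedText(kept));
        System.out.println(index.storedText(dropped));
    }
}
```

    Both documents are found by a search, but only the stored one can give its text back, which is why a field you want to display in results is usually stored as well as indexed.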

    Don't forget to close your IndexWriter after the job is done:

    writer.close();
    

    Searching is a bit tricky. You will need several classes: Query and QueryParser to build a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store results (it is passed to IndexSearcher as a parameter) and ScoreDoc to iterate through the results. The next snippet shows how these fit together:

    IndexSearcher searcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
    Query query = parser.parse("terms to search");
    TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
    searcher.search(query, collector);
    
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    // hits[i].doc is Lucene's internal document number. Note that this number may change after document deletion
    for (int i = 0; i < hits.length; i++) {
        Document hitDoc = searcher.doc(hits[i].doc);  // getting actual document
        System.out.println("Title: " + hitDoc.get("title"));
        System.out.println("Content: " + hitDoc.get("content"));
        System.out.println();
    }
    

    Note the second argument to the QueryParser constructor: it is the default field, i.e. the field that will be searched if no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the field "title" of all docs, but if your query is just "term", it will search in the default field, in this case "content". For more info see Lucene Query Syntax.
    QueryParser also takes the analyzer as its last argument. This must be the same analyzer you used to index your text.
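
    The default-field rule can be sketched in a few lines of plain Java. This toy handles only a single "field:term" or bare "term" and is in no way a substitute for QueryParser; the class and method names are hypothetical.

```java
class ToyQueryField {
    // Splits "field:term" into {field, term}; uses defaultField when no qualifier is given.
    static String[] resolve(String query, String defaultField) {
        int colon = query.indexOf(':');
        if (colon > 0 && colon < query.length() - 1) {
            return new String[] { query.substring(0, colon), query.substring(colon + 1) };
        }
        return new String[] { defaultField, query };
    }

    public static void main(String[] args) {
        String[] qualified = resolve("title:term", "content");
        String[] bare = resolve("term", "content");
        System.out.println(qualified[0] + " / " + qualified[1]);
        System.out.println(bare[0] + " / " + bare[1]);
    }
}
```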

    The last thing you must know is the first parameter of TopScoreDocCollector.create. It is just a number that represents how many results you want to collect. For example, if it equals 100, Lucene will collect only the top 100 results (by score) and drop the rest. This is purely an optimization: you collect the best results, and if you're not satisfied with them, you repeat the search with a larger number.
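
    The idea of keeping only the best N results can be sketched with a bounded min-heap in plain Java. This is an illustration of the general technique under assumed names, not a description of what TopScoreDocCollector does internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ToyTopN {
    // Returns the n highest scores in descending order, holding at most n values at a time.
    static List<Float> topN(float[] scores, int n) {
        List<Float> best = new ArrayList<Float>();
        if (n <= 0) return best;
        // Min-heap: the weakest hit kept so far sits on top and is evicted first.
        PriorityQueue<Float> heap = new PriorityQueue<Float>(n);
        for (float score : scores) {
            if (heap.size() < n) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();
                heap.add(score);
            }
        }
        best.addAll(heap);
        Collections.sort(best, Collections.reverseOrder());
        return best;
    }

    public static void main(String[] args) {
        float[] scores = { 0.2f, 0.9f, 0.5f, 0.7f };
        System.out.println(topN(scores, 2));
    }
}
```

    Memory use depends on the requested N rather than on the number of matching documents, which is why asking for a small N first and retrying with a larger one is cheap.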

    Finally, don't forget to close the searcher and the directory so that you don't leak system resources:

    searcher.close();
    directory.close();
    

    EDIT: Also see the IndexFiles demo class in the Lucene 3.0 sources.
