Is it possible to iterate through documents stored in a Lucene index?

栀梦 2020-11-29 05:29

I have some documents stored in a Lucene index, each with a docId field. I want to get all the docIds stored in the index. There is one complication: the number of documents is about 300 00

5 Answers
  • 2020-11-29 05:52

    There is a query class named MatchAllDocsQuery; I think it can be used in this case:

    // matches every (non-deleted) document in the index
    Query query = new MatchAllDocsQuery();
    TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);
    
  • 2020-11-29 05:55

    Document numbers (or ids) are consecutive integers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are valid only for the currently open IndexReader. You can check whether a document has been deleted with the IndexReader.isDeleted(int documentNumber) method (Lucene 3.x and earlier).

  • 2020-11-29 05:56
    IndexReader reader = // create IndexReader
    for (int i=0; i<reader.maxDoc(); i++) {
        if (reader.isDeleted(i))
            continue;
    
        Document doc = reader.document(i);
        String docId = doc.get("docId");
    
        // do something with docId here...
    }
    
  • 2020-11-29 06:02

    If you use .document(i) as in the examples above and skip over deleted documents, be careful when using this approach to paginate results. For example, with a list of 10 documents per page, a request for page 6 might arrive as offset=60, count=10 (documents 60 to 69).

    IndexReader reader = // create IndexReader
    for (int i=offset; i<offset + 10; i++) {
        if (reader.isDeleted(i))
            continue;
    
        Document doc = reader.document(i);
        String docId = doc.get("docId");
    }
    

    This goes wrong when documents have been deleted: you should start not from offset=60 but from 60 plus the number of deleted documents that appear before that position. The page will also come back short if any document inside the range is deleted.
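
    To make the offset problem concrete, here is a minimal sketch in plain Java with no Lucene dependency. It is an illustration only, not Lucene API: a java.util.BitSet stands in for the reader's deletion flags, and the class LiveDocPager and its method livePage are hypothetical names. The idea is to count only live documents toward the offset, so deleted ones neither shift the page nor shorten it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene: a BitSet stands in for the
// reader's deletion flags (bit i set == document number i is deleted).
class LiveDocPager {

    // Returns the document numbers of the live documents ranked
    // offset .. offset+count-1 among all live documents below maxDoc.
    static List<Integer> livePage(BitSet deleted, int maxDoc, int offset, int count) {
        List<Integer> page = new ArrayList<>();
        int liveSeen = 0; // live documents encountered so far
        for (int doc = 0; doc < maxDoc && page.size() < count; doc++) {
            if (deleted.get(doc)) {
                continue; // deleted documents do not consume an offset slot
            }
            if (liveSeen >= offset) {
                page.add(doc);
            }
            liveSeen++;
        }
        return page;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(4);
        // 10 documents, numbers 1 and 4 deleted; live ones are 0,2,3,5,6,7,8,9
        System.out.println(livePage(deleted, 10, 2, 3)); // prints [3, 5, 6]
    }
}
```

    With the same data, the naive loop from i=offset to offset+count-1 would visit 2, 3 and 4 and return only [2, 3]. The price of this sketch is a linear scan from document 0 for every page, which is why a collector-based approach can be attractive on a large index.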

    An alternative I found is something like this:

    IndexSearcher is = getIndexSearcher(); // e.g. new IndexSearcher(indexReader)
    // get all results without any conditions attached
    Term term = new Term([[any mandatory field name]], "*");
    Query query = new WildcardQuery(term);

    TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
    is.search(query, topCollector);

    // keep only the requested page of hits
    TopDocs topDocs = topCollector.topDocs(offset, count);
    

    Note: replace the text between [[ ]] with your own values. I ran this on a large index with 1.5 million entries and got 10 random results in less than a second. Agreed, it is slower, but at least deleted documents are ignored when you need pagination.

  • 2020-11-29 06:04

    In Lucene 4:

    // liveDocs is null when the index has no deletions
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    for (int i=0; i<reader.maxDoc(); i++) {
        // skip deleted documents
        if (liveDocs != null && !liveDocs.get(i))
            continue;

        Document doc = reader.document(i);
    }
    

    See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
