Get matched terms from Lucene query

南笙 2020-12-03 12:49

Given a Lucene search query like: +(letter:A letter:B letter:C) +(style:Capital), how can I tell which of the three letters actually matched any given document?

5 Answers
  • 2020-12-03 13:31

    Here is a simplified, non-recursive version for Lucene.NET 4.8.
    I have not verified it, but it should also work on Lucene.NET 3.x.

    IEnumerable<Term> GetHitTermsForDoc(Query query, IndexSearcher searcher, int docId)
    {
        //Rewrite query into simpler internal form, required for ExtractTerms
        var simplifiedQuery = query.Rewrite(searcher.IndexReader);
        HashSet<Term> queryTerms = new HashSet<Term>();
        simplifiedQuery.ExtractTerms(queryTerms);
    
        List<Term> hitTerms = new List<Term>();
        foreach (var term in queryTerms)
        {
            var termQuery = new TermQuery(term);
    
            var explanation = searcher.Explain(termQuery, docId);
            if (explanation.IsMatch)
            {
                hitTerms.Add(term);
            }
        }
        return hitTerms;
    }
    
  • 2020-12-03 13:38

    Although the sample is in C#, the Lucene APIs are very similar (apart from some upper/lower case differences), so it should not be hard to translate to Java.

    Usage:

    List<Term> terms = new List<Term>();    //will be filled with non-matched terms
    List<Term> hitTerms = new List<Term>(); //will be filled with matched terms
    GetHitTerms(query, searcher, docId, hitTerms, terms);
    

    And here is the method:

    void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest)
    {
        // Leaf case: Explain reports whether this single term matched the document.
        if (query is TermQuery)
        {
            if (searcher.Explain(query, docId).IsMatch())
                hitTerms.Add((query as TermQuery).GetTerm());
            else
                rest.Add((query as TermQuery).GetTerm());
            return;
        }
    
        // Recurse into every clause of a boolean query.
        if (query is BooleanQuery)
        {
            BooleanClause[] clauses = (query as BooleanQuery).GetClauses();
            if (clauses == null) return;
    
            foreach (BooleanClause bc in clauses)
            {
                GetHitTerms(bc.GetQuery(), searcher, docId, hitTerms, rest);
            }
            return;
        }
    
        // Rewrite multi-term queries into a boolean query over concrete terms, then recurse.
        if (query is MultiTermQuery)
        {
            if (!(query is FuzzyQuery)) // FuzzyQuery doesn't support SetRewriteMethod
                (query as MultiTermQuery).SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    
            GetHitTerms(query.Rewrite(searcher.GetIndexReader()), searcher, docId, hitTerms, rest);
        }
    }
    
  • 2020-12-03 13:41

    Based on the answer by @L.B, here is the code converted to Java, which works for me:

    void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest) throws IOException
    {
        if (query instanceof TermQuery)
        {
            if (searcher.explain(query, docId).isMatch())
                hitTerms.add(((TermQuery) query).getTerm());
            else
                rest.add(((TermQuery) query).getTerm());
            return;
        }
    
        if (query instanceof BooleanQuery)
        {
            for (BooleanClause clause : (BooleanQuery) query) {
                GetHitTerms(clause.getQuery(), searcher, docId, hitTerms, rest);
            }
            return;
        }
    
        if (query instanceof MultiTermQuery)
        {
            if (!(query instanceof FuzzyQuery)) // FuzzyQuery doesn't support setRewriteMethod
                ((MultiTermQuery) query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    
            GetHitTerms(query.rewrite(searcher.getIndexReader()), searcher, docId, hitTerms, rest);
        }
    }
    
  • 2020-12-03 13:42

    You could use a cached filter for each of the individual terms, and quickly check each doc id against their BitSets.
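
    The idea can be illustrated without any Lucene classes. The following is only a minimal sketch in plain Java: the class `TermBitSets` and its methods `addPosting` and `matchedTerms` are hypothetical names invented for this illustration, and the bit sets are filled by hand. In a real application they would presumably come from whatever cached per-term filter your Lucene version provides.

    ```java
    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    /** Plain-Java model of "one cached bit set per term"; hypothetical, not a Lucene API. */
    class TermBitSets {
        // term (e.g. "letter:A") -> bit set with one bit set per matching doc id
        private final Map<String, BitSet> bitsByTerm = new LinkedHashMap<>();

        /** Records that docId contains the term (assumption: a cached filter would supply this). */
        void addPosting(String term, int docId) {
            bitsByTerm.computeIfAbsent(term, t -> new BitSet()).set(docId);
        }

        /** Returns the terms whose bit is set for docId, in insertion order. */
        List<String> matchedTerms(int docId) {
            List<String> hits = new ArrayList<>();
            for (Map.Entry<String, BitSet> entry : bitsByTerm.entrySet()) {
                if (entry.getValue().get(docId)) {
                    hits.add(entry.getKey());
                }
            }
            return hits;
        }

        public static void main(String[] args) {
            TermBitSets index = new TermBitSets();
            index.addPosting("letter:A", 0);
            index.addPosting("letter:B", 1);
            index.addPosting("letter:C", 1);
            System.out.println(index.matchedTerms(1)); // prints [letter:B, letter:C]
        }
    }
    ```

    Each lookup is a single bit test per term, so the cost per document grows with the number of query terms rather than with the size of the index. The trade-off is the memory needed to keep one bit set per term.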

  • 2020-12-03 13:48

    I used essentially the same approach as @L.B, but updated it for Lucene 7.4.0, the newest version at the time of writing. Note: FuzzyQuery now supports .setRewriteMethod, which is why I removed the if.

    I also added handling for BoostQuery, and instead of the Terms I store the words that Lucene matched in a HashSet to avoid duplicates.

    private void saveHitWordInList(Query query, IndexSearcher indexSearcher,
        int docId, HashSet<String> hitWords) throws IOException {
      if (query instanceof TermQuery) {
        if (indexSearcher.explain(query, docId).isMatch()) {
          // text() returns the term text without the field name,
          // and stays correct even if the text itself contains ':'
          hitWords.add(((TermQuery) query).getTerm().text());
        }
      }

      if (query instanceof BooleanQuery) {
        for (BooleanClause clause : (BooleanQuery) query) {
          saveHitWordInList(clause.getQuery(), indexSearcher, docId, hitWords);
        }
      }
    
      if (query instanceof MultiTermQuery) {
        ((MultiTermQuery) query)
            .setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
        saveHitWordInList(query.rewrite(indexSearcher.getIndexReader()),
            indexSearcher, docId, hitWords);
      }
    
      // Unwrap a boost and inspect the query it wraps.
      if (query instanceof BoostQuery) {
        saveHitWordInList(((BoostQuery) query).getQuery(), indexSearcher, docId,
            hitWords);
      }
    }
    