Given a Lucene search query like +(letter:A letter:B letter:C) +(style:Capital), how can I tell which of the three letters actually matched any given document?
Here is a simplified, non-recursive version for Lucene.NET 4.8.
I have not verified it, but it should also work on Lucene.NET 3.x.
IEnumerable<Term> GetHitTermsForDoc(Query query, IndexSearcher searcher, int docId)
{
    // Rewrite the query into its simpler internal form; required for ExtractTerms
    var simplifiedQuery = query.Rewrite(searcher.IndexReader);
    HashSet<Term> queryTerms = new HashSet<Term>();
    simplifiedQuery.ExtractTerms(queryTerms);

    List<Term> hitTerms = new List<Term>();
    foreach (var term in queryTerms)
    {
        // Test each term on its own against the document
        var termQuery = new TermQuery(term);
        var explanation = searcher.Explain(termQuery, docId);
        if (explanation.IsMatch)
        {
            hitTerms.Add(term);
        }
    }
    return hitTerms;
}
Although the sample is in C#, the Lucene APIs are very similar (apart from some upper/lower case differences), so it should not be hard to translate to Java.
This is the usage:
List<Term> terms = new List<Term>();    // will be filled with non-matched terms
List<Term> hitTerms = new List<Term>(); // will be filled with matched terms
GetHitTerms(query, searcher, docId, hitTerms, terms);
And here is the method:
void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest)
{
    if (query is TermQuery)
    {
        // Leaf: ask Lucene whether this single term matches the document
        if (searcher.Explain(query, docId).IsMatch())
            hitTerms.Add((query as TermQuery).GetTerm());
        else
            rest.Add((query as TermQuery).GetTerm());
        return;
    }

    if (query is BooleanQuery)
    {
        // Recurse into every clause of the boolean query
        BooleanClause[] clauses = (query as BooleanQuery).GetClauses();
        if (clauses == null) return;

        foreach (BooleanClause bc in clauses)
        {
            GetHitTerms(bc.GetQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    if (query is MultiTermQuery)
    {
        // Rewrite wildcard/prefix/range queries into a BooleanQuery of TermQuery clauses
        if (!(query is FuzzyQuery)) // FuzzyQuery doesn't support SetRewriteMethod
            (query as MultiTermQuery).SetRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        GetHitTerms(query.Rewrite(searcher.GetIndexReader()), searcher, docId, hitTerms, rest);
    }
}
Following the answer given by @L.B, here is the code converted to Java, which works for me:
void GetHitTerms(Query query, IndexSearcher searcher, int docId, List<Term> hitTerms, List<Term> rest) throws IOException
{
    if (query instanceof TermQuery)
    {
        // Leaf: ask Lucene whether this single term matches the document
        if (searcher.explain(query, docId).isMatch())
            hitTerms.add(((TermQuery) query).getTerm());
        else
            rest.add(((TermQuery) query).getTerm());
        return;
    }

    if (query instanceof BooleanQuery)
    {
        // Recurse into every clause of the boolean query
        for (BooleanClause clause : (BooleanQuery) query) {
            GetHitTerms(clause.getQuery(), searcher, docId, hitTerms, rest);
        }
        return;
    }

    if (query instanceof MultiTermQuery)
    {
        // Rewrite wildcard/prefix/range queries into a BooleanQuery of TermQuery clauses
        if (!(query instanceof FuzzyQuery)) // FuzzyQuery doesn't support setRewriteMethod
            ((MultiTermQuery) query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        GetHitTerms(query.rewrite(searcher.getIndexReader()), searcher, docId, hitTerms, rest);
    }
}
You could use a cached filter for each of the individual terms, and quickly check each doc id against their BitSets.
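To illustrate the idea, here is a minimal sketch in plain Java. It does not use Lucene at all: java.util.BitSet stands in for the doc-id set that a cached filter would hold for each term, and the class name TermHitBitSets and its methods are made up for this example. In a real setup the bit sets would be built once from the index and cached, rather than filled by hand.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: plain java.util.BitSet stands in for the doc-id sets
// that a cached Lucene filter would hold for each individual term.
class TermHitBitSets {
    // One bit set per term ("field:text"); bit i is set when doc i contains the term.
    private final Map<String, BitSet> docsByTerm = new LinkedHashMap<>();

    // Record that docId contains the term (assumption: filled by hand here,
    // whereas real code would derive this from the index once and cache it).
    void add(String term, int docId) {
        docsByTerm.computeIfAbsent(term, t -> new BitSet()).set(docId);
    }

    // Return the terms whose bit set contains docId, in insertion order.
    List<String> hitTerms(int docId) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, BitSet> entry : docsByTerm.entrySet()) {
            if (entry.getValue().get(docId)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TermHitBitSets index = new TermHitBitSets();
        index.add("letter:A", 0);
        index.add("letter:B", 0);
        index.add("letter:C", 1);
        System.out.println(index.hitTerms(0)); // prints [letter:A, letter:B]
    }
}
```

Each lookup is a single bit test per term, so checking many doc ids against a handful of terms stays cheap once the sets exist.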
I basically used the same approach as @L.B, but updated it for Lucene 7.4.0, the newest version at the time of writing. Note: FuzzyQuery now supports .setRewriteMethod, which is why I removed the if.
I also added handling for BoostQuery and, instead of collecting the Terms, saved the words that Lucene matched in a HashSet to avoid duplicates.
private void saveHitWordInList(Query query, IndexSearcher indexSearcher,
        int docId, HashSet<String> hitWords) throws IOException {
    if (query instanceof TermQuery) {
        if (indexSearcher.explain(query, docId).isMatch()) {
            // text() returns the term text without the field name; unlike
            // splitting toString() on ":", it also works when the text contains ':'
            hitWords.add(((TermQuery) query).getTerm().text());
        }
    }
    if (query instanceof BooleanQuery) {
        // Recurse into every clause of the boolean query
        for (BooleanClause clause : (BooleanQuery) query) {
            saveHitWordInList(clause.getQuery(), indexSearcher, docId, hitWords);
        }
    }
    if (query instanceof MultiTermQuery) {
        // Rewrite wildcard/prefix/fuzzy queries into a BooleanQuery of term clauses
        ((MultiTermQuery) query)
                .setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
        saveHitWordInList(query.rewrite(indexSearcher.getIndexReader()),
                indexSearcher, docId, hitWords);
    }
    if (query instanceof BoostQuery) {
        // Unwrap the boosted query and inspect the query inside it
        saveHitWordInList(((BoostQuery) query).getQuery(), indexSearcher, docId,
                hitWords);
    }
}