Prevent “Too Many Clauses” on lucene query

Question

In my tests I suddenly hit a TooManyClauses exception when trying to get the hits from a BooleanQuery that consisted of a TermQuery and a WildcardQuery.

I searched around the net, and the resources I found suggest raising the limit with BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. What should I raise it to? How can I be sure that the new magic number will be sufficient for my query? How far can I increase it before all hell breaks loose?

In general I feel this is not a solution. There must be a deeper problem.

The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.

Answer 1:

The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?

Lucene expands a prefix query into a BooleanQuery containing one clause for every indexed term that matches the prefix. In your case, apparently more than 1024 distinct paintCode terms begin with an "a", and 1024 is the default maximum clause count.
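To make the expansion concrete, here is a minimal plain-Java simulation of it. It is only a sketch and not Lucene's actual code: the class name, the `expand` method, and the use of `IllegalStateException` in place of Lucene's TooManyClauses are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixExpansionSketch {
    // Mirrors Lucene's default limit; hard-coded here for the simulation.
    static final int MAX_CLAUSES = 1024;

    /** Collects every indexed term starting with the prefix, failing once the limit is exceeded. */
    static List<String> expand(SortedSet<String> terms, String prefix) {
        List<String> clauses = new ArrayList<>();
        // Terms are sorted, so all matches form one contiguous run starting at the prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last matching term
            }
            if (clauses.size() >= MAX_CLAUSES) {
                // Stand-in for Lucene's TooManyClauses exception (assumption).
                throw new IllegalStateException("too many clauses: more than " + MAX_CLAUSES);
            }
            clauses.add(term); // one clause per matching term
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("a" + i); // 2000 distinct paint codes starting with "a"
        }
        terms.add("b1");

        System.out.println(expand(terms, "b").size()); // prints 1
        try {
            expand(terms, "a");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "too many clauses: more than 1024"
        }
    }
}
```

The point of the sketch is that the number of clauses depends on how many distinct terms share the prefix, not on how many documents match, which is why a one-letter prefix on a high-cardinality field hits the limit so easily.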

If it sounds to you like prefix queries are useless, you're not far from the truth.

I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.

ADDED

If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line

query.add(tq, BooleanClause.Occur.SHOULD);          // add to query

to

try {
    query.add(tq, BooleanClause.Occur.SHOULD);          // add to query
} catch (BooleanQuery.TooManyClauses e) {
    break;
}

I did this for my own project and it works.

If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.

Answer 2:

It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).

There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya

The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.

Example:

Imagine a paintCode like this:

"a4c2d3"

When indexing this value, you create the following field values in your document:

[paintCode]: "a4c2d3"

[paintCode1n]: "a"

[paintCode2n]: "a4"

[paintCode3n]: "a4c"

At query time, the number of characters in your term decides which field to search on. This means that you perform a prefix query only for terms with more than 3 characters, which greatly reduces the number of terms the prefix expands to and prevents the infamous TooManyClauses exception. Apparently this also speeds up the searching process.

You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
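A rough sketch of that automation in plain Java might look like the following. The class and method names are hypothetical, the field names follow the paintCode1n / paintCode2n / paintCode3n scheme from the example above, and the returned query strings are only illustrative: real code would build Lucene Field and Query objects and escape special characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixFieldsSketch {
    // Longest prefix that gets its own field; 3 matches the example above.
    static final int MAX_PREFIX_FIELD_LENGTH = 3;

    /** Field values to index for one paint code: the full value plus one field per prefix length. */
    static Map<String, String> fieldsFor(String paintCode) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("paintCode", paintCode);
        int longest = Math.min(MAX_PREFIX_FIELD_LENGTH, paintCode.length());
        for (int n = 1; n <= longest; n++) {
            fields.put("paintCode" + n + "n", paintCode.substring(0, n));
        }
        return fields;
    }

    /** Picks the field for a typed prefix; short prefixes become exact term lookups. */
    static String queryFor(String prefix) {
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must not be empty");
        }
        if (prefix.length() <= MAX_PREFIX_FIELD_LENGTH) {
            // Exact match on the precomputed prefix field: a single term, no expansion.
            return "paintCode" + prefix.length() + "n:" + prefix;
        }
        // Long enough that a real prefix query should expand to few terms.
        return "paintCode:" + prefix + "*";
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor("a4c2d3"));
        // prints {paintCode=a4c2d3, paintCode1n=a, paintCode2n=a4, paintCode3n=a4c}
        System.out.println(queryFor("a"));    // prints paintCode1n:a
        System.out.println(queryFor("a4c2")); // prints paintCode:a4c2*
    }
}
```

With this scheme the query from the question would become something like +companyName:mercedes +paintCode1n:a, which is a single term lookup instead of an expansion over every paint code starting with "a".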

Some issues may arise if you have multiple tokens for each field. You can find more details in the linked article.

Source: https://stackoverflow.com/questions/614758/prevent-too-many-clauses-on-lucene-query

Tags

lucene