I'm building a Lucene index and adding Documents.
I have a field that is multi-valued; for this example I'll use Categories.
An Item can have many categories.
If you use the StandardAnalyzer, it is fine to put all categories into one field value separated by commas or spaces, because the analyzer splits the text into tokens at those characters. With another Analyzer, it depends on how that analyzer tokenizes the text.
Another way: add the same field several times, once per category. In that case I would recommend using KeywordAnalyzer or leaving the field untokenized, so that a query matches the exact category name.
This would be a better way to index multi-valued fields per document:
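// getCategories() returns e.g. "category1, category2, cat3" from a DB call
String categoriesForItem = getCategories();
// Split on commas and drop the surrounding whitespace, so " category2" is not indexed with a leading space
String[] categories = categoriesForItem.trim().split("\\s*,\\s*");
for (String category : categories) {
    // StringField is indexed as a single untokenized term; doc is a Document
    doc.add(new StringField("categories", category, Field.Store.YES));
}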
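Because an untokenized field only matches its exact value, stray whitespace around a category name matters: splitting "category1, category2, cat3" on "," alone would leave " category2" with a leading space. Here is a minimal sketch of a helper that trims each name before it is added to the document. The names CategorySplitter and splitCategories are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

final class CategorySplitter {

    /** Splits a comma-separated string into trimmed, non-empty category names. */
    static List<String> splitCategories(String raw) {
        List<String> result = new ArrayList<>();
        if (raw == null) {
            return result;
        }
        for (String part : raw.split(",")) {
            String name = part.trim();   // remove whitespace left over around the comma
            if (!name.isEmpty()) {       // skip empty entries such as "a,,b" or a trailing comma
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [category1, category2, cat3]
        System.out.println(splitCategories("category1, category2, cat3"));
    }
}
```

Each element of the returned list can then be added as its own "categories" field value.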
Whenever multiple fields with the same name appear in one document, both the inverted index and term vectors will logically append the tokens of the field to one another, in the order the fields were added.
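Also, during the analysis phase two values of the same field can be separated by a position increment gap, which the analyzer supplies through Analyzer.getPositionIncrementGap(fieldName). Note that Lucene's base Analyzer returns 0 here, so you need an analyzer that returns a larger gap (Solr's stock text field types, for example, set positionIncrementGap="100"). Let me explain why this is needed.
Suppose your field "categories" in Document D1 has two values, "foo bar" and "foo baz". If you were to run the phrase query "bar foo", D1 should not come up. This is ensured by adding an extra position increment between two values of the same field.
If you concatenate the field values yourself and rely on the analyzer to split them into multiple values, "bar" and "foo" become adjacent tokens, so "bar foo" would return D1, which would be incorrect.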
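The effect of the gap can be modelled in a few lines of plain Java. This is a simplified sketch and not Lucene's implementation: it assumes whitespace tokenization and one position per token, and the names PositionGapDemo, assignPositions and matchesPhrase are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PositionGapDemo {

    /** Assigns a position to every token, adding `gap` extra positions between two values. */
    static Map<Integer, String> assignPositions(List<String> values, int gap) {
        Map<Integer, String> tokenAt = new HashMap<>();
        int position = -1;          // position of the last token emitted so far
        boolean firstValue = true;
        for (String value : values) {
            if (!firstValue) {
                position += gap;    // the extra increment between two values of the same field
            }
            firstValue = false;
            for (String token : value.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;       // an empty value produces no tokens
                }
                position += 1;
                tokenAt.put(position, token);
            }
        }
        return tokenAt;
    }

    /** Returns true if the phrase terms occur at consecutive positions. */
    static boolean matchesPhrase(Map<Integer, String> tokenAt, String... phrase) {
        for (int start : tokenAt.keySet()) {
            boolean allMatch = true;
            for (int i = 0; i < phrase.length; i++) {
                if (!phrase[i].equals(tokenAt.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("foo bar", "foo baz");
        // Gap 0 behaves like one concatenated value: "bar" (1) and "foo" (2) are adjacent.
        System.out.println(matchesPhrase(assignPositions(values, 0), "bar", "foo"));   // true
        // Gap 100 moves the second value to positions 102 and 103, so the phrase no longer matches.
        System.out.println(matchesPhrase(assignPositions(values, 100), "bar", "foo")); // false
        // Phrases inside a single value still match.
        System.out.println(matchesPhrase(assignPositions(values, 100), "foo", "baz")); // true
    }
}
```

With a gap of 0 the two values behave like one concatenated string, which is exactly the false match described above. Any gap larger than the slop you allow in phrase queries keeps the values apart.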