Question
When I use an analyzer with edgeNGram (min=3, max=7, front) and term_vector=with_positions_offsets,
with a document having text = "CouchDB",
when I search for "couc",
my highlight is on "cou" and not "couc".
It seems the highlight only covers the shortest matching token, "cou", while I would expect it to cover the exact token (if possible) or at least the longest token found.
Highlighting works fine when the text is indexed without term_vector=with_positions_offsets.
What is the performance impact of removing term_vector=with_positions_offsets?
Answer 1:
When you set term_vector=with_positions_offsets for a specific field, you are storing term vectors per document for that field.
When it comes to highlighting, term vectors allow you to use the Lucene fast vector highlighter, which is faster than the standard highlighter. The standard highlighter has no fast way to highlight because the index doesn't contain enough information (positions and offsets). It can only re-analyze the field content, collect offsets and positions along the way, and highlight based on that information. This can take quite a while, especially with long text fields.
With term vectors you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which will increase noticeably. I must add, though, that since Lucene 4.2 term vectors are better compressed and stored in an optimized way. There's also the new PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.
Elasticsearch automatically picks the best way to highlight based on the information available. If term vectors are stored, it uses the fast vector highlighter; otherwise it uses the standard one. After you reindex without term vectors, highlighting will be done by the standard highlighter. It will be slower, but the index will be smaller.
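To make the trade-off concrete, here is a minimal sketch of the two index definitions being compared, built as Python dicts that could be sent as the index creation body. The type name `doc` and the filter/analyzer names `front_edge_filter` and `front_edge_analyzer` are made up for illustration; the field name `text` and the n-gram parameters come from the question. I'm assuming the older mapping syntax from around the time of the question (a `string` field and an `edgeNGram` filter with `side`); newer releases renamed some of these, so check the docs for your version.

```python
import json


def build_index_body(with_term_vectors):
    """Build index settings and mapping for a front edge n-gram analyzed field."""
    # Hypothetical field definition; "text" is the field name used in the question.
    field = {"type": "string", "analyzer": "front_edge_analyzer"}
    if with_term_vectors:
        # Stores term vectors, which lets the fast vector highlighter be used.
        field["term_vector"] = "with_positions_offsets"
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "front_edge_filter": {
                        "type": "edgeNGram",
                        "min_gram": 3,
                        "max_gram": 7,
                        "side": "front",
                    }
                },
                "analyzer": {
                    "front_edge_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "front_edge_filter"],
                    }
                },
            }
        },
        "mappings": {"doc": {"properties": {"text": field}}},
    }


# Larger index, fast vector highlighter available.
with_tv = build_index_body(True)
# Smaller index, highlighting falls back to the standard highlighter.
without_tv = build_index_body(False)

print(json.dumps(without_tv["mappings"], indent=2))
```

Switching between the two requires a reindex, since term vectors are written at index time.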
Regarding ngram fields, the described behaviour is odd, since the fast vector highlighter is supposed to have better support for ngram fields. I would therefore expect exactly the opposite result.
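For reference, this is roughly what the analysis chain from the question produces, as a plain Python sketch. It is not how either highlighter works internally; it only lists the front edge n-grams and shows which matching token is the shortest and which is the longest. The helper `edge_ngrams` is hypothetical, and I'm assuming each gram carries offsets covering just its own characters, which may not hold for every Lucene version.

```python
def edge_ngrams(text, min_gram=3, max_gram=7):
    """Return front edge n-grams of the lowercased text as (token, start, end)."""
    word = text.lower()
    upper = min(max_gram, len(word))
    # Assumption: every gram starts at offset 0 and ends after its own length.
    return [(word[:n], 0, n) for n in range(min_gram, upper + 1)]


indexed = edge_ngrams("CouchDB")  # tokens stored for the document
queried = edge_ngrams("couc")     # tokens produced from the search input

indexed_terms = {token for token, _, _ in indexed}
matches = [gram for gram in queried if gram[0] in indexed_terms]

shortest = min(matches, key=lambda gram: gram[2])  # the highlight reported in the question
longest = max(matches, key=lambda gram: gram[2])   # the highlight the asker expected

print([token for token, _, _ in indexed])
print(shortest, longest)
```

Both "cou" and "couc" match at the same start offset, so the question comes down to which of the overlapping matches the highlighter ends up using.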
Answer 2:
I know this question is old, but it has not been answered completely yet.
There is another option that can lead to this kind of strange behaviour:
You have to set require_field_match to true if you don't want query terms that matched other fields to influence the highlighting of this field; see: http://www.elasticsearch.org/guide/reference/api/search/highlighting/
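A minimal sketch of a search body with that option set, again as a Python dict. The helper `build_search_body` is made up for illustration, and I'm assuming a `match` query on the `text` field from the question; adapt the query to whatever you actually run.

```python
import json


def build_search_body(term, field="text"):
    """Build a search request body that highlights only fields the query matched."""
    return {
        "query": {"match": {field: term}},
        "highlight": {
            # Only highlight a field if the query actually matched that field.
            "require_field_match": True,
            "fields": {field: {}},
        },
    }


body = build_search_body("couc")
print(json.dumps(body, indent=2))
```

The option can also be set per field inside `fields` if you only want it for some of them.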
Source: https://stackoverflow.com/questions/11303660/elasticsearch-edgengram-highlight-term-vector-bad-highlights