Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

Submitted by 一曲冷凌霜 on 2019-12-21 12:39:55

Question


When I use an analyzer with an edgengram filter (min=3, max=7, front) together with term_vector=with_positions_offsets,

with a document having text = "CouchDB",

and I search for "couc",

my highlight is on "cou" and not "couc".


It seems my highlight covers only the shortest matching token, "cou", while I would expect it to cover the exact token (if possible) or at least the longest token found.
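To make the tokens involved concrete, here is a minimal Python sketch of front edge n-grams. It is not Lucene's implementation; it assumes the analyzer lowercases the text and that the same analyzer is applied to the query at search time (neither is stated in the question). The helper name edge_ngrams is made up for illustration.

```python
def edge_ngrams(token, min_gram=3, max_gram=7):
    """Return the front edge n-grams of one token (assumes lowercasing)."""
    token = token.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Tokens produced for the indexed text and for the query text.
indexed = edge_ngrams("CouchDB")
query = edge_ngrams("couc")

# Query tokens that also exist in the indexed field.
shared = [t for t in query if t in indexed]

print(indexed)  # ['cou', 'couc', 'couch', 'couchd', 'couchdb']
print(query)    # ['cou', 'couc']
print(shared)   # ['cou', 'couc']
```

Under those assumptions the query matches on both "cou" and "couc", so a highlighter has two overlapping candidates to choose from. Whether that is what actually happens in this setup is a guess, not something the question confirms.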

Highlighting works fine when the text is analyzed without term_vector=with_positions_offsets.

What is the performance impact of removing term_vector=with_positions_offsets?


Answer 1:


When you set term_vector=with_positions_offsets for a specific field it means that you are storing the term vectors per document, for that field.
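As a rough sketch, an index definition for the setup in the question could look like the request body below, written as a Python dict. The names front_edge_3_7, edge_analyzer and the mapping type doc are made up for illustration, and the standard tokenizer and lowercase filter are assumptions. The syntax follows the old style from around the time of the question ("string" fields, "edgeNGram" with "side"); newer Elasticsearch versions use "text", "edge_ngram" and no mapping types.

```python
import json

# Hypothetical index body: edge n-gram analyzer plus stored term vectors.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "front_edge_3_7": {  # illustrative name
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 7,
                    "side": "front",
                }
            },
            "analyzer": {
                "edge_analyzer": {  # illustrative name
                    "type": "custom",
                    "tokenizer": "standard",  # assumption
                    "filter": ["lowercase", "front_edge_3_7"],
                }
            },
        }
    },
    "mappings": {
        "doc": {  # illustrative mapping type
            "properties": {
                "text": {
                    "type": "string",
                    "analyzer": "edge_analyzer",
                    # Store positions and offsets per document for this field.
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Dropping the term_vector line and reindexing gives the variant without stored term vectors.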

When it comes to highlighting, term vectors allow you to use the Lucene fast vector highlighter, which is faster than the standard highlighter. The reason is that the standard highlighter has no fast way to highlight, since the index doesn't contain enough information (positions and offsets). It can only re-analyze the field content, collect the offsets and positions, and highlight based on that information. This can take quite a while, especially with long text fields.

With term vectors you do have enough information and don't need to re-analyze the text. The downside is the size of the index, which will increase noticeably. I must add, though, that since Lucene 4.2 term vectors are better compressed and stored in an optimized way. There is also the new PostingsHighlighter, based on the ability to store offsets in the postings list, which requires even less space.

Elasticsearch automatically picks the best way to highlight based on the information available. If term vectors are stored, it will use the fast vector highlighter, otherwise the standard one. After you reindex without term vectors, highlighting will be done with the standard highlighter. It will be slower but the index will be smaller.
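If you want to compare the two highlighters without reindexing, one option is to request a specific one per field. The sketch below assumes an Elasticsearch version that supports the per-field highlight "type" option with the values "plain" (standard highlighter) and "fvh" (fast vector highlighter); very old releases may not have it. The helper name search_body and the field name text are illustrative.

```python
def search_body(query_text, highlighter_type):
    """Build a search body that asks for a specific highlighter on 'text'."""
    return {
        "query": {"match": {"text": query_text}},
        "highlight": {
            "fields": {
                # "plain" re-analyzes the text, "fvh" needs term vectors.
                "text": {"type": highlighter_type}
            }
        },
    }


plain_request = search_body("couc", "plain")
fvh_request = search_body("couc", "fvh")
print(plain_request)
print(fvh_request)
```

Sending both bodies against the same index and comparing the returned fragments would show whether the bad highlight is specific to the fast vector highlighter.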

Regarding ngram fields, the described behaviour is weird, since the fast vector highlighter should have better support for ngram fields; I would expect exactly the opposite result.




Answer 2:


I know this question is old, but it was not yet answered completely:

There is another option that can lead to such strange behaviour:

You have to set require_field_match to true if you want a field to be highlighted only when the query actually matched on that field, so that query terms aimed at other fields do not influence its highlighting; see: http://www.elasticsearch.org/guide/reference/api/search/highlighting/
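For illustration, a search body with this option set could look like the Python dict below. The field name text and the match query are assumptions taken from the question's example, not from this answer.

```python
import json

# Hypothetical search body: highlight 'text' only where the query matched it.
body = {
    "query": {"match": {"text": "couc"}},
    "highlight": {
        "require_field_match": True,
        "fields": {"text": {}},
    },
}

print(json.dumps(body, indent=2))
```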



Source: https://stackoverflow.com/questions/11303660/elasticsearch-edgengram-highlight-term-vector-bad-highlights
