Question
This is a follow-up to my previous question, Does huge number of deleted doc count affects ES query performance, about the deleted docs in my ES index.
As suggested in the answer, I used the optimize API, since I am on ES 1.x where the force merge API is not available. After reading about the optimize API in this GitHub link (provided earlier, as I couldn't find it on the ES site) by Shay Banon, founder of Elastic, it looks like it does the same work.
I got a success message for my index after running the optimize API, but I don't see the total count of deleted docs decreasing, which worries me. When I checked the segments of my index using the segments API, I saw more than 25 segments per shard, each holding 250 MB to 1 GB of data in memory and almost 500k docs, while some shards have only a few deleted docs.
So my questions are:
- My index has multiple shards spread across multiple data nodes. When I ran the optimize API against the URL of just one node, does it merge only the segments on that node?
- In the segments API result it shows the node id like
"node": "f2hsqeamadnaskda"
, while I am using the KOPF plugin and have custom names for my data nodes. How can I relate this cryptic node id to my human-readable node name, to check whether point 1 is correct or not?
- As there is no documentation available for the optimize API, is it possible to merge segments on all shards across all nodes in a single shot? And do I need to make the index read-only before applying it?
Answer 1:
- It merges segments based on segment state, size, and various other parameters, and it merges the segments of all the shards of an index. It looks like in your case you have a huge number of segments that are not being picked up by the optimize API, which makes you think the merge works only on one node's shards. You can pass the additional query param max_num_segments={your desired number of segments}, and it should reduce the segments of each shard to the given number.
- You can find the node id using the API http://:9200/_cat/nodes?v&h=id,ip,name, which gives output in the below format:
id ip name
SEax 10.10.10.94 foo
f2hs 10.10.10.95 bar
The ids in the first column are what you are seeing as cryptic in the segments API response; they are truncated in the above output, but the first four characters are unique and you can match them in the segments API result.
- Yes, it is possible to merge segments on all shards across all nodes in a single shot. As mentioned in my first point, just set {your desired no of segments} to 1. Also, it is recommended to make the index read-only, to avoid ending up with deleted documents in huge segments (more than 5 GB), which are never picked up by the optimize API, so those deleted documents would never be cleaned up, as explained in https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-forcemerge.html. If your overall index size is small and you are sure no single segment will ever cross 5 GB, then there is no need to put the index in read-only mode, provided you are OK with some performance degradation, as a segment merge is a costly operation. But sometimes stopping live traffic (indexing) isn't an option, so I would advise performing it when the cluster load is at its lowest.
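To illustrate the prefix matching described in point 2, here is a minimal sketch with made-up sample data: the truncated ids from _cat/nodes are prefixes of the full node id that the segments API returns, so a simple startswith check is enough to recover the human-readable name.

```python
# Hedged sketch with made-up sample data: map the truncated ids shown by
# _cat/nodes to the full node id that appears in the segments API response.
cat_nodes = {"SEax": "foo", "f2hs": "bar"}   # truncated id -> node name
segment_node_id = "f2hsqeamadnaskda"         # full id from the segments API

# The truncated id is a prefix of the full id, so startswith() finds the match.
name = next(n for tid, n in cat_nodes.items()
            if segment_node_id.startswith(tid))
print(name)  # -> bar
```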
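The read-only-then-merge flow from point 3 can be sketched as a sequence of HTTP calls. This is only an outline: "my_index" is a placeholder, and the index.blocks.write setting is an assumption you should verify against your own 1.x cluster before use.

```python
import json

# Hedged sketch: the read-only-then-merge sequence as (method, path, body)
# tuples. "my_index" and the blocks.write setting are assumptions.
steps = [
    ("PUT",  "/my_index/_settings", {"index": {"blocks.write": True}}),   # stop writes
    ("POST", "/my_index/_optimize?max_num_segments=1", None),             # merge to 1 segment/shard
    ("PUT",  "/my_index/_settings", {"index": {"blocks.write": False}}),  # re-enable writes
]
for method, path, body in steps:
    print(method, path, json.dumps(body) if body else "")
```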
Answer 2:
The force_merge or optimize call gets applied to the entire index; you don't have to run it at the node level.
You can use the _cat API to find the node-id-to-IP mapping. If your version does not support the _cat API (< 1.0), use the cluster state API.
Answer 3:
@Nirmal has answered your first two questions, so:
- As there is no documentation available for the optimize API, is it possible to merge segments on all shards across all nodes in a single shot? And do I need to make the index read-only before applying it?
There is documentation available for 1.x: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-optimize.html. You are probably looking for calls like these:
GET <index_pattern>/_cat/segments
: List all segments in all the shards (can be thousands). Also lists deleted docs.

POST <index_pattern>/_optimize?max_num_segments=1
: Force merge all segments down to a single segment per shard. Do this when the index is no longer being written to; it helps reduce CPU/RAM load on the data nodes.

POST <index_pattern>/_optimize?only_expunge_deletes=true
: Only remove deleted docs.

Finally, you can use * as <index_pattern> to apply this to all indices on the whole cluster.
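The calls above with "*" substituted as the index pattern can be sketched like this; the host is a placeholder for your own cluster, and the sketch only builds the URLs rather than sending live requests.

```python
# Hedged sketch: expanding the three calls above with "*" as the index
# pattern, so they apply to every index in the cluster. Host is a placeholder.
host = "http://localhost:9200"
pattern = "*"
calls = [
    ("GET",  f"{host}/{pattern}/_cat/segments"),
    ("POST", f"{host}/{pattern}/_optimize?max_num_segments=1"),
    ("POST", f"{host}/{pattern}/_optimize?only_expunge_deletes=true"),
]
for method, url in calls:
    print(method, url)
```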
Source: https://stackoverflow.com/questions/60204556/optimize-api-for-reducing-the-segments-and-eliminating-es-deleted-docs-not-worki