Question
Over the past few days our ES 7.4 cluster (4 nodes) has regularly been giving read timeouts and has been getting slower and slower at running certain management commands. Before that it had been running for more than a year without any trouble. For instance, /_cat/nodes took 2 minutes to execute yesterday; today it is already taking 4 minutes. Server loads are low and memory usage seems fine, so I am not sure where to look further.
Using the opster.com online tool I got a hint that the management queue size is high. However, when I run the commands it suggests for investigating, I don't see anything out of the ordinary, other than that the command itself takes a long time to return a result:
$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   345  100   345    0     0      2      0  0:02:52  0:02:47  0:00:05    90
id                     active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA      1        0   4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg      1        0   4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ      5        0   4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w      1        0   4592766 elastic7-3
How can I debug this / solve this? Thanks in advance.
Answer 1:
If you look at the output, your elastic7-2 node has 5 active threads in the management thread pool, which is really high: the management pool's maximum size is just 5 threads, so that node's pool is completely saturated, and it is used only for a small set of operations (cluster management calls such as the _cat APIs, not search/index).
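To double-check that it is the pool size rather than a growing queue that is the bottleneck, you can ask the same API for a few extra columns. A minimal sketch, assuming the cluster is reachable on the same 127.0.0.1:9201 endpoint as in your question (the column names are standard _cat/thread_pool fields):
$ curl -s "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=node_name,name,type,active,pool_size,max,queue,rejected,completed"
If active stays pinned at max (5) on elastic7-2 while queue keeps growing, every management call (including /_cat/nodes) has to wait its turn, which would match the multi-minute response times you are seeing.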
You can have a look at Opster's article on the various thread pools in Elasticsearch for further reading.
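To see what those 5 management threads are actually doing, you can pull a hot threads dump from that node. A sketch, again assuming the same host/port as above, with the node ID taken from your own output for elastic7-2:
$ curl -s "http://127.0.0.1:9201/_nodes/cI-cn4V3RP65qvE3ZR8MXQ/hot_threads?threads=10"
The stack traces should show whether those threads are busy with stats collection, which often points at a monitoring agent or dashboard polling the _cat/stats APIs too aggressively.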
Source: https://stackoverflow.com/questions/65034085/something-inside-elasticsearch-7-4-cluster-is-getting-slower-and-slower-with-rea