ElasticSearch: Unassigned Shards, how to fix?


I have an ES cluster with 4 nodes:

number_of_replicas: 1
search01 - master: false, data: false
search02 - master: true, data: true
search03 - master: false,

24 answers
  • 2020-12-04 05:39

    I got stuck today with the same shard allocation issue. The script that W. Andrew Loe III proposed in his answer didn't work for me, so I modified it a little and it finally worked:

    #!/usr/bin/env bash
    
    # The script performs force relocation of all unassigned shards, 
    # of all indices to a specified node (NODE variable)
    
    ES_HOST="<elasticsearch host>"
    NODE="<node name>"
    
    curl ${ES_HOST}:9200/_cat/shards > shards
    grep "UNASSIGNED" shards > unassigned_shards
    
    while read LINE; do
      IFS=" " read -r -a ARRAY <<< "$LINE"
      INDEX=${ARRAY[0]}
      SHARD=${ARRAY[1]}
    
      echo "Relocating:"
      echo "Index: ${INDEX}"
      echo "Shard: ${SHARD}"
      echo "To node: ${NODE}"
    
      curl -s -XPOST "${ES_HOST}:9200/_cluster/reroute" -d "{
        \"commands\": [
           {
             \"allocate\": {
               \"index\": \"${INDEX}\",
               \"shard\": ${SHARD},
               \"node\": \"${NODE}\",
               \"allow_primary\": true
             }
           }
         ]
      }"; echo
      echo "------------------------------"
    done <unassigned_shards
    
    rm shards
    rm unassigned_shards
    
    exit 0
    

    Now, I'm no Bash guru, but the script really worked for my case. Note that you'll need to specify appropriate values for the ES_HOST and NODE variables.
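
    A note if your cluster is newer than the one this script was written for (an assumption, not part of the original answer): from Elasticsearch 5.x the bare "allocate" reroute command was split into "allocate_replica", "allocate_stale_primary" and "allocate_empty_primary", and from 6.x the request needs a Content-Type header. A sketch of the adjusted call, reusing the same ES_HOST/INDEX/SHARD/NODE variables as the script above; "allocate_empty_primary" discards any existing data in that shard, so treat it as a last resort:

    # "accept_data_loss": true is required, because an empty primary wipes the shard's contents
    curl -s -H 'Content-Type: application/json' -XPOST "${ES_HOST}:9200/_cluster/reroute" -d "{
      \"commands\": [
        {
          \"allocate_empty_primary\": {
            \"index\": \"${INDEX}\",
            \"shard\": ${SHARD},
            \"node\": \"${NODE}\",
            \"accept_data_loss\": true
          }
        }
      ]
    }"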

  • 2020-12-04 05:41

    Another possible reason for unassigned shards is that your cluster is running more than one version of the Elasticsearch binary.

    shard replication from the more recent version to the previous versions will not work

    This can be a root cause for unassigned shards.

    Elastic Documentation - Rolling Upgrade Process
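
    If you suspect this, a quick way to check is to list every node's version (the host is a placeholder; any node in the cluster will do):

    curl -s 'localhost:9200/_cat/nodes?v&h=name,version,master'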

  • 2020-12-04 05:42

    I had the same problem but the root cause was a difference in version numbers (1.4.2 on two nodes (with problems) and 1.4.4 on two nodes (ok)). The first and second answers (setting "index.routing.allocation.disable_allocation" to false and setting "cluster.routing.allocation.enable" to "all") did not work.

    However, the answer by @Wilfred Hughes (setting "cluster.routing.allocation.enable" to "all" as a transient setting) failed with the following message:

    [NO(target node version [1.4.2] is older than source node version [1.4.4])]

    After updating the old nodes to 1.4.4, they started to resync with the other healthy nodes.
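
    For reference, the transient setting mentioned above can be applied roughly like this (the host is a placeholder; the Content-Type header is not needed on 1.x but does no harm):

    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.enable": "all"
      }
    }'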

  • 2020-12-04 05:42

    I had two indices with unassigned shards that didn't seem to be self-healing. I eventually resolved this by temporarily adding an extra data-node[1]. After the indices became healthy and everything stabilized to green, I removed the extra node and the system was able to rebalance (again) and settle on a healthy state.

    It's a good idea to avoid killing multiple data nodes at once (which is how I got into this state). Likely, I had failed to preserve any copies/replicas for at least one of the shards. Luckily, Kubernetes kept the disk storage around, and reused it when I relaunched the data-node.


    ...Some time has passed...

    Well, this time just adding a node didn't seem to be working (after waiting several minutes for something to happen), so I started poking around in the REST API.

    GET /_cluster/allocation/explain
    

    This showed my new node with "decision": "YES".

    By the way, all of the pre-existing nodes had "decision": "NO" due to "the node is above the low watermark cluster setting". So this was probably a different case than the one I had addressed previously.

    Then I made the following simple POST[2] with no body, which kicked things into gear...

    POST /_cluster/reroute
    

    Other notes:

    • Very helpful: https://datadoghq.com/blog/elasticsearch-unassigned-shards

    • Something else that may work: set cluster_concurrent_rebalance to 0, then back to null; a sketch of this follows the footnotes below.


    [1] Pretty easy to do in Kubernetes if you have enough headroom: just scale out the stateful set via the dashboard.

    [2] Using the Kibana "Dev Tools" interface, I didn't have to bother with SSH/exec shells.
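
    A sketch of the cluster_concurrent_rebalance toggle mentioned in the notes above, written as curl calls (the JSON bodies can be pasted into the Kibana Dev Tools console just as well; the host is a placeholder):

    # Temporarily forbid concurrent shard rebalancing...
    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.cluster_concurrent_rebalance": 0 }
    }'

    # ...then set it back to null to restore the default
    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.cluster_concurrent_rebalance": null }
    }'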

  • 2020-12-04 05:43

    For me, this was resolved by running this from the dev console: "POST /_cluster/reroute?retry_failed"

    I started by looking at the index list to see which indices were red, and then ran

    "GET /_cat/shards?h=[INDEXNAME],shard,prirep,state,unassigned.reason"

    which showed shards stuck in the ALLOCATION_FAILED state, so running the retry above caused them to retry the allocation.
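
    For completeness, the same two calls as curl commands (the host is a placeholder, and the _cat/shards column list is spelled out rather than using the [INDEXNAME] placeholder above):

    # Which shards are unassigned, and why?
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'

    # Retry allocations that failed too many times (ALLOCATION_FAILED)
    curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'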

  • 2020-12-04 05:43

    I tried several of the suggestions above and unfortunately none of them worked. We have a "Log" index in our lower environment where apps write their errors, on a single-node cluster. What solved it for me was checking the node's YML configuration file and seeing that it still had the default setting "gateway.expected_nodes: 2", which was overriding any other settings we had. Whenever we created an index on this node, it would try to spread 3 out of 5 shards to the phantom 2nd node. These would therefore appear as unassigned, and they could never be moved to the 1st and only node.

    The solution was editing the config, changing the setting "gateway.expected_nodes" to 1, so it would quit looking for its never-to-be-found brother in the cluster, and restarting the Elastic service instance. Also, I had to delete the index, and create a new one. After creating the index, the shards all showed up on the 1st and only node, and none were unassigned.

    # Set how many nodes are expected in this cluster. Once these N nodes
    # are up (and recover_after_nodes is met), begin recovery process immediately
    # (without waiting for recover_after_time to expire):
    #
    # gateway.expected_nodes: 2
    gateway.expected_nodes: 1
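
    Not part of the original fix, but if you do delete and recreate the index on a single-node cluster, creating it with zero replicas avoids another common source of unassigned shards. A sketch, with "logs" standing in for the real index name and the host as a placeholder:

    # Delete the broken index, then recreate it with settings suited to a single node
    curl -s -XDELETE 'localhost:9200/logs'
    curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/logs' -d '{
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 0
      }
    }'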
    