Transport Endpoint Not Connected - Mesos Slave / Master

后端未结


I'm trying to connect a Mesos slave to its master. Whenever the slave tries to connect to the master, I get the following message:

I0806 16:39:59.090845   93


                      


      
      
        
4 Answers

        
                         				            
            
           
            
                              
                
              
              
                
刺人心, 2021-01-04 00:24
                                                                       
I ran into this error in the logs when upgrading Mesos versions (e.g. 0.20.0 -> 0.27.0). Data left over from the previous version is sometimes incompatible with the new one.

Here is how I remedied it:

First, make sure the mesos-master service is stopped on all nodes:

    sudo service mesos-master stop

Then clear out all potentially stale data:

Remove $MESOS_WORK_DIR (/var/mesos in my case):

    sudo rm -rf /var/mesos

Clear out the Mesos data in ZooKeeper:

    $ zkCli.sh
    WatchedEvent state:SyncConnected type:None path:null
    [zk: localhost:2181(CONNECTED) 0] rmr /mesos
    [zk: localhost:2181(CONNECTED) 0] quit
    Quitting...

After these steps I started the mesos-master service on all nodes and the cluster came back online.
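
The cleanup above could be wrapped in a small script. This is only a sketch under assumptions: `run`, `reset_mesos_master` and `DRY_RUN` are hypothetical names introduced here, the work directory and ZooKeeper address default to the values shown in this answer, and it relies on `zkCli.sh` accepting a single command on its command line. With `DRY_RUN=1` (the default) it only prints what it would do.

```shell
#!/bin/sh
# Sketch: reset Mesos master state on one node (destructive when DRY_RUN=0).
# Defaults mirror the values used in the answer above.
WORK_DIR="${MESOS_WORK_DIR:-/var/mesos}"
ZK_SERVER="${ZK_SERVER:-localhost:2181}"
ZK_PATH="${ZK_PATH:-/mesos}"
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper; prints the command and executes it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

reset_mesos_master() {
    run sudo service mesos-master stop
    run sudo rm -rf "$WORK_DIR"
    # The ZooKeeper data is shared by all masters, so this step only needs
    # to succeed once. ZooKeeper 3.5+ deprecates rmr in favour of deleteall.
    run zkCli.sh -server "$ZK_SERVER" rmr "$ZK_PATH"
}

reset_mesos_master
```

Review the printed commands first, then rerun with `DRY_RUN=0` on each master node once you are sure the old state can be discarded.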
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
花落未央, 2021-01-04 00:28
                                                                       
I had a similar problem.
My slave logs were filled with
    E0812 15:58:04.017990  2193 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]

My master logged
    F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

The master would then die and a new election would occur. The killed master was restarted by upstart (I am on a CentOS 6 box) and added back into the pool of potential masters, so the elected master kept daisy-chaining around my master nodes. Restarting the masters and slaves many times did not help: the problem consistently returned within 1 minute of each master election.
The solution for me came from this Stack Overflow question (thanks) and a hint in a GitHub gist note.
The gist of it is that /etc/default/mesos-master must specify a quorum that is correct for the number of Mesos masters (3 in my case, so a quorum of 2):
    MESOS_QUORUM=2

This seems odd to me, as I have the same information in the file /etc/mesos-master/quorum.
But I added it to /etc/default/mesos-master, restarted the mesos-masters and slaves, and the problem has not returned.
I hope this helps you.
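
Mesos expects the quorum to be a strict majority of the masters. As a quick sanity check, the arithmetic can be sketched in shell; `quorum_for` is a hypothetical helper name, not part of Mesos.

```shell
#!/bin/sh
# quorum_for N: prints the smallest strict majority of N masters, i.e. floor(N/2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}

quorum_for 3   # prints 2, matching MESOS_QUORUM=2 for three masters
```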
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
太阳男子, 2021-01-04 00:38
                                                                       
Run the slave with --ip=10.129.62.49 instead
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
既然无缘, 2021-01-04 00:40
                                                                       
    I0806 16:39:59.091747   940 master.cpp:1006] Slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian) disconnected

This log line is the hint: the slave registers as slave(1)@127.0.1.1:5051, so it is advertising a loopback address rather than its routable IP.

Append --ip=10.129.62.49 to the slave command and it works.
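
A likely cause is that the machine's hostname resolves to 127.0.1.1 (for example through an /etc/hosts entry), and the slave picks that address up when --ip is not given. The sketch below checks for this before starting the slave. It assumes a Linux host with `getent` available; `is_loopback` is a hypothetical helper, and the master address in the final comment is a placeholder, since the original slave command is not shown.

```shell
#!/bin/sh
# is_loopback IP: succeeds if the IPv4 address is in 127.0.0.0/8.
is_loopback() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# First address the local hostname resolves to (empty if it does not resolve).
resolved_ip="$(getent hosts "$(hostname)" | awk '{ print $1; exit }')"

if is_loopback "$resolved_ip"; then
    echo "hostname resolves to loopback ($resolved_ip); pass --ip explicitly"
else
    echo "hostname resolves to ${resolved_ip:-nothing}"
fi

# Example (master address is a placeholder):
#   mesos-slave --master=<master-host>:5050 --ip=10.129.62.49
```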
                                                                        
                                                        
            
            
              
                