Node cannot join Swarm Cluster

Question

I have 3 VMs. They all run Docker 1.12 on CentOS 7. All the ports are open and the VMs are able to ping each other. I started my cluster with:

docker swarm init --advertise-addr 192.168.140.12

Docker info showed me:

Swarm: active
 NodeID: 0drcj2nku1mv8t16fxva48edxx
 Is Manager: true
 ClusterID: cchn0yzospwoe1h9f55d7omxx
 Managers: 1
 Nodes: 1

Now I try to join nodes (the other VMs) to the cluster. I use the command that was printed after initializing my manager.

docker swarm join \
     --token SWMTKN-1-48ythur5k6ckkz90ttlprw37p9z3ldclws51qirw5wdyfmvevr-3sb2t66b2fj6e4dhmfo1vavxx \
     192.168.140.12:2377

But I got:

Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.

Docker info showed me:

Swarm: pending
 NodeID:
 Error: rpc error: code = 1 desc = context canceled
 Is Manager: false
 Node Address: 192.168.140.14

On manager of cluster:

# netstat -tulpn | grep docker
tcp6       0      0 :::2377                 :::*                    LISTEN      1602/dockerd
tcp6       0      0 :::7946                 :::*                    LISTEN      1602/dockerd
tcp6       0      0 :::8080                 :::*                    LISTEN      3398/docker-proxy
tcp6       0      0 :::32768                :::*                    LISTEN      3199/docker-proxy
tcp6       0      0 :::32769                :::*                    LISTEN      3219/docker-proxy
tcp6       0      0 :::32770                :::*                    LISTEN      3341/docker-proxy
tcp6       0      0 :::32771                :::*                    LISTEN      3436/docker-proxy
tcp6       0      0 :::2375                 :::*                    LISTEN      1602/dockerd
udp6       0      0 :::7946                 :::*                                1602/dockerd

How can I debug this issue, or did I forget to perform some important step? Do the servers need SSH access to each other? Thanks

logs on node:

Aug  8 09:50:24 localhost dockerd: time="2016-08-08T09:50:24.393432145-04:00" level=error msg="Handler for POST /v1.24/swarm/leave returned error: This node is not part of swarm"
Aug  8 09:51:01 localhost su: (to root) worker1 on pts/1
Aug  8 09:51:34 localhost dockerd: time="2016-08-08T09:51:34.384408514-04:00" level=error msg="Handler for POST /v1.24/swarm/join returned error: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use \"docker info\" command to see the current swarm status of your node."
Aug  8 09:51:40 localhost su: (to root) worker1 on pts/1
Aug  8 09:52:47 localhost dhclient[1277]: DHCPREQUEST on eno16777736 to 192.168.140.254 port 67 (xid=0x11f8fba8)
Aug  8 09:52:47 localhost dhclient[1277]: DHCPACK from 192.168.140.254 (xid=0x11f8fba8)
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    address 192.168.140.13
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    plen 24 (255.255.255.0)
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    gateway 192.168.140.2
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    server identifier 192.168.140.254
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    lease time 1800
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    nameserver '192.168.140.2'
Aug  8 09:52:47 localhost NetworkManager[953]: <info>    domain name 'localdomain'
Aug  8 09:52:47 localhost NetworkManager[953]: <info>  (eno16777736): DHCPv4 state changed bound -> bound
Aug  8 09:52:47 localhost dbus[878]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Aug  8 09:52:47 localhost dbus-daemon: dbus[878]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Aug  8 09:52:47 localhost systemd: Starting Network Manager Script Dispatcher Service...
Aug  8 09:52:47 localhost dhclient[1277]: bound to 192.168.140.13 -- renewal in 713 seconds.
Aug  8 09:52:47 localhost dbus[878]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug  8 09:52:47 localhost dbus-daemon: dbus[878]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug  8 09:52:47 localhost nm-dispatcher: Dispatching action 'dhcp4-change' for eno16777736
Aug  8 09:52:47 localhost systemd: Started Network Manager Script Dispatcher Service.

Sometimes warnings:

level=warning msg="failed to retrieve remote root CA certificate: rpc error: code = 1 desc = context canceled

Answer 1:

Maybe you are using an HTTP proxy.

You can use the following command to see what dockerd is doing.

# strace -Fp `pidof dockerd` 2>&1 |grep -v futex |grep -v epoll_wait |grep -v pselect

Answer 2:

The hostname on all my VMs was localhost.localdomain. I changed the hostnames in /etc/hosts on each server and rebooted. Now I'm able to create my swarm cluster and add nodes successfully.
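A minimal sketch of how the hostname check could be scripted, assuming a systemd-based distribution such as CentOS 7 where hostnamectl is available. The function name is_placeholder_hostname and the example hostname worker1 are illustrative choices, not something the answer prescribes.

```shell
#!/usr/bin/env bash
# is_placeholder_hostname NAME
# Succeeds (exit 0) if NAME looks like the installer default that every
# VM in the question shared, e.g. "localhost" or "localhost.localdomain".
is_placeholder_hostname() {
    case "$1" in
        localhost|localhost.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on each VM (hypothetical hostname "worker1"):
#   if is_placeholder_hostname "$(hostname)"; then
#       sudo hostnamectl set-hostname worker1
#   fi
```

Each VM should end up with a distinct name. The answer above edited /etc/hosts and rebooted; hostnamectl is offered here only as one possible way to set the name persistently.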

Answer 3:

As explained by wenjianhn, make sure that you have not configured an HTTP proxy for Docker on your worker node (as described here). Swarm nodes communicate over HTTP (gRPC over HTTP/2, default port 2377), so if a proxy is configured, the worker node will send that traffic through it, even if the manager node sits in your LAN.
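One way to check for a proxy is to inspect the environment that systemd passes to the Docker daemon. The sketch below assumes Docker runs as the systemd unit docker; the helper name has_proxy is hypothetical.

```shell
#!/usr/bin/env bash
# has_proxy ENV_STRING
# Succeeds (exit 0) if ENV_STRING sets HTTP_PROXY or HTTPS_PROXY
# (upper or lower case). NO_PROXY alone does not count.
has_proxy() {
    printf '%s\n' "$1" | grep -qiE '(^|[[:space:]=])https?_proxy='
}

# Example usage on the worker node:
#   env_line=$(systemctl show --property=Environment docker)
#   if has_proxy "$env_line"; then
#       echo "Docker daemon has a proxy configured: $env_line"
#   fi
```

If a proxy turns out to be required for pulling images, listing the manager's address in NO_PROXY in the same systemd drop-in file may be enough to keep the join traffic off the proxy; that is an assumption to verify on your own setup.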

Also, make sure that no firewall is blocking the traffic on port 2377:

user@workernode$ telnet ip-of-manager 2377

If you can't open a telnet connection on port 2377, the port is blocked by a firewall (either the worker node's or the manager's).
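If telnet is not installed, the same reachability test can be sketched with bash's built-in /dev/tcp redirection. The function name check_port is hypothetical, and the firewall-cmd lines in the comments assume firewalld is the active firewall (the CentOS 7 default), which the question does not confirm.

```shell
#!/usr/bin/env bash
# check_port HOST PORT [TIMEOUT_SECONDS]
# Succeeds (exit 0) if a TCP connection to HOST:PORT can be opened
# within the timeout (default 3 seconds); fails otherwise.
check_port() {
    local host=$1 port=$2 wait=${3:-3}
    # The child bash receives host as $0 and port as $1.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example usage on the worker (192.168.140.12 is the manager from the question):
#   check_port 192.168.140.12 2377 && echo "2377 reachable" || echo "2377 blocked"
#
# If the port is blocked and firewalld is in use on the manager (assumption):
#   sudo firewall-cmd --permanent --add-port=2377/tcp
#   sudo firewall-cmd --permanent --add-port=7946/tcp --add-port=7946/udp
#   sudo firewall-cmd --permanent --add-port=4789/udp
#   sudo firewall-cmd --reload
```

Port 2377/tcp carries cluster management traffic, 7946/tcp and 7946/udp carry node-to-node communication, and 4789/udp carries overlay network traffic. Only 2377 matters for the join itself.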

Answer 4:

I had the same problem and solved it by syncing each worker node's date with the master node's date.

pi@workernode$ sudo date --set="$(ssh username@masternode date)"

After this, try to join the worker node again, and it should work.
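To see whether clock drift is a plausible cause before changing the date, the two clocks can be compared first. This is a rough sketch: the helper name clock_skew is hypothetical, and it assumes SSH access from the worker to the master as in the command above.

```shell
#!/usr/bin/env bash
# clock_skew EPOCH_A EPOCH_B
# Prints the absolute difference in seconds between two Unix timestamps.
clock_skew() {
    local d=$(( $1 - $2 ))
    # Strip a leading minus sign to get the absolute value.
    echo "${d#-}"
}

# Example usage on the worker node:
#   local_now=$(date +%s)
#   master_now=$(ssh username@masternode date +%s)
#   echo "Clock skew: $(clock_skew "$local_now" "$master_now") seconds"
```

A large skew likely matters because Swarm nodes authenticate each other with TLS certificates, which carry validity timestamps. Setting the date once fixes the symptom; running a time synchronization service such as chronyd or ntpd on every node would probably be the more durable fix.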

Answer 5:

If none of the other solutions worked, try disabling the firewall on the master and see if it works.

Answer 6:

I had a similar problem. I checked that there is no proxy configured and no firewall in the way but "docker swarm join" did not work. I noticed that there were managers listed in "docker info" that were no longer present. "docker node ls" did not show anything strange.

Eventually I solved my issue by running "service docker restart" on the node addressed by "docker swarm join" (the manager). Apparently some internal bookkeeping within the Docker daemon had gotten out of sync.

Source: https://stackoverflow.com/questions/38825792/node-cannot-join-swarm-cluster

Tags

Docker

docker-swarm