2 minutes for ZMQ pub/sub to connect in Kubernetes

Question

I have a Kubernetes 1.18 cluster using Weave as my CNI. I have a ZMQ-based pub/sub app, and I often (but not always) see it take 2 minutes before the subscriber can receive messages from the publisher. This seems to be some sort of socket timeout unique to my Kubernetes environment.

Here is my trivial ZMQ app example

#!/usr/bin/env python2
import zmq, sys, time, argparse, logging, datetime, threading
from zmq.utils.monitor import recv_monitor_message

FORMAT = '%(asctime)-15s %(message)s'
logging.basicConfig(format=FORMAT)

logging.error("libzmq-%s" % zmq.zmq_version())
# Socket monitoring requires libzmq 4.0 or newer.
if zmq.zmq_version_info() < (4, 0):
    raise RuntimeError("monitoring in libzmq version < 4.0 is not supported")

EVENT_MAP = {}
logging.error("Event names:")
for name in dir(zmq):
    if name.startswith('EVENT_'):
        value = getattr(zmq, name)
        logging.error("%21s : %4i" % (name, value))
        EVENT_MAP[value] = name


def event_monitor(monitor):
    while monitor.poll():
        evt = recv_monitor_message(monitor)
        evt.update({'description': EVENT_MAP[evt['event']]})
        logging.error("Event: {}".format(evt))
        if evt['event'] == zmq.EVENT_MONITOR_STOPPED:
            break
    monitor.close()
    logging.error("event monitor thread done!")

parser = argparse.ArgumentParser("Simple zmq pubsub example")
parser.add_argument("pub_or_sub", help="Either pub or sub")
parser.add_argument("host", help="host address to connect to if sub otherwise the address to bind to")
parser.add_argument("--port", "-p", type=int, help="The port to use", default=4567)
args = parser.parse_args()

context = zmq.Context()

if args.pub_or_sub.lower() == "sub":
    zmq_socket = context.socket(zmq.SUB)
    monitor = zmq_socket.get_monitor_socket()
    t = threading.Thread(target=event_monitor, args=(monitor,))
    t.setDaemon(True)
    t.start()

    zmq_socket.setsockopt(zmq.SUBSCRIBE, "")
    zmq_socket.connect("tcp://{}:{}".format(args.host, args.port))
    while 1:
        if zmq_socket.poll(timeout=1000):
            logging.error("Received: {}".format(zmq_socket.recv()))
        else:
            logging.error("No message available")
elif args.pub_or_sub.lower() == "pub":
    zmq_socket = context.socket(zmq.PUB)
    monitor = zmq_socket.get_monitor_socket()
    t = threading.Thread(target=event_monitor, args=(monitor,))
    t.setDaemon(True)
    t.start()
    zmq_socket.bind("tcp://{}:{}".format(args.host, args.port))
    i = 0
    while 1:
        logging.error("Sending message: {}".format(i))
        zmq_socket.send("Message {} at {}".format(i, datetime.datetime.now()))
        i += 1
        time.sleep(1.0)
else:
    raise RuntimeError("Needs to either be sub or pub nothing else allowed")

Here is how I am deploying it within Kubernetes:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pub-deployment
  labels:
    app: pub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pub
  template:
    metadata:
      labels:
        app: pub
    spec:
      containers:
      - name: pub
        image: bagoulla/zmq:latest
        command: ["pubsub", "pub", "0.0.0.0"]
---
apiVersion: v1
kind: Service
metadata:
  name: pub
spec:
  selector:
    app: pub
  ports:
    - protocol: TCP
      port: 4567
      targetPort: 4567
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sub-deployment
  labels:
    app: sub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sub
  template:
    metadata:
      labels:
        app: sub
    spec:
      containers:
      - name: sub
        image: bagoulla/zmq:latest
        command: ["pubsub", "sub", "pub"]

What I would expect to see from the subscriber, and I do see when running outside of Kubernetes on the same host (though still in Docker), is the following repeated in quick succession until the pub container is ready and routed:

2020-08-16 08:12:09,141 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
2020-08-16 08:12:09,141 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 128, 'value': 12, 'description': 'EVENT_CLOSED'}
2020-08-16 08:12:09,142 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 4, 'value': 183, 'description': 'EVENT_CONNECT_RETRIED'}
2020-08-16 08:12:09,328 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}

However what I see in Kubernetes instead is:

2020-08-16 05:54:51,724 Event: {'endpoint': 'tcp://pub:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
.... 2 minutes later....
2020-08-16 05:56:59,038 No message available
2020-08-16 05:56:59,056 Event: {'endpoint': 'tcp://pub:4567', 'event': 128, 'value': 12, 'description': 'EVENT_CLOSED'}
2020-08-16 05:56:59,056 Event: {'endpoint': 'tcp://pub:4567', 'event': 4, 'value': 183, 'description': 'EVENT_CONNECT_RETRIED'}
2020-08-16 05:56:59,243 Event: {'endpoint': 'tcp://pub:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
2020-08-16 05:56:59,245 Event: {'endpoint': 'tcp://pub:4567', 'event': 1, 'value': 12, 'description': 'EVENT_CONNECTED'}
2020-08-16 05:56:59,286 Received: Message 127 at 2020-08-16 05:56:59.286036

Clearly something within Kubernetes is preventing the "EVENT_CLOSED" event from occurring in a timely manner. What could this be?

Answer 1:

The issue is that when the service comes up, it essentially creates a TCP black hole: TCP connections can be started but never finish connecting. Clients should set a timeout on TCP connections so that they keep retrying until the underlying deployment or pod is up and routed properly. For ZMQ this can be done with the ZMQ_CONNECT_TIMEOUT socket option.
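
As a rough illustration of that suggestion, here is a minimal sketch of how the subscriber side of the script in the question might set the option through pyzmq. The helper name make_subscriber and the 2000 ms value are assumptions chosen for the example, not something from the original post. ZMQ_CONNECT_TIMEOUT is given in milliseconds, needs libzmq 4.2 or newer, and has to be set before connect() is called.

```python
import zmq


def make_subscriber(context, host, port, connect_timeout_ms=2000):
    """Create a SUB socket that gives up on a stalled TCP connect after a timeout.

    make_subscriber and the 2000 ms default are hypothetical, for illustration only.
    """
    sock = context.socket(zmq.SUB)
    # ZMQ_CONNECT_TIMEOUT is in milliseconds; the default of 0 leaves the
    # timeout to the operating system. It must be set before connect().
    sock.setsockopt(zmq.CONNECT_TIMEOUT, connect_timeout_ms)
    # Subscribe to every message (empty prefix), as in the question's script.
    sock.setsockopt(zmq.SUBSCRIBE, b"")
    # connect() returns immediately; libzmq connects and retries in the background.
    sock.connect("tcp://{}:{}".format(host, port))
    return sock
```

In the question's script this would replace the three lines that create, subscribe and connect the SUB socket, for example zmq_socket = make_subscriber(context, args.host, args.port). With a timeout in place, a connect attempt that hangs should be abandoned after roughly that interval and then retried by libzmq's normal reconnect logic, rather than waiting for the operating system's own connect timeout.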

Source: https://stackoverflow.com/questions/63430835/2-minutes-for-zmq-pub-sub-to-connect-in-kubernetes

Tags

Kubernetes

zeromq

pyzmq

weave