Question
I have a Kubernetes 1.18 cluster using Weave as my CNI. I have a ZMQ-based pub/sub app, and I am often (not always) seeing it take 2 minutes before the subscriber can receive messages from the publisher. This seems to be some sort of socket timeout unique to my Kubernetes environment.
Here is my trivial ZMQ app example:
#!/bin/env python2
import zmq, sys, time, argparse, logging, datetime, threading
from zmq.utils.monitor import recv_monitor_message

FORMAT = '%(asctime)-15s %(message)s'
logging.basicConfig(format=FORMAT)

if zmq.zmq_version_info() < (4, 0):
    raise RuntimeError("monitoring in libzmq version < 4.0 is not supported")

logging.error("libzmq-%s" % zmq.zmq_version())

# Map numeric event codes to their names so monitor events are readable.
EVENT_MAP = {}
logging.error("Event names:")
for name in dir(zmq):
    if name.startswith('EVENT_'):
        value = getattr(zmq, name)
        logging.error("%21s : %4i" % (name, value))
        EVENT_MAP[value] = name


def event_monitor(monitor):
    # Log every socket event until the monitor is stopped.
    while monitor.poll():
        evt = recv_monitor_message(monitor)
        evt.update({'description': EVENT_MAP[evt['event']]})
        logging.error("Event: {}".format(evt))
        if evt['event'] == zmq.EVENT_MONITOR_STOPPED:
            break
    monitor.close()
    logging.error("event monitor thread done!")


parser = argparse.ArgumentParser("Simple zmq pubsub example")
parser.add_argument("pub_or_sub", help="Either pub or sub")
parser.add_argument("host", help="host address to connect to if sub otherwise the address to bind to")
parser.add_argument("--port", "-p", type=int, help="The port to use", default=4567)
args = parser.parse_args()

context = zmq.Context()

if args.pub_or_sub.lower() == "sub":
    zmq_socket = context.socket(zmq.SUB)
    monitor = zmq_socket.get_monitor_socket()
    t = threading.Thread(target=event_monitor, args=(monitor,))
    t.setDaemon(True)
    t.start()
    zmq_socket.setsockopt(zmq.SUBSCRIBE, "")
    zmq_socket.connect("tcp://{}:{}".format(args.host, args.port))
    while 1:
        if zmq_socket.poll(timeout=1000):
            logging.error("Received: {}".format(zmq_socket.recv()))
        else:
            logging.error("No message available")
elif args.pub_or_sub.lower() == "pub":
    zmq_socket = context.socket(zmq.PUB)
    monitor = zmq_socket.get_monitor_socket()
    t = threading.Thread(target=event_monitor, args=(monitor,))
    t.setDaemon(True)
    t.start()
    zmq_socket.bind("tcp://{}:{}".format(args.host, args.port))
    i = 0
    while 1:
        logging.error("Sending message: {}".format(i))
        zmq_socket.send("Message {} at {}".format(i, datetime.datetime.now()))
        i += 1
        time.sleep(1.0)
else:
    raise RuntimeError("Needs to either be sub or pub nothing else allowed")
Here is how I am deploying it within Kubernetes:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pub-deployment
  labels:
    app: pub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pub
  template:
    metadata:
      labels:
        app: pub
    spec:
      containers:
        - name: pub
          image: bagoulla/zmq:latest
          command: ["pubsub", "pub", "0.0.0.0"]
---
apiVersion: v1
kind: Service
metadata:
  name: pub
spec:
  selector:
    app: pub
  ports:
    - protocol: TCP
      port: 4567
      targetPort: 4567
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sub-deployment
  labels:
    app: sub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sub
  template:
    metadata:
      labels:
        app: sub
    spec:
      containers:
        - name: sub
          image: bagoulla/zmq:latest
          command: ["pubsub", "sub", "pub"]
What I would expect to see from the subscriber, and I do see when running outside of Kubernetes on the same host (though still in Docker), is the following repeated in quick succession until the pub container is ready and routed:
2020-08-16 08:12:09,141 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
2020-08-16 08:12:09,141 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 128, 'value': 12, 'description': 'EVENT_CLOSED'}
2020-08-16 08:12:09,142 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 4, 'value': 183, 'description': 'EVENT_CONNECT_RETRIED'}
2020-08-16 08:12:09,328 Event: {'endpoint': 'tcp://127.0.0.1:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
However, what I see in Kubernetes instead is:
2020-08-16 05:54:51,724 Event: {'endpoint': 'tcp://pub:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
.... 2 minutes later....
2020-08-16 05:56:59,038 No message available
2020-08-16 05:56:59,056 Event: {'endpoint': 'tcp://pub:4567', 'event': 128, 'value': 12, 'description': 'EVENT_CLOSED'}
2020-08-16 05:56:59,056 Event: {'endpoint': 'tcp://pub:4567', 'event': 4, 'value': 183, 'description': 'EVENT_CONNECT_RETRIED'}
2020-08-16 05:56:59,243 Event: {'endpoint': 'tcp://pub:4567', 'event': 2, 'value': 115, 'description': 'EVENT_CONNECT_DELAYED'}
2020-08-16 05:56:59,245 Event: {'endpoint': 'tcp://pub:4567', 'event': 1, 'value': 12, 'description': 'EVENT_CONNECTED'}
2020-08-16 05:56:59,286 Received: Message 127 at 2020-08-16 05:56:59.286036
Clearly something within Kubernetes is preventing the "EVENT_CLOSED" event from occurring in a timely manner. What could this be?
Answer 1:
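The issue is that when the Service comes up, it essentially creates a TCP black hole: TCP connections can be initiated but never complete. Without an explicit timeout, such an attempt only fails when the kernel's own connect timeout expires, which on Linux with default settings is roughly two minutes, consistent with the delay in the log above. Clients should therefore set a timeout on TCP connection attempts so that they can retry until the underlying Deployment or pod is up and routed properly. For ZMQ, this can be done with the ZMQ_CONNECT_TIMEOUT socket option.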
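As a minimal sketch of that suggestion, the subscriber could set the option before calling connect(). This assumes pyzmq built against libzmq 4.2 or later, where the option is exposed as zmq.CONNECT_TIMEOUT. The helper name make_subscriber and the 1000 ms value are illustrative choices, not part of the original app.

```python
import zmq

# Assumed value for illustration; choose it based on how quickly a hung
# connection attempt should be abandoned.
CONNECT_TIMEOUT_MS = 1000


def make_subscriber(context, host, port, connect_timeout_ms=CONNECT_TIMEOUT_MS):
    """Hypothetical helper: SUB socket that gives up on a hung connect()."""
    sock = context.socket(zmq.SUB)
    # Set before connect() so the option applies to this connection.
    # The value is in milliseconds; 0 (the default) defers to the OS timeout.
    sock.setsockopt(zmq.CONNECT_TIMEOUT, connect_timeout_ms)
    sock.setsockopt(zmq.SUBSCRIBE, b"")
    sock.connect("tcp://{}:{}".format(host, port))
    return sock
```

With the timeout in place, a connection attempt that hangs should be closed after about one second, and ZMQ's normal reconnect logic then retries, which is the EVENT_CLOSED followed by EVENT_CONNECT_RETRIED pattern seen outside Kubernetes. In the question's script this amounts to adding a single setsockopt call before zmq_socket.connect(...).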
Source: https://stackoverflow.com/questions/63430835/2-minutes-for-zmq-pub-sub-to-connect-in-kubernetes