I have a program that consists of a master server and distributed slave servers. The slave servers send status updates to the server, and if the server hasn\'t heard from a
It turns out that the problem really was that the address was busy - the busyness was caused by some other problems in how we are handling network communications. Your inputs have helped me figure this out. Thank you.
EDIT: to be specific, the problems in handling our network communications were that these status updates would be constantly re-sent if the first failed. It was only a matter of time until we had every distributed slave trying to send its status update at the same time, which was over-saturating our network.
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_tw_recycle=1
Maybe SO_REUSEADDR helps here? http://www.unixguide.net/network/socketfaq/4.5.shtml
this is just a shot in the dark : when you call connect without a bind first, the system allocates your local port, and if you have multiple threads connecting and disconnecting it could possibly try to allocate a port already in use. the kernel source file inet_connection_sock.c hints at this condition. just as an experiment try doing a bind to a local port first, making sure each bind/connect uses a different local port number.