Question
I've encountered a strange bug with TCP sockets. It seems that SO_KEEPALIVE is enabled on all sockets by default.
I wrote a short test case that creates a socket and connects to a server. Immediately after the connect, I check SO_KEEPALIVE with getsockopt. The value is non-zero, which according to MSDN means keep-alive is enabled. Maybe I'm misunderstanding this.
I recently had a strange bug where a server disconnected twice in a row. Some clients were in a state where they had sent logon information and were waiting for a response. Even though there was an overlapped WSARecv posted on the socket connected to the server, no completion was posted to notify the client that the server had crashed, so I'm assuming the socket wasn't fully closed.
Roughly 2 hours later (actually about 1 hour, 59 minutes, and 19 seconds), a completion packet was posted for the read, notifying the client that the connection was no longer open. This is where I started to suspect SO_KEEPALIVE.
I'm trying to understand why this happened. It caused a bit of an issue because clients that lose their connection for any reason are supposed to reconnect to the server automatically; in this case, because no disconnect notification was delivered, the client didn't reconnect until 2 hours later.
An obvious fix is to add a timeout, but I'd like to know how this situation could occur.
SO_KEEPALIVE is not set on the socket by my application's server or client.
    // Error checking is removed for this snippet, but all Winsock calls succeed.
    #include <winsock2.h>
    #include <iostream>

    int main() {
        // Initialize Winsock 2.2.
        WORD wVersionRequested;
        WSADATA wsaData;
        int err;
        wVersionRequested = MAKEWORD(2, 2);
        err = WSAStartup(wVersionRequested, &wsaData);

        SOCKET foo = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP, 0, 0, 0);

        // Query SO_KEEPALIVE before connecting.
        DWORD optval;
        int optlen = sizeof(optval);
        int test = 0;
        test = getsockopt(foo, SOL_SOCKET, SO_KEEPALIVE, (char*)&optval, &optlen);
        std::cout << "Returned " << optval << std::endl;

        sockaddr_in clientService;
        clientService.sin_family = AF_INET;
        clientService.sin_addr.s_addr = inet_addr("127.0.0.1");
        clientService.sin_port = htons(446);
        connect(foo, (SOCKADDR*) &clientService, sizeof(clientService));

        // Query SO_KEEPALIVE again after connecting.
        test = getsockopt(foo, SOL_SOCKET, SO_KEEPALIVE, (char*)&optval, &optlen);
        std::cout << "Returned " << optval << std::endl;

        std::cin.get();
        return 0;
    }

    // Example output:
    // Returned 2883584
    // Returned 2883584
Answer 1:
First, run your test on a clean installation of the operating system in a VM. I suspect that something else you have installed has changed the keep-alive setting.
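While re-running the test, it may also be worth ruling out a simpler explanation for the printed value. In the snippet, optval is never initialized, so if getsockopt wrote fewer bytes than sizeof(DWORD) (this is only an assumption; checking optlen after the call would confirm or refute it), the remaining bytes would be whatever was already on the stack. The sketch below just does the arithmetic on the value from the question; lowByte is a hypothetical helper name, not a Winsock function.

```cpp
#include <cstdint>

// Hypothetical helper: returns the lowest-order byte of a 32-bit value.
// On a little-endian machine this is the byte at the start of the variable,
// i.e. the only byte a one-byte write into optval would have changed.
constexpr std::uint8_t lowByte(std::uint32_t value) {
    return static_cast<std::uint8_t>(value & 0xFFu);
}

// The value printed in the question, 2883584, is 0x002C0000 in hex.
static_assert(2883584u == 0x002C0000u, "decimal and hex forms match");

// Its lowest byte is zero; only the upper bytes are non-zero.
static_assert(lowByte(2883584u) == 0, "low byte of 0x002C0000 is zero");
```

If optlen comes back smaller than 4, initializing optval to 0 before the call and printing it again would show whether the option is really set.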
Secondly, I doubt that keep-alive being enabled is the cause of your problem. If keep-alive weren't enabled, you would never have received a connection-closure notification from that pending read. TCP is supposed to work like that: it allows intermediate routers to go away and come back without you knowing or caring. The only time you are informed of the failure is when you try to send and the connection is broken (or, in this case, when you try to send and the server has bounced). The fact that keep-alive was enabled means that at the 1 hr 59 min mark the TCP stack transmitted a keep-alive probe and noticed that the connection was down; the default keep-alive idle time on Windows is two hours, which matches that delay. If keep-alive weren't enabled, you would have had to wait until YOU transmitted something.
If your clients need to know when the connection goes down, it's better to ignore keep-alive completely (as you can see, it affects the whole machine even when you're not the one who enabled it, and to me that makes it a poor solution). If you can, add an application-level ping and/or timeout to your protocol. So, perhaps, every command expects a response within 30 seconds and you send a ping from the server every minute. You'll then find out about dead connections as quickly as you like, and you can disconnect and reconnect at that point.
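As a rough illustration of the ping/timeout idea, here is a minimal sketch of the bookkeeping involved, kept separate from any socket code. The class name LivenessTracker, its methods, and the intervals used are all assumptions made for this example; they are not part of Winsock or of any particular framework. The caller is expected to invoke poll() periodically (for instance from a timer) and act on the result.

```cpp
#include <chrono>

// Hypothetical helper: tracks when data last arrived on a connection and
// tells the caller whether to send a ping or give up and reconnect.
class LivenessTracker {
public:
    using Clock = std::chrono::steady_clock;
    enum class Action { None, SendPing, Reconnect };

    LivenessTracker(Clock::duration pingInterval,
                    Clock::duration readTimeout,
                    Clock::time_point now)
        : pingInterval_(pingInterval),
          readTimeout_(readTimeout),
          lastReceived_(now),
          lastPing_(now) {}

    // Call whenever any data (including a ping reply) is read from the peer.
    void onDataReceived(Clock::time_point now) { lastReceived_ = now; }

    // Call periodically; returns what the caller should do next.
    Action poll(Clock::time_point now) {
        // Nothing heard for too long: treat the connection as dead.
        if (now - lastReceived_ >= readTimeout_) {
            return Action::Reconnect;
        }
        // Time for the next application-level ping.
        if (now - lastPing_ >= pingInterval_) {
            lastPing_ = now;
            return Action::SendPing;
        }
        return Action::None;
    }

private:
    Clock::duration pingInterval_;
    Clock::duration readTimeout_;
    Clock::time_point lastReceived_;
    Clock::time_point lastPing_;
};
```

With, say, a 60-second ping interval and a 90-second read timeout, a crashed peer would be noticed within about a minute and a half instead of two hours. On Action::Reconnect the caller would abort the socket and run the same reconnect path it uses for any other disconnect.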
I've used this approach successfully with my server framework; in fact, I have a standard 'async read timeout' connection filter and a 'connection re-establishment' filter which make it trivial to ensure that the connections are always live. All the read timeout does is abort the existing connection; the connection re-establishment code then kicks in to recreate the connection, just as it would if the connection had been closed for any other reason.
Source: https://stackoverflow.com/questions/4923586/windows-tcp-socket-has-so-keepalive-enabled-by-default