I have a web-based application and a client, both written in Java. For what it's worth, the client and server are both on Windows. The client issues HTTP GETs via Apache HttpClient.
Forgetting to flush or close the socket on the host side can intermittently have this effect for short responses, depending on timing, which in turn could be affected by the presence of any monitoring mechanism.
Especially forgetting to close will leave the socket dangling until GC gets around to reclaiming it and calls finalize().
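If the server side manages its sockets by hand (the mention of finalize() suggests it might), the fix is simply to flush and close explicitly once the response has been written. A minimal sketch, assuming a raw ServerSocket handler; the port and body are placeholders and request parsing is omitted:

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class FlushAndCloseServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080)) {   // placeholder port
            while (true) {
                Socket client = server.accept();               // request parsing omitted for brevity
                try {
                    byte[] body = "OK".getBytes(StandardCharsets.US_ASCII);
                    OutputStream out = client.getOutputStream();
                    out.write(("HTTP/1.1 200 OK\r\n"
                            + "Content-Length: " + body.length + "\r\n"
                            + "Connection: close\r\n\r\n").getBytes(StandardCharsets.US_ASCII));
                    out.write(body);
                    out.flush();        // push the short response onto the wire now
                } finally {
                    client.close();     // release the socket immediately; don't wait for GC/finalize()
                }
            }
        }
    }
}
```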
Could these computers have a virus or malware installed? Installing Wireshark also installs WinPcap (http://www.winpcap.org/), which may be overriding the changes the malware made (or the malware may simply detect that it is being monitored and not attempt anything fishy).
If you are losing data, it is most likely due to a software bug, either in the reading or writing library.
I haven't seen this one per se, but I have seen similar problems with large UDP datagrams causing IP fragmentation, which led to congestion and ultimately dropped Ethernet frames. Since this is TCP, a stream-based protocol, I wouldn't expect IP fragmentation to be a large issue.
One thing that I will note is that TCP does not guarantee delivery! It can't. What it does guarantee is that if you send byte A followed by byte B, then you will never receive byte B before you have received byte A.
With that said, I would connect the client machine and a monitoring machine to a hub. Run Wireshark on the monitoring machine and you should be able to see what is going on. I did run into problems related to both whitespace handling between HTTP requests and incorrect HTTP chunk sizes. Both issues were due to a hand-written HTTP stack, so this is only a problem if you are using a flaky stack.
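For reference, here is roughly what a correctly framed chunked response looks like on the wire; a hand-written stack that miscounts the chunk size (the hex byte count of the chunk data only, excluding the CRLFs) or omits the terminating zero-length chunk will confuse the client. The headers and body text are illustrative only:

```java
public class ChunkedResponseExample {
    public static void main(String[] args) {
        String response =
                "HTTP/1.1 200 OK\r\n" +
                "Transfer-Encoding: chunked\r\n" +
                "\r\n" +
                "5\r\n" +        // 5 = hex byte count of the chunk data ("Hello")
                "Hello\r\n" +
                "6\r\n" +        // 6 bytes follow (" world")
                " world\r\n" +
                "0\r\n" +        // zero-length chunk marks the end of the body
                "\r\n";
        System.out.print(response);
    }
}
```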
Are you absolutely sure that the server has successfully sent the response to the clients that seem to fail? By this I mean the server has sent the response and the client has ack'ed that response back to the server. You should see this using Wireshark on the server side. If you are sure this has occurred on the server side and the client still does not see anything, you need to look further up the chain from the server. Are there any proxy/reverse proxy servers or NAT involved?
TCP is considered a reliable protocol, but it does not guarantee delivery. The TCP/IP stack of your OS will try pretty hard to get packets to the other end using TCP retransmissions. You should see these in Wireshark on the server side if this is happening. If you see excessive TCP retransmissions, it is usually a network infrastructure issue - i.e. bad or misconfigured hardware/interfaces. TCP retransmission works well for short network interruptions, but performs poorly on a network with a longer interruption. This is because the TCP/IP stack will only send a retransmission after a timer expires, and this timer typically doubles after each unsuccessful retransmission. This is by design, to avoid flooding an already problematic network with retransmissions. As you might imagine, it usually causes applications all sorts of timeout issues.
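To put rough numbers on that, here is a back-of-the-envelope sketch. The 1-second initial timeout and plain doubling are assumptions (real stacks derive the initial RTO from the measured round-trip time and cap the backoff), but it shows how a handful of retransmissions already amounts to minutes of silence:

```java
public class RetransmissionBackoff {
    public static void main(String[] args) {
        double rto = 1.0;       // assumed initial retransmission timeout, in seconds
        double elapsed = 0.0;
        for (int attempt = 1; attempt <= 8; attempt++) {
            elapsed += rto;
            System.out.printf("after retransmission %d: ~%.0f s of silence%n", attempt, elapsed);
            rto *= 2;           // timer doubles after each unsuccessful retransmission
        }
        // after 8 retransmissions the connection has been quiet for over 4 minutes
    }
}
```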
Depending on your network topology, you may also need to place probes/wireshark/tcpdump at other intermediate locations in the network. This will probably take some time to find out where the packets have gone.
If I were you, I would keep monitoring with Wireshark on all ends until the problem re-occurs. It most likely will. But it sounds like what you will ultimately find is what you already mentioned - flaky hardware. If fixing the flaky hardware is out of the question, you may need to build in extra application-level timeouts and retries to deal with the issue in software. It sounds like you have already started going down this path.
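As a starting point for that, here is a minimal sketch of client-side timeouts plus a simple retry loop, assuming Apache HttpClient 4.3+ (which the question mentions using); the timeout values and retry count are placeholders to be tuned against the server's own timeout:

```java
import java.io.IOException;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class RetryingGet {
    public static String get(String url, int maxAttempts) throws IOException {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5_000)     // fail fast if the connection cannot be established
                .setSocketTimeout(60_000)     // give up if no data arrives for 60 s (placeholder value)
                .build();

        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build()) {
            IOException lastFailure = new IOException("no attempts made");
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                    return EntityUtils.toString(response.getEntity());
                } catch (IOException e) {     // covers read timeouts and connection resets
                    lastFailure = e;
                }
            }
            throw lastFailure;                // all attempts failed; let the caller decide what to do
        }
    }
}
```

Retrying blindly like this is only reasonable because GETs are idempotent; a POST would need more care.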
If you are using long running GETs, you should timeout on the client side at twice the server timeout, as you have discovered.
On a TCP connection where the client sends a message and expects a response, if the server were to crash and restart (let's say, for the sake of example), the client would still be waiting on the socket for a response from the server, yet the server is no longer listening on that socket.
The client will only discover that the socket is closed on the server end once it sends more data on that socket and the server rejects the new data and closes the connection.
This is why you should have client side time-outs on requests.
But since your server is not crashing: if the server is multi-threaded and the thread's socket for that client is closed, but at that moment (for a duration of minutes) the client has a connectivity outage, then the closing handshake may be lost. Since the client is not sending any more data to the server, it is once again left hanging. This would tie in with your flaky-connection observation.
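Tying the timeout point together: a read timeout on the client socket turns that silent hang into an exception you can act on. A minimal sketch with plain java.net sockets; the host, port and 30-second value are placeholders (with HttpClient, the equivalent knob is the socket/read timeout shown earlier):

```java
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimedRead {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("server.example.com", 8080)) {  // placeholder host/port
            socket.setSoTimeout(30_000);   // SO_TIMEOUT: a blocked read() waits at most 30 s
            // (writing the request is omitted for brevity)
            InputStream in = socket.getInputStream();
            int firstByte = in.read();     // throws SocketTimeoutException instead of hanging forever
            System.out.println("first byte of response: " + firstByte);
        } catch (SocketTimeoutException e) {
            System.err.println("No response within 30 s - treat the request as failed and retry");
        }
    }
}
```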