I have a web-based application and a client, both written in Java. For what it\'s worth, the client and server are both on Windows. The client issues HTTP GETs via Apache HttpC
Are you absolutely sure that the server has successfully sent the response to the clients that seem to fail? By this I mean the server has sent the response and the client has ack'ed that response back to the server. You should see this using wireshark on the server side. If you are sure this has occured on the server side and the client still does not see anything, you need to look further up the chain from the server. Are there any proxy/reverse proxy servers or NAT involved?
The TCP transport is considered to be a reliable protocol, but it does not guarantee delivery. The TCP/IP stack of your OS will try pretty hard to get packets to the other end using TCP retransmissions. You should see these in wireshark on the server side if this is happening. If you see excessive TCP retransmissions, it is usually a network infrastructure issue - i.e. bad or misconfigured hardware/interfaces. TCP retransmissions works great for short network interruptions, but performs poorly on a network with a longer interruption. This is because the TCP/IP stack will only send retransmissions after a timer expires. This timer typically doubles after each unsuccessful retransmission. This is by design to avoid flooding an already problematic network with retransmissions. As you might imagine, this usually causes applications all sorts of timeout issues.
Depending on your network topology, you may also need to place probes/wireshark/tcpdump at other intermediate locations in the network. This will probably take some time to find out where the packets have gone.
If I were you I would keep monitoring with wireshark on all ends until the problem re-occurs. It mostly likely will. But, it sounds like what you will ultimately find is what you already mentioned - flaky hardware. If fixing the flaky hardware is out of the question, you may need to just build in extra application level timeouts and retries to attempt to deal with the issue in software. It sounds like you started going down this path.