Short description: I'm trying to get a ZuulProxy to handle instance failover, but it throws ZuulException: Forwarding error instead of responding with a result.
1/ TIMEOUT
Zuul requests are monitored by Hystrix whose purpose (in that application) is to apply timeouts on long running requests.
Hystrix provides two different ways to execute commands and enforce timeouts: SEMAPHORE and THREAD execution isolation.
When THREAD isolation is used, Hystrix commands are executed on a separate thread taken from a thread pool. Hystrix then "pauses" the thread holding the incoming request until a response is received from the downstream server or a timeout occurs.
When SEMAPHORE isolation is used, Hystrix commands are executed on the request thread. Timeouts are detected only after a response is received from the downstream server. So if you configure Zuul/Hystrix with a timeout of 5s and your service takes 30s to complete, your client will be notified of the timeout only after 30s - even if the service responded successfully (!)
Netflix recommends THREAD execution by default except in some rare cases. Unfortunately, the Spring Cloud Zuul integration changed it to SEMAPHORE for reasons unknown to me. See "Why is ZUUL forcing a SEMAPHORE isolation to execute its Hystrix commands?" for more information.
This explains why you receive a 500 error although the remaining live server was successfully contacted.
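To make the difference concrete, here is a minimal sketch that mimics the two behaviours described above using only the JDK. It is not Hystrix code: the class IsolationTimeoutDemo and its methods are hypothetical names made up for illustration, and the real library is more involved.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class (not part of Hystrix): imitates how each isolation
// mode, as described above, lets the caller find out about a timeout.
final class IsolationTimeoutDemo {

    // Whether the call counted as timed out, and how long the caller was blocked.
    static final class Outcome {
        final boolean timedOut;
        final long callerWaitedMs;

        Outcome(boolean timedOut, long callerWaitedMs) {
            this.timedOut = timedOut;
            this.callerWaitedMs = callerWaitedMs;
        }
    }

    // SEMAPHORE-like: the call runs on the caller's thread, so the deadline
    // can only be checked once the call has returned.
    static Outcome semaphoreStyle(Callable<String> call, long timeoutMs) {
        long start = System.nanoTime();
        try {
            call.call();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Outcome(waited > timeoutMs, waited);
    }

    // THREAD-like: the call runs on a pool thread and the caller stops
    // waiting as soon as the deadline passes.
    static Outcome threadStyle(Callable<String> call, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        Future<String> future = pool.submit(call);
        boolean timedOut = false;
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            timedOut = true;
            future.cancel(true); // interrupt the worker, the caller moves on
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Outcome(timedOut, waited);
    }
}
```

With a call that sleeps well past the timeout, both methods report a timeout, but semaphoreStyle keeps the caller blocked for the whole call while threadStyle releases it shortly after the deadline.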
2/ RETRY
Ribbon is used to make the actual call to the remote service. It uses information provided by Eureka to determine the available services and their addresses. Eureka uses a local cache that is updated every 30 seconds. So as @spencergibb said, it is likely to hold obsolete information for a while (dead server) - but this is expected.
Ribbon automatically retries when it fails to connect to or contact a service. It can be configured to retry the same server a couple of times before trying another. I don't remember the default values nor the actual configuration property, but personally I have been using the following settings:
# Max number of retries on the same server (excluding the first try)
ribbon.MaxAutoRetries = 1
# Max number of next servers to retry (excluding the first server)
ribbon.MaxAutoRetriesNextServer = 2
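As a rough illustration of what these two settings mean together, the sketch below walks through the order of attempts: retry the same server first, then move on to the next one. It is a simplification written for this answer, not Ribbon code. RetryOrderSketch, the server names and the isUp check are all hypothetical, and the real load balancer chooses the "next" server itself rather than walking a fixed list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch (not Ribbon's implementation): lists the attempts made
// for one request, given the two retry settings.
final class RetryOrderSketch {

    static List<String> attempts(List<String> servers,
                                 int maxAutoRetries,
                                 int maxAutoRetriesNextServer,
                                 Predicate<String> isUp) {
        List<String> log = new ArrayList<>();
        // The first server plus up to maxAutoRetriesNextServer others.
        int serversToTry = Math.min(servers.size(), maxAutoRetriesNextServer + 1);
        for (int s = 0; s < serversToTry; s++) {
            String server = servers.get(s);
            // The first try plus up to maxAutoRetries retries on the same server.
            for (int attempt = 0; attempt <= maxAutoRetries; attempt++) {
                log.add(server);
                if (isUp.test(server)) {
                    return log; // success, stop retrying
                }
            }
        }
        return log; // every attempt failed
    }
}
```

With one dead and one live server and the settings above (1 and 2), the dead server is tried twice before the live one answers, so the request succeeds on the third attempt.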
3/ CONNECT TIMEOUT
From your logs it appears that it takes about 1s for the connect attempt to the remote service to fail. This is very long for a stopped service. Attempts to connect to a TCP port with no service listening should fail immediately (at least if the host/ip is reachable and the connect attempt doesn't end in the void)...
The connect timeout is controlled by the following property - make sure you set it to a decent value:
# Connect timeout used by Apache HttpClient
ribbon.ConnectTimeout=3000
# Read timeout used by Apache HttpClient
ribbon.ReadTimeout=5000
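The timeouts and the retry settings interact: if the Hystrix timeout is shorter than the time Ribbon may need to work through its retries, the request can be failed before the live server is reached. The sketch below estimates that worst case, assuming every attempt uses up the full connect plus read timeout. The formula is a rule of thumb and RetryBudget is a hypothetical helper, not an API of Ribbon or Hystrix.

```java
// Hypothetical helper (not a Ribbon or Hystrix API): rough upper bound on the
// time one request can spend in Ribbon, retries included.
final class RetryBudget {

    static long worstCaseMillis(long connectTimeoutMs,
                                long readTimeoutMs,
                                int maxAutoRetries,
                                int maxAutoRetriesNextServer) {
        // Assumption: each attempt may consume both timeouts in full.
        long perAttempt = connectTimeoutMs + readTimeoutMs;
        // First try plus retries on the same server.
        long attemptsPerServer = maxAutoRetries + 1L;
        // First server plus the "next" servers.
        long servers = maxAutoRetriesNextServer + 1L;
        return perAttempt * attemptsPerServer * servers;
    }
}
```

With the values from this answer (3000, 5000, 1 and 2) this gives (3000 + 5000) * 2 * 3 = 48000 ms, so a Hystrix timeout of a few seconds would fire long before all retries had a chance to run.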
Hope this information helps you to troubleshoot your problem ;-)