Spring Cloud Zuul retry when instance is down


Question


Using Spring Cloud Angel.SR6:

Here is the configuration of my Spring Boot app with @EnableZuulProxy:

server.port=8765

ribbon.ConnectTimeout=500
ribbon.ReadTimeout=5000
ribbon.MaxAutoRetries=1
ribbon.MaxAutoRetriesNextServer=1
ribbon.OkToRetryOnAllOperations=true

zuul.routes.service-id.retryable=true

I have 2 instances of service-id running on random ports. These instances, as well as the Zuul instance, register successfully with Eureka, and I can reach RESTful endpoints on the two service-id instances through http://localhost:8765/service-id/...., with requests balanced between them in round-robin fashion.

I would like to kill one of the service-id instances and, when that defunct instance is next in line for forwarding, have Zuul attempt to contact it, fail, and retry with the other instance.

Is this possible, or am I misreading the documentation? With the configuration above, the request 'destined' for the defunct instance fails with a 500 Forwarding error. From the Zuul stack trace:

com.netflix.zuul.exception.ZuulException: Forwarding error
    at org.springframework.cloud.netflix.zuul.filters.route.RibbonRoutingFilter.forward(RibbonRoutingFilter.java:140)

....

Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: service-idRibbonCommand timed-out and no fallback available

The subsequent request succeeds as expected. This behavior continues until the defunct instance is removed from Zuul's registry.

EDIT: Updated to Brixton.M5. No change in behavior. Here's the Hystrix exception in more detail:

Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: service-id timed-out and no fallback available.
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:806) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:790) ~[hystrix-core-1.4.23.jar:1.4.23]
    at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:99) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1521) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1411) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:314) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:306) ~[hystrix-core-1.4.23.jar:1.4.23]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:162) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable$2.call(Observable.java:154) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.Observable.unsafeSubscribe(Observable.java:7710) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:100) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at rx.internal.operators.OperatorDoOnEach$1.onError(OperatorDoOnEach.java:70) ~[rxjava-1.0.14.jar:1.0.14]
    at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$1.run(AbstractCommand.java:958) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:41) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable$1.call(HystrixContextRunnable.java:37) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.strategy.concurrency.HystrixContextRunnable.run(HystrixContextRunnable.java:57) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.tick(AbstractCommand.java:978) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.util.HystrixTimer$1.run(HystrixTimer.java:100) ~[hystrix-core-1.4.23.jar:1.4.23]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_66]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_66]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_66]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
    ... 1 common frames omitted

Caused by: java.util.concurrent.TimeoutException: null
    at com.netflix.hystrix.AbstractCommand$9.call(AbstractCommand.java:601) ~[hystrix-core-1.4.23.jar:1.4.23]
    at com.netflix.hystrix.AbstractCommand$9.call(AbstractCommand.java:581) ~[hystrix-core-1.4.23.jar:1.4.23]
    at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$1.onError(OperatorOnErrorResumeNextViaFunction.java:99) ~[rxjava-1.0.14.jar:1.0.14]
    ... 15 common frames omitted

Answer 1:


I had the same problem. This solved it for me:

According to this article, Ribbon only retries when the HTTP client is set to Ribbon's RestClient. By default, Ribbon uses the Apache HTTP client, which does not retry requests.
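If you want Ribbon-side retries on the setup described in the question, the Spring Cloud Netflix reference docs describe switching back to the deprecated RestClient so that the MaxAutoRetries* settings are applied. A minimal sketch; the ribbon.restclient.enabled flag is taken from those docs, and whether it behaves this way on the exact Angel.SR6/Brixton.M5 releases in the question is an assumption:

# Assumed property from the Spring Cloud Netflix docs: switch Ribbon back to its
# deprecated RestClient, which honors MaxAutoRetries/MaxAutoRetriesNextServer
ribbon.restclient.enabled=true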

Since Ribbon's RestClient is deprecated, you should consider using Spring Retry (https://github.com/spring-projects/spring-retry) instead.
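As I understand it, in newer Spring Cloud releases (Camden and later) simply having org.springframework.retry:spring-retry on the classpath activates retry support in the Ribbon load balancer, and the retry settings stay where the question already has them. A hedged sketch of the relevant properties under that assumption:

# With spring-retry on the classpath (newer releases), these are honored:
zuul.retryable=true
# or per route:
zuul.routes.service-id.retryable=true
ribbon.MaxAutoRetries=1
ribbon.MaxAutoRetriesNextServer=1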

Keep in mind that when you configure retries on Ribbon, you also have to account for Zuul's Hystrix timeouts.
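As a rule of thumb, the Hystrix command timeout has to cover the worst-case Ribbon budget, roughly (ConnectTimeout + ReadTimeout) * (MaxAutoRetries + 1) * (MaxAutoRetriesNextServer + 1); with the question's values that is (500 + 5000) * 2 * 2 = 22000 ms. A sketch using the default command key (an assumption; Zuul may also use a per-route command key):

# Give Hystrix enough room for all Ribbon attempts:
# (500 + 5000) * (1 + 1) * (1 + 1) = 22000 ms
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds=22000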




Answer 2:


Ribbon uses the services registered in Eureka, so it is up to Eureka to update the service status and tell callers which servers are available.

In my understanding, when one server is down there are two ways to find out:
1. Wait for the Eureka server to update the service status. This update takes some time, 30 seconds by default.
2. Try the call, fail, and mark the instance as down (possibly confirming with the Eureka server later).

So, in your question, you said the first request fails and subsequent requests succeed. I think that is the expected behavior.
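If the ~30-second window is the main pain point, the Eureka lease/fetch intervals and Ribbon's server-list refresh can be tightened so a dead instance disappears from the registry faster. A sketch with illustrative values (not recommended defaults); the first two properties go on the service-id instances, the last two on the Zuul side:

# On the service-id instances (illustrative values):
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.instance.leaseExpirationDurationInSeconds=10
# On the Zuul/client side:
eureka.client.registryFetchIntervalSeconds=5
ribbon.ServerListRefreshInterval=5000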



Source: https://stackoverflow.com/questions/35751630/spring-cloud-zuul-retry-when-instance-is-down
