How to automatically stop rolling update when CrashLoopBackOff?

↘锁芯ラ submitted on 2021-01-27 13:41:14

Question


I use Google Kubernetes Engine and I intentionally introduced an error in the code. I was hoping the rolling update would stop when it discovered the CrashLoopBackOff status, but it didn't.

On that page, they say:

The Deployment controller will stop the bad rollout automatically, and will stop scaling up the new ReplicaSet. This depends on the rollingUpdate parameters (maxUnavailable specifically) that you have specified.

But that's not happening. Does it only apply when the status is ImagePullBackOff?

Below is my configuration.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: volume-service
  labels:
    group: volume
    tier: service
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      maxSurge: 2
  template:
    metadata:
      labels:
        group: volume
        tier: service
    spec:
      containers:
      - name: volume-service
        image: gcr.io/example/volume-service:latest

P.S. I have already read about liveness/readiness probes, but I don't think they can stop a rolling update. Or can they?


Answer 1:


It turns out I just needed to set minReadySeconds, and the rolling update stops when pods in the new replicaSet have a status of CrashLoopBackOff or something like Exited with status code 1. The old replicaSet then stays available and is not scaled down.

Here is the new config.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: volume-service
  labels:
    group: volume
    tier: service
spec:
  replicas: 4
  minReadySeconds: 60
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      maxSurge: 2
  template:
    metadata:
      labels:
        group: volume
        tier: service
    spec:
      containers:
      - name: volume-service
        image: gcr.io/example/volume-service:latest
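One caveat worth noting: minReadySeconds counts from the moment a pod reports Ready, so on its own it mainly catches containers that start and then crash. Pairing it with a readinessProbe makes the Ready signal more meaningful. A minimal sketch, assuming the service exposes an HTTP health endpoint (the /healthz path and port 8080 are hypothetical, not taken from the original service):

```yaml
      containers:
      - name: volume-service
        image: gcr.io/example/volume-service:latest
        readinessProbe:
          httpGet:
            path: /healthz   # assumed health endpoint; adjust to the real service
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
```

With this in place, a pod only counts toward the rollout's available count after it has passed the probe and then stayed Ready for the full minReadySeconds window.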

Thank you, everyone, for your help!




Answer 2:


The explanation you quoted is correct: the new replicaSet (the one with the error) will not proceed to completion; its progression will stop once it reaches the maxSurge + maxUnavailable pod count. The old replicaSet will remain present as well.

Here is the example I tried:

spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

And these are the results:

NAME                                  READY     STATUS             RESTARTS   AGE
pod/volume-service-6bb8dd677f-2xpwn   0/1       ImagePullBackOff   0          42s
pod/volume-service-6bb8dd677f-gcwj6   0/1       ImagePullBackOff   0          42s
pod/volume-service-c98fd8d-kfff2      1/1       Running            0          59s
pod/volume-service-c98fd8d-wcjkz      1/1       Running            0          28m
pod/volume-service-c98fd8d-xvhbm      1/1       Running            0          28m

NAME                                              DESIRED   CURRENT   READY     AGE
replicaset.extensions/volume-service-6bb8dd677f   2         2         0         26m
replicaset.extensions/volume-service-c98fd8d      3         3         3         28m

The new replicaSet starts only 2 new pods (1 slot from maxUnavailable and 1 slot from maxSurge).

The old replicaSet keeps running 3 pods (4 - 1 unavailable).

The two parameters you set in the rollingUpdate section are the key point, but you can also play with other factors like readinessProbe, livenessProbe, minReadySeconds, and progressDeadlineSeconds.
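As a sketch, those factors could be combined in a single Deployment spec like this (the /healthz endpoint, port 8080, and all timing values are illustrative assumptions, not taken from the original service):

```yaml
spec:
  replicas: 4
  minReadySeconds: 60           # a pod must stay Ready this long to count as available
  progressDeadlineSeconds: 600  # report the rollout as failed after 10 minutes without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    spec:
      containers:
      - name: volume-service
        image: gcr.io/example/volume-service:latest
        readinessProbe:
          httpGet:
            path: /healthz    # assumed health endpoint
            port: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
```

Note that progressDeadlineSeconds only sets the Deployment's Progressing condition to False once the deadline passes; it does not roll back automatically.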

For those, here is the reference.




Answer 3:


I agree with @Nicola_Ben - I would also consider changing to the setup below:

spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 <----- I want at least (4)-[1] = 3 available pods.
      maxSurge: 1       <----- I want maximum  (4)+[1] = 5 total running pods.

Or even change maxSurge to 0.
This will help us expose fewer potentially nonfunctional pods (as we would do in a canary release).
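For illustration, the maxSurge: 0 variant of the strategy above would look like this (no extra pod is ever created, so pods are replaced strictly one at a time):

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0   # no surge pod: at most 4 pods total, at least 3 available
```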

As @Hana_Alaydrus suggested, it's important to set minReadySeconds.

In addition to that, sometimes we need to take further action after the rollout has executed.
(For example, there are cases where the new pods are not functioning properly, but the process running inside the container hasn't crashed.)

A suggestion for a general debug process:

1 ) First of all, pause the rollout with: kubectl rollout pause deployment <name>.

2 ) Debug the relevant pods and decide how to continue (maybe we can continue with the new release, maybe not).

3 ) We would have to resume the rollout with: kubectl rollout resume deployment <name>, because even if we decide to return to the previous release with the undo command (4.B), we first need to resume the rollout.

4.A ) Continue with new release.

4.B ) Return to previous release with: kubectl rollout undo deployment <name>.


(The original answer included a visual summary image, not reproduced here.)



Source: https://stackoverflow.com/questions/52121422/how-to-automatically-stop-rolling-update-when-crashloopbackoff
