Question
I've started working with Docker images and set up Kubernetes. Everything else works, but I am having problems with how long it takes for pods to be recreated.
If a pod is running on a particular node and I shut that node down, it takes ~5 minutes for the pod to be recreated on another online node.
I've checked all the relevant config files and also set the pod-eviction-timeout, horizontal-pod-autoscaler-downscale, and horizontal-pod-autoscaler-downscale-delay flags, but it still takes just as long.
Current kube controller manager config:
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=192.168.5.135
    - --allocate-node-cidrs=false
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-cidr=192.168.5.0/24
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --node-cidr-mask-size=24
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --horizontal-pod-autoscaler-downscale-delay=20s
    - --horizontal-pod-autoscaler-sync-period=20s
    - --node-monitor-grace-period=40s
    - --node-monitor-period=5s
    - --pod-eviction-timeout=20s
    - --use-service-account-credentials=true
    - --horizontal-pod-autoscaler-downscale-stabilization=20s
    image: k8s.gcr.io/kube-controller-manager:v1.13.0
Thank you.
Answer 1:
If tolerations for taint-based evictions are present in the pod definition, the controller manager cannot evict the pod for as long as the pod tolerates the taint. Even if you don't define any tolerations in your configuration, the pod gets default ones, because the DefaultTolerationSeconds admission controller plugin is enabled by default.
The DefaultTolerationSeconds admission controller plugin configures your pod like below:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
You can verify this by inspecting the definition of your pod:
kubectl get pods -o yaml -n <namespace> <pod-name>
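If you would rather check this programmatically, the sketch below pulls the NoExecute tolerations out of a pod object. It assumes you have the pod as a parsed JSON dictionary (for example from the same kubectl command with -o json); the function name noexecute_tolerations and the sample pod are made up for illustration.

```python
def noexecute_tolerations(pod: dict) -> dict:
    """Map taint key -> tolerationSeconds for every NoExecute toleration.

    A value of None means the toleration sets no tolerationSeconds,
    i.e. the pod tolerates that taint indefinitely.
    """
    tolerations = pod.get("spec", {}).get("tolerations") or []
    return {
        tol.get("key"): tol.get("tolerationSeconds")
        for tol in tolerations
        if tol.get("effect") == "NoExecute"
    }


# Hypothetical pod carrying the default tolerations described above.
sample_pod = {
    "spec": {
        "tolerations": [
            {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
             "effect": "NoExecute", "tolerationSeconds": 300},
            {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
             "effect": "NoExecute", "tolerationSeconds": 300},
        ]
    }
}

for key, seconds in noexecute_tolerations(sample_pod).items():
    print(f"{key}: {seconds}")
```

For a real pod you would load the kubectl JSON output with json.load and pass the resulting dictionary to the function instead of sample_pod.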
According to the above tolerations, it takes more than 5 minutes to recreate the pod on another ready node, since the pod can tolerate the not-ready taint for up to 5 minutes. In this case, even if you set --pod-eviction-timeout to 20s, there is nothing the controller manager can do because of the tolerations.
But why does it take more than 5 minutes? Because the node is only considered down after --node-monitor-grace-period, which defaults to 40s. Only after that does the pod toleration timer start.
Recommended Solution
If you want your cluster to react faster to node outages, you should use taints and tolerations without modifying the cluster options. For example, you can define your pod like below:
tolerations:
- key: node.kubernetes.io/not-ready
  effect: NoExecute
  tolerationSeconds: 0
- key: node.kubernetes.io/unreachable
  effect: NoExecute
  tolerationSeconds: 0
With the above tolerations, your pod will be recreated on a ready node right after the current node is marked as not ready. This should take less than a minute, since --node-monitor-grace-period defaults to 40s.
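As a rough worked example of the timing described in this answer, the sketch below adds the node monitor grace period to the toleration time. It is only an approximation under the assumptions stated here: it ignores scheduling and container start-up time, and the function name seconds_until_eviction is invented for illustration.

```python
def seconds_until_eviction(node_monitor_grace_period: int,
                           toleration_seconds: int) -> int:
    """Approximate delay between a node going silent and its pods being evicted.

    The node is first marked not ready after the grace period; only then
    does the pod's toleration timer start counting.
    """
    return node_monitor_grace_period + toleration_seconds


GRACE_PERIOD = 40  # default --node-monitor-grace-period, in seconds

print(seconds_until_eviction(GRACE_PERIOD, 300))  # default tolerations: 340
print(seconds_until_eviction(GRACE_PERIOD, 0))    # tolerationSeconds: 0 -> 40
```

This matches the observation in the question: with the default tolerations the pod comes back after a bit more than 5 minutes, regardless of --pod-eviction-timeout.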
Available Options
If you want to be in control of these timings, below you will find plenty of options to do so. However, modifying these options should be avoided: tight timings can create overhead on etcd, as every node will try to update its status very often.
In addition, it is currently not clear how to propagate changes in controller manager, API server, and kubelet configuration to all nodes in a live cluster. Please see Tracking issue for changing the cluster and Dynamic Kubelet Configuration. As of this writing, reconfiguring a node's kubelet in a live cluster is in beta.
You can configure the control plane and the kubelet during the kubeadm init or join phase. Please refer to Customizing control plane configuration with kubeadm and Configuring each kubelet in your cluster using kubeadm for more details.
Assuming you have a single node cluster:
- controller manager includes:
  --node-monitor-grace-period (default 40s)
  --node-monitor-period (default 5s)
  --pod-eviction-timeout (default 5m0s)
- api server includes:
  --default-not-ready-toleration-seconds (default 300)
  --default-unreachable-toleration-seconds (default 300)
- kubelet includes:
  --node-status-update-frequency (default 10s)
If you set up the cluster with kubeadm, you can modify:
- /etc/kubernetes/manifests/kube-controller-manager.yaml for controller manager options.
- /etc/kubernetes/manifests/kube-apiserver.yaml for api server options.
Note: Modifying these files will reconfigure and restart the respective pod on the node.
In order to modify the kubelet config, you can add the line below:
KUBELET_EXTRA_ARGS="--node-status-update-frequency=10s"
to /etc/default/kubelet (for DEBs) or /etc/sysconfig/kubelet (for RPMs), and then restart the kubelet service:
sudo systemctl daemon-reload && sudo systemctl restart kubelet
Answer 2:
This is what happens when a node dies or goes offline:
- The kubelet posts its status to the masters at the interval set by --node-status-update-frequency=10s.
- The node goes offline.
- kube-controller-manager monitors all the nodes at the interval set by --node-monitor-period=5s.
- kube-controller-manager sees that the node is unresponsive and waits for the grace period --node-monitor-grace-period=40s before it considers the node unhealthy. PS: This parameter should be N x node-status-update-frequency.
- Once the node is marked unhealthy, kube-controller-manager removes the pods based on --pod-eviction-timeout=5m.
Now, if you lower the pod-eviction-timeout parameter to, say, 30 seconds, it will still take:
node-status-update-frequency: 10s
node-monitor-period: 5s
node-monitor-grace-period: 40s
pod-eviction-timeout: 30s
Total: 70 seconds to evict the pod from the node.
The node-status-update-frequency and node-monitor-period intervals are already counted within node-monitor-grace-period, so the total is 40s + 30s = 70s. You can tweak these values as well to further lower your total eviction time.
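To experiment with other values, here is a minimal sketch of the arithmetic above. It assumes the timeline described in this answer (grace period followed by eviction timeout) and parses only simple integer h/m/s durations such as 40s or 5m0s, which is a simplification of what the flags accept; parse_duration and eviction_delay are hypothetical helper names.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Parse a duration such as "40s", "5m" or "5m0s" into whole seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything that is not made up purely of <int><unit> components.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def eviction_delay(grace_period: str, eviction_timeout: str) -> int:
    """Seconds from the node going offline until its pods are evicted."""
    return parse_duration(grace_period) + parse_duration(eviction_timeout)


print(eviction_delay("40s", "30s"))   # 70, the case described above
print(eviction_delay("40s", "5m0s"))  # 340, with the default eviction timeout

# Sanity check for the "N x node-status-update-frequency" rule of thumb.
print(parse_duration("40s") % parse_duration("10s") == 0)  # True
```

The status update frequency and monitor period are deliberately left out of the sum, since they elapse inside the grace period.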
This is my kube-controller-manager.yaml file (present at /etc/kubernetes/manifests for kubeadm):
containers:
- command:
  - kube-controller-manager
  - --controllers=*,bootstrapsigner,tokencleaner
  - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
  - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
  - --pod-eviction-timeout=30s
  - --address=127.0.0.1
  - --use-service-account-credentials=true
  - --kubeconfig=/etc/kubernetes/controller-manager.conf
In practice, I am seeing my pods get evicted 70s after I turn off my node.
EDIT2:
Run the following command on the master and check that --pod-eviction-timeout shows up as 20s (the sample output below is from my own cluster, where it is set to 30s).
[root@ip-10-0-1-12]# docker ps --no-trunc | grep "kube-controller-manager"
9bc26f99dcfe6ac0e7b2abf22bff67af6060561ee8c0cdff08e11c3a479f182c sha256:40c8d10b2d11cbc3db2e373a5ffce60dd22dbbf6236567f28ac6abb7efbfc8a9
"kube-controller-manager --leader-elect=true --use-service-account-credentials=true --root-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key \
**--pod-eviction-timeout=30s** --address=127.0.0.1 --controllers=*,bootstrapsigner,tokencleaner --kubeconfig=/etc/kubernetes/controller-manager.conf --service-account-private-key-file=/etc/kubernetes/pki/sa.key --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --allocate-node-cidrs=true --cluster-cidr=192.168.13.0/24 --node-cidr-mask-size=24"
If --pod-eviction-timeout shows 5m here and not 20s, then your changes have not been applied properly.
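If you want to script this check, the sketch below extracts a flag's value from a command line string such as the one printed by docker ps. It is a minimal sketch that only handles the --flag=value form; the helper name flag_value and the shortened sample command line are made up for illustration.

```python
import re


def flag_value(command_line: str, flag: str):
    """Return the value of --<flag>=<value> in a command line, or None if absent."""
    pattern = rf"(?<![\w-])--{re.escape(flag)}=([^\s\"]+)"
    match = re.search(pattern, command_line)
    return match.group(1) if match else None


# Shortened, hypothetical command line for demonstration.
cmd = ("kube-controller-manager --leader-elect=true "
       "--pod-eviction-timeout=30s --address=127.0.0.1")

print(flag_value(cmd, "pod-eviction-timeout"))  # 30s
print(flag_value(cmd, "node-monitor-period"))   # None, so the default applies
```

A result of None means the flag is not set on the running process, so the controller manager is using its default for that option.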
Source: https://stackoverflow.com/questions/53641252/kubernetes-recreate-pod-if-node-becomes-offline-timeout