I have a regional cluster set up in Google Kubernetes Engine (GKE). The node group is a single VM in each zone (3 total). I have a deploy
I agree that, according to the [documentation][1], it seems that "gke-name-cluster-default-pool" could be safely deleted, given the conditions:
DaemonSets
) can be moved to other nodes. However, checking the [documentation][2], I found:
> What types of pods can prevent CA from removing a node?
>
> [...] Kube-system pods that are not run on the node by default, * [..]
heapster-v1.5.2--- is running on the node, and it is a kube-system pod that is not run on the node by default, so it prevents the CA from removing the node.
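To make the rule quoted from the FAQ concrete, here is a minimal Python sketch of that single check. It is not the autoscaler's actual code: the `Pod` class, the pod names, and the idea of using the owner kind (`DaemonSet`, or `Node` for mirror pods) to decide what "runs on the node by default" are all assumptions made for illustration, and the other conditions listed in the FAQ are ignored.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str        # hypothetical pod name
    namespace: str
    owner_kind: str  # e.g. "DaemonSet", "ReplicaSet", "Node" (mirror pod)


# Assumption: pods owned by a DaemonSet, or mirror pods owned by the Node,
# are the ones that "run on the node by default".
RUNS_ON_NODE_BY_DEFAULT = {"DaemonSet", "Node"}


def blocking_pods(pods):
    """Return names of kube-system pods that are not run on the node by default."""
    return [
        p.name
        for p in pods
        if p.namespace == "kube-system"
        and p.owner_kind not in RUNS_ON_NODE_BY_DEFAULT
    ]


node_pods = [
    Pod("kube-proxy-example", "kube-system", "Node"),
    Pod("fluentd-example", "kube-system", "DaemonSet"),
    Pod("heapster-example", "kube-system", "ReplicaSet"),
]
print(blocking_pods(node_pods))  # ['heapster-example']
```

Under these assumptions only the heapster pod is reported, which matches what I saw on the node.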
I will update the answer if I discover more interesting information.
The fact that the node is the last one in the zone is not an issue.
To prove it, I tested on a brand-new cluster with 3 nodes, each in a different zone. One of them had no workload apart from "kube-proxy" and "fluentd", and it was correctly deleted even though that brought the size of the zone to zero.

  [1]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
  [2]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

Answering myself for visibility.
The problem is that the CA never considers moving anything unless all the requirements mentioned in the FAQ are met at the same time. So let's say I have 100 nodes, each with 51% of its CPU requested. It still won't consider downscaling.
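The 100-node scenario can be sketched in a few lines of Python. This is only an illustration of the CPU side of the threshold check, assuming each node is described by hypothetical `(requested, allocatable)` millicore values; it is not how the autoscaler is implemented, and a node that passes this check would still have to satisfy the other FAQ requirements.

```python
SCALE_DOWN_UTILIZATION_THRESHOLD = 0.5  # the 50% value discussed here


def scale_down_candidates(nodes, threshold=SCALE_DOWN_UTILIZATION_THRESHOLD):
    """Return nodes whose requested CPU is below the threshold.

    `nodes` maps a node name to (requested_millicores, allocatable_millicores).
    """
    return [
        name
        for name, (requested, allocatable) in nodes.items()
        if requested / allocatable < threshold
    ]


# 100 hypothetical nodes, each with 51% of its CPU requested.
nodes = {f"node-{i}": (510, 1000) for i in range(100)}
print(scale_down_candidates(nodes))  # []

# Only once a node drops below 50% does it become a candidate at all.
nodes["node-0"] = (490, 1000)
print(scale_down_candidates(nodes))  # ['node-0']
```

No pods are ever consolidated to get a node under the threshold; the check is applied to each node as it currently stands.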
One solution would be to raise the utilization threshold that the CA checks against, currently 50%. Unfortunately, GKE does not support that; see the answer from Google support, relayed by @GalloCedrone:
> Moreover I know that this value might sound too low and someone could be interested to keep as well a 85% or 90% to avoid your scenario. Currently there is a feature request open to give the user the possibility to modify the flag "--scale-down-utilization-threshold", but it is not implemented yet.
The workaround I found is to decrease the CPU request of the pods (100m instead of 300m) and have the Horizontal Pod Autoscaler (HPA) create more of them on demand. This is fine for me, but if your application is not suitable for many small instances, you are out of luck. Perhaps a cron job that cordons a node if the total utilization is low?
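As a rough idea of what such a cron job could decide, here is an untested Python sketch. Everything in it is an assumption: the node data, the 50% cluster-wide threshold, and the choice to cordon the least-utilized node. It only builds the `kubectl cordon` command rather than running it.

```python
def pick_node_to_cordon(nodes, cluster_threshold=0.5):
    """Pick the least-utilized node if total CPU utilization is low.

    `nodes` maps a node name to (requested_millicores, allocatable_millicores).
    Returns None when the cluster is busy enough or there are no nodes.
    """
    if not nodes:
        return None
    total_requested = sum(req for req, _ in nodes.values())
    total_allocatable = sum(alloc for _, alloc in nodes.values())
    if total_requested / total_allocatable >= cluster_threshold:
        return None
    return min(nodes, key=lambda name: nodes[name][0] / nodes[name][1])


def cordon_command(node_name):
    """Build (but do not run) the kubectl command that cordons a node."""
    return ["kubectl", "cordon", node_name]


# Hypothetical cluster: 1100m requested out of 3000m allocatable.
nodes = {"node-a": (600, 1000), "node-b": (300, 1000), "node-c": (200, 1000)}
target = pick_node_to_cordon(nodes)
if target is not None:
    print(cordon_command(target))  # ['kubectl', 'cordon', 'node-c']
```

Cordoning only stops new pods from being scheduled on the node, so its existing pods would still have to be rescheduled elsewhere before the node's requests drop under the CA threshold.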