An In-Depth Analysis of Kubernetes Critical Pods (Part 1)

Posted by 我的未来我决定 on 2020-03-27 18:16:07


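When deploying core components in a Kubernetes cluster, people often use Critical Pods. But do you know what actually makes a Critical Pod special? A complete answer is less simple than it looks, because it touches scheduling, the Kubelet Eviction Manager, the DaemonSet Controller, Kubelet Preemption, and more. I will cover it in a four-part series. This first part describes how a Critical Pod behaves in the scheduler's predicate phase, and how that compares with the behavior users expect.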

The Rescheduler is officially deprecated as of Kubernetes 1.10 and will be removed in version 1.12, so this article does not discuss how the Rescheduler handles Critical Pods.

How to mark a Pod as a Critical Pod

Rule 1:

  • Enable the feature gate ExperimentalCriticalPodAnnotation;
  • The Pod must belong to the kube-system namespace;
  • The Pod must carry the annotation scheduler.alpha.kubernetes.io/critical-pod=""

Rule 2:

  • Enable the feature gates ExperimentalCriticalPodAnnotation and PodPriority;
  • The Pod's Priority is set and is not less than 2 * 10^9:

    system-node-critical priority = 2 * 10^9 + 1000;
    system-cluster-critical priority = 2 * 10^9;

A Pod that satisfies either Rule 1 or Rule 2 is treated as a Critical Pod.
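The two rules can be sketched as a single check. This is a minimal illustration that follows the rules as listed above, not the actual Kubernetes implementation: the `pod` and `gates` types and the `isCriticalPod` function are hypothetical stand-ins for `v1.Pod` and the feature-gate lookup.

```go
package main

import "fmt"

const (
	// Annotation key quoted in Rule 1.
	criticalPodAnnotationKey = "scheduler.alpha.kubernetes.io/critical-pod"
	// Threshold quoted in Rule 2: 2 * 10^9 (fits in an int32).
	systemCriticalPriority int32 = 2 * 1000000000
)

// pod is a hypothetical, stripped-down stand-in for v1.Pod.
type pod struct {
	Namespace   string
	Annotations map[string]string
	Priority    *int32 // nil when no priority has been resolved
}

// gates is a hypothetical stand-in for the feature-gate lookup.
type gates struct {
	CriticalPodAnnotation bool // ExperimentalCriticalPodAnnotation
	PodPriority           bool // PodPriority
}

// isCriticalPod reports whether p satisfies Rule 1 or Rule 2.
func isCriticalPod(p pod, g gates) bool {
	// Both rules require the ExperimentalCriticalPodAnnotation gate.
	if !g.CriticalPodAnnotation {
		return false
	}
	// Rule 2: PodPriority enabled, priority set and at least 2 * 10^9.
	if g.PodPriority && p.Priority != nil && *p.Priority >= systemCriticalPriority {
		return true
	}
	// Rule 1: kube-system namespace plus the annotation with an empty value.
	if p.Namespace == "kube-system" {
		if v, ok := p.Annotations[criticalPodAnnotationKey]; ok && v == "" {
			return true
		}
	}
	return false
}

func main() {
	all := gates{CriticalPodAnnotation: true, PodPriority: true}
	annotated := pod{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotationKey: ""},
	}
	fmt.Println("annotated pod in kube-system is critical:", isCriticalPod(annotated, all))
}
```

Note that the same annotation on a Pod outside kube-system does not satisfy Rule 1 in this sketch; only a sufficiently high priority (Rule 2) would make such a Pod critical.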

Schedule Critical Pod

In the predicate phase of Pod scheduling, the default scheduler registers GeneralPredicates as one of the default predicates. It does not check whether the Pod is critical and switch to EssentialPredicates for Critical Pods. What does that mean?

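The relationship between GeneralPredicates and EssentialPredicates makes this clear. GeneralPredicates first calls noncriticalPredicates and then calls EssentialPredicates. So if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as critical, the scheduler still runs the full GeneralPredicates flow, including noncriticalPredicates, even though you would expect it to go straight to EssentialPredicates.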

// GeneralPredicates checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates
// that only non-critical pods need and EssentialPredicates are the predicates that all pods, including critical pods, need
func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}

noncriticalPredicates was intended to hold the extra predicate logic that applies only to non-critical Pods, and that logic is the PodFitsResources check.

pkg/scheduler/algorithm/predicates/predicates.go:1076

// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}

PodFitsResources checks whether the node has enough of each of the following resources:

  • Allowed Pod Number;
  • CPU;
  • Memory;
  • EphemeralStorage;
  • Extended Resources;

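In other words, if you mark the Pods of a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as critical, scheduling still checks whether Allowed Pod Number, CPU, Memory, EphemeralStorage, and Extended Resources are sufficient. If they are not, the predicate fails. The Preempt phase then only runs the normal preemption logic based on the Pod's PriorityClass, with no special handling for Critical Pods. As a result, the scheduler may find no Node that meets the resource requirements, and the Critical Pod fails to schedule and stays Pending.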
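To make the resource check concrete, here is a simplified sketch of the kind of comparison PodFitsResources performs. It is not the scheduler's code: the `resources` and `nodeInfo` types and the reason strings are assumptions made for illustration, and Extended Resources are left out for brevity.

```go
package main

import "fmt"

// resources is a hypothetical, simplified resource vector.
type resources struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
}

// nodeInfo is a hypothetical stand-in for the scheduler's per-node cache entry.
type nodeInfo struct {
	Allocatable resources // what the node can hand out in total
	Requested   resources // what Pods already on the node have requested
	AllowedPods int       // maximum number of Pods on the node
	PodCount    int       // Pods currently on the node
}

// podFitsResources returns one reason per resource the node is short of.
// An empty result means the Pod fits.
func podFitsResources(req resources, n nodeInfo) []string {
	var reasons []string
	if n.PodCount+1 > n.AllowedPods {
		reasons = append(reasons, "Insufficient pods")
	}
	if req.MilliCPU > n.Allocatable.MilliCPU-n.Requested.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if req.Memory > n.Allocatable.Memory-n.Requested.Memory {
		reasons = append(reasons, "Insufficient memory")
	}
	if req.EphemeralStorage > n.Allocatable.EphemeralStorage-n.Requested.EphemeralStorage {
		reasons = append(reasons, "Insufficient ephemeral-storage")
	}
	return reasons
}

func main() {
	n := nodeInfo{
		Allocatable: resources{MilliCPU: 4000, Memory: 8 << 30, EphemeralStorage: 100 << 30},
		Requested:   resources{MilliCPU: 3800, Memory: 2 << 30, EphemeralStorage: 10 << 30},
		AllowedPods: 110,
		PodCount:    20,
	}
	// Only 200 millicores are free, so a 500m request does not fit,
	// regardless of whether the Pod is marked critical.
	fmt.Println(podFitsResources(resources{MilliCPU: 500, Memory: 1 << 30}, n))
}
```

Nothing in this check looks at the critical marker, which is why a Critical Pod created by a Deployment can end up Pending on a full cluster.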

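Users who mark a Pod as critical, however, do not want scheduling to fail because of insufficient resources. So what if you do want to deploy a key service as Critical Pods through a Deployment, StatefulSet, or similar workload (DaemonSet excepted)? There are two options: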

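  1. Follow Rule 2 above and give the Pod the system-cluster-critical or system-node-critical Priority Class, so that it obtains resources through the scheduler's normal Preempt flow and gets scheduled.
  2. Follow Rule 1 above and also modify the GeneralPredicates code as shown below, so that it detects whether the Pod is a Critical Pod and, if it is, skips noncriticalPredicates. In other words, the predicate phase no longer checks Allowed Pod Number, CPU, Memory, EphemeralStorage, or Extended Resources for that Pod.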
// Requires the utilfeature, features, and kubelettypes packages to be imported in predicates.go.
func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason

	// **Modify**: check whether the pod is a Critical Pod; skip noncriticalPredicates if it is.
	isCriticalPod := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod)

	if !isCriticalPod {
		// Resource checks still apply to every non-critical pod.
		fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
		if err != nil {
			return false, predicateFails, err
		}
		if !fit {
			predicateFails = append(predicateFails, reasons...)
		}
	}

	// EssentialPredicates apply to all pods, including critical ones.
	fit, reasons, err := EssentialPredicates(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}
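The effect of skipping noncriticalPredicates can be seen in a small standalone sketch. The `predicate` type and the `generalPredicates` function below are hypothetical simplifications, with each predicate reduced to a function that returns its failure reasons; they only mirror the control flow discussed above.

```go
package main

import "fmt"

// predicate is a hypothetical stand-in for a scheduler predicate.
// It returns the failure reasons; an empty slice means the node fits.
type predicate func() []string

// generalPredicates mirrors the control flow discussed above.
// When skipForCritical is true and the Pod is critical, the
// non-critical (resource) predicate is not evaluated at all.
func generalPredicates(isCritical, skipForCritical bool, noncritical, essential predicate) (bool, []string) {
	var fails []string
	if !(skipForCritical && isCritical) {
		fails = append(fails, noncritical()...)
	}
	fails = append(fails, essential()...)
	return len(fails) == 0, fails
}

func main() {
	// A full node: the resource check fails, the essential checks pass.
	noncritical := func() []string { return []string{"Insufficient cpu"} }
	essential := func() []string { return nil }

	fit, reasons := generalPredicates(true, false, noncritical, essential)
	fmt.Println("stock flow, critical pod:", fit, reasons)

	fit, reasons = generalPredicates(true, true, noncritical, essential)
	fmt.Println("modified flow, critical pod:", fit, reasons)
}
```

With the stock flow the critical Pod is rejected by the full node; with the modified flow it passes the predicate phase, and a non-critical Pod is rejected in both cases.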

Option 1 is in fact already done for you by Kubernetes during the Priority admission check.

// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
	...
	if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
		var priority int32
		if len(pod.Spec.PriorityClassName) == 0 &&
			utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
			kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
			pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
		}
            ...
}

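During admission, the Pod's Priority is checked. If it finds that you have:

  • enabled the PodPriority feature gate;
  • enabled the ExperimentalCriticalPodAnnotation feature gate;
  • added the critical Pod annotation (scheduler.alpha.kubernetes.io/critical-pod) to the Pod;
  • deployed the Pod in the kube-system namespace;
  • not set a PriorityClass of your own;

then the Priority admission stage automatically assigns the SystemClusterCritical (system-cluster-critical) PriorityClass to the Pod.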
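The defaulting conditions above can be condensed into one function. This is a minimal sketch under stated assumptions: `resolvePriorityClassName` and its parameters are hypothetical names, and the real plugin operates on the Pod object and the feature gates directly, as in the admitPod excerpt.

```go
package main

import "fmt"

const (
	criticalPodAnnotationKey = "scheduler.alpha.kubernetes.io/critical-pod"
	systemClusterCritical    = "system-cluster-critical"
)

// resolvePriorityClassName returns the PriorityClass name the Pod ends up
// with after the defaulting step. All names here are illustrative.
func resolvePriorityClassName(current, namespace string, annotations map[string]string,
	podPriorityGate, criticalAnnotationGate bool) string {
	// Defaulting only happens when both feature gates are on and the
	// user has not chosen a PriorityClass.
	if !podPriorityGate || !criticalAnnotationGate || current != "" {
		return current
	}
	// The annotation only counts inside kube-system.
	if namespace != "kube-system" {
		return current
	}
	if v, ok := annotations[criticalPodAnnotationKey]; ok && v == "" {
		return systemClusterCritical
	}
	return current
}

func main() {
	ann := map[string]string{criticalPodAnnotationKey: ""}
	fmt.Printf("annotated, no class set: %q\n",
		resolvePriorityClassName("", "kube-system", ann, true, true))
	fmt.Printf("annotated, custom class: %q\n",
		resolvePriorityClassName("my-class", "kube-system", ann, true, true))
}
```

A Pod that already names its own PriorityClass keeps it, which is why the automatic upgrade to system-cluster-critical only happens when none is set.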

Best practice

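Based on the analysis above, the best practice is as follows. When you deploy a key service in a Kubernetes cluster through something other than a DaemonSet (for example a Deployment or ReplicaSet), and you want preemption-based scheduling to succeed even when cluster resources are short, make sure that you:

  • enable the PodPriority feature gate;
  • enable the ExperimentalCriticalPodAnnotation feature gate;
  • add the critical Pod annotation (scheduler.alpha.kubernetes.io/critical-pod) to the Pod;
  • deploy the Pod in the kube-system namespace;
  • never set a custom PriorityClass by hand.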

Summary

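This article described the two ways to mark a key service as critical, explained how a Critical Pod (other than one deployed by a DaemonSet) behaves in the scheduler's predicate phase, and gave a best practice.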
