WARNING WARNING WARNING WARNING WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should refer to the docs that go with that version.

Documentation for other releases can be found at releases.k8s.io.

Node Allocatable Resources

Issue: kubernetes#13984

Overview

Currently Node.Status has Capacity, but no concept of node Allocatable. We need additional parameters to serve several purposes:

  1. Kubernetes metrics provide "/docker-daemon", "/kubelet", "/kube-proxy", "/system", etc. raw containers for monitoring system component resource usage patterns and detecting regressions. Eventually we want to cap system component usage to a certain limit / request. However, this is not currently feasible for a variety of reasons, including:
     1. Docker still uses tons of computing resources (see #16943).
     2. We have not yet defined the minimal system requirements, so we cannot control Kubernetes nodes or know about arbitrary daemons, which can make the system resources unmanageable. Even with a resource cap we cannot do full resource management on the node, but with the proposed parameters we can mitigate really bad resource overcommits.
     3. Usage scales with the number of pods running on the node.
  2. For integration with external schedulers (such as Mesos, Hadoop, etc.), those schedulers might want to partition compute resources on a given node, limiting how much the Kubelet can use. We should provide a mechanism by which they can query the Kubelet and reserve some resources for their own purposes.

Scope of proposal

This proposal deals with resource reporting through the Allocatable field, for more reliable scheduling and to minimize resource overcommitment. This proposal does not cover resource usage enforcement (e.g. limiting Kubernetes component usage), pod eviction (e.g. when reservation grows), or running multiple Kubelets on a single node.

Design

Definitions


  1. Node Capacity - Already provided as NodeStatus.Capacity, this is the total capacity read from the node instance, and is assumed to be constant.
  2. System-Reserved (proposed) - Compute resources reserved for processes which are not managed by Kubernetes. Currently this covers all the processes lumped together in the /system raw container.
  3. Kubelet Allocatable - Compute resources available for scheduling (including scheduled & unscheduled resources). This value is the focus of this proposal. See below for more details.
  4. Kube-Reserved (proposed) - Compute resources reserved for Kubernetes components such as the docker daemon, kubelet, kube-proxy, etc.

API changes

Allocatable

Add Allocatable (3) to NodeStatus:

type NodeStatus struct {
  ...
  // Allocatable represents schedulable resources of a node.
  Allocatable ResourceList `json:"allocatable,omitempty"`
  ...
}

Allocatable will be computed by the Kubelet and reported to the API server. It is defined to be:

   [Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved]
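The subtraction above can be sketched in Go as follows. This is only an illustration under stated assumptions: `ResourceList` here is a simplified map of resource name to an integer quantity (millicores for cpu, bytes for memory) rather than the real API type, `computeAllocatable` is a hypothetical helper name, and clamping negative results to zero is a choice made for this sketch that the proposal does not specify.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for the API ResourceList (assumption):
// resource name -> quantity in base units (millicores for cpu, bytes for memory).
type ResourceList map[string]int64

// computeAllocatable returns capacity - kubeReserved - systemReserved for every
// resource present in capacity. Resources missing from a reservation count as 0.
// Clamping at zero is an assumption of this sketch, not part of the proposal.
func computeAllocatable(capacity, kubeReserved, systemReserved ResourceList) ResourceList {
	allocatable := make(ResourceList, len(capacity))
	for name, total := range capacity {
		remaining := total - kubeReserved[name] - systemReserved[name]
		if remaining < 0 {
			remaining = 0
		}
		allocatable[name] = remaining
	}
	return allocatable
}

func main() {
	// Illustrative numbers only.
	capacity := ResourceList{"cpu": 4000, "memory": 8 << 30}
	kubeReserved := ResourceList{"cpu": 500, "memory": 5 << 20}
	systemReserved := ResourceList{"cpu": 250, "memory": 256 << 20}

	allocatable := computeAllocatable(capacity, kubeReserved, systemReserved)
	fmt.Println("cpu (millicores):", allocatable["cpu"])
	fmt.Println("memory (bytes):", allocatable["memory"])
}
```

With both reservations empty, the helper returns a copy of Capacity, which matches the "Allocatable == Capacity" case described for zero reservations.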

The scheduler will use Allocatable in place of Capacity when scheduling pods, and the Kubelet will use it when performing admission checks.

Note: Since kernel usage can fluctuate and is outside of Kubernetes' control, it will be reported as a separate value (probably via the metrics API). Reporting kernel usage is out of scope for this proposal.

Kube-Reserved

KubeReserved is the parameter specifying resources reserved for kubernetes components (4). It is provided as a command-line flag to the Kubelet at startup, and therefore cannot be changed during normal Kubelet operation (this may change in the future).

The flag will be specified as a serialized ResourceList, with resources defined by the API ResourceName and values specified in resource.Quantity format, e.g.:

--kube-reserved=cpu=500m,memory=5Mi

Initially we will only support CPU and memory, but will eventually support more resources. See #16889 for disk accounting.
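As a rough illustration of the flag format, the sketch below parses a value such as `cpu=500m,memory=5Mi` into a map. It is not the Kubelet's parser: `parseReserved` and `parseQuantity` are hypothetical helpers, and the quantity handling is a deliberately reduced subset of the resource.Quantity format (milli-CPU or whole cores for cpu; plain bytes or Ki/Mi/Gi suffixes for memory). Only cpu and memory are accepted, mirroring the initial scope.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQuantity handles a small subset of the resource.Quantity format
// (assumption: the real parser accepts many more forms).
// cpu is returned in millicores, memory in bytes.
func parseQuantity(name, value string) (int64, error) {
	switch name {
	case "cpu":
		if strings.HasSuffix(value, "m") {
			return strconv.ParseInt(strings.TrimSuffix(value, "m"), 10, 64)
		}
		cores, err := strconv.ParseInt(value, 10, 64)
		return cores * 1000, err
	case "memory":
		suffixes := []struct {
			suffix string
			shift  uint
		}{{"Ki", 10}, {"Mi", 20}, {"Gi", 30}}
		for _, s := range suffixes {
			if strings.HasSuffix(value, s.suffix) {
				n, err := strconv.ParseInt(strings.TrimSuffix(value, s.suffix), 10, 64)
				return n << s.shift, err
			}
		}
		return strconv.ParseInt(value, 10, 64)
	default:
		return 0, fmt.Errorf("unsupported resource %q", name)
	}
}

// parseReserved turns "name=quantity,name=quantity" into a map.
// An empty flag value yields an empty reservation.
func parseReserved(flag string) (map[string]int64, error) {
	reserved := map[string]int64{}
	if flag == "" {
		return reserved, nil
	}
	for _, pair := range strings.Split(flag, ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q, want name=quantity", pair)
		}
		qty, err := parseQuantity(parts[0], parts[1])
		if err != nil {
			return nil, fmt.Errorf("invalid quantity for %q: %v", parts[0], err)
		}
		reserved[parts[0]] = qty
	}
	return reserved, nil
}

func main() {
	reserved, err := parseReserved("cpu=500m,memory=5Mi")
	if err != nil {
		panic(err)
	}
	fmt.Println("cpu (millicores):", reserved["cpu"])
	fmt.Println("memory (bytes):", reserved["memory"])
}
```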

If KubeReserved is not set it defaults to a sane value (TBD) calculated from machine capacity. If it is explicitly set to 0 (along with SystemReserved), then Allocatable == Capacity, and the system behavior is equivalent to the 1.1 behavior with scheduling based on Capacity.

System-Reserved

In the initial implementation, SystemReserved will be functionally equivalent to KubeReserved, but with a different semantic meaning. While KubeReserved designates resources set aside for kubernetes components, SystemReserved designates resources set aside for non-kubernetes components (currently this is reported as all the processes lumped together in the /system raw container).

Issues

Kubernetes reservation is smaller than kubernetes component usage

Solution: Initially, do nothing (best effort). Let the kubernetes daemons overflow the reserved resources and hope for the best. If the node usage is less than Allocatable, there will be some room for overflow and the node should continue to function. If the node has been scheduled to capacity (worst-case scenario) it may enter an unstable state, which is the current behavior in this situation.

In the future we may set a parent cgroup for kubernetes components, with limits set according to KubeReserved.

Version discrepancy

API server / scheduler is not allocatable-resources aware: If the Kubelet rejects a Pod but the scheduler expects the Kubelet to accept it, the system could get stuck in an infinite loop scheduling a Pod onto the node only to have Kubelet repeatedly reject it. To avoid this situation, we will do a 2-stage rollout of Allocatable. In stage 1 (targeted for 1.2), Allocatable will be reported by the Kubelet and the scheduler will be updated to use it, but Kubelet will continue to do admission checks based on Capacity (same as today). In stage 2 of the rollout (targeted for 1.3 or later), the Kubelet will start doing admission checks based on Allocatable.
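The mismatch that the 2-stage rollout avoids can be illustrated with a small sketch. Everything here is hypothetical and simplified: `Node`, `fits`, `schedulerFits`, and `kubeletAdmits` are illustrative names, quantities are plain integers, and the `useAllocatable` switch stands in for the difference between stage 1 (Kubelet checks Capacity) and stage 2 (Kubelet checks Allocatable).

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for the API type (assumption).
type ResourceList map[string]int64

// Node carries only the two fields this sketch needs.
type Node struct {
	Capacity    ResourceList
	Allocatable ResourceList
}

// fits reports whether the new request, added to what is already committed,
// stays within the given limit for every requested resource.
func fits(limit, committed, requested ResourceList) bool {
	for name, want := range requested {
		if committed[name]+want > limit[name] {
			return false
		}
	}
	return true
}

// schedulerFits models an updated scheduler, which uses Allocatable.
func schedulerFits(n Node, committed, requested ResourceList) bool {
	return fits(n.Allocatable, committed, requested)
}

// kubeletAdmits models the Kubelet admission check.
// useAllocatable=false corresponds to stage 1, true to stage 2.
func kubeletAdmits(n Node, committed, requested ResourceList, useAllocatable bool) bool {
	if useAllocatable {
		return fits(n.Allocatable, committed, requested)
	}
	return fits(n.Capacity, committed, requested)
}

func main() {
	node := Node{
		Capacity:    ResourceList{"cpu": 4000},
		Allocatable: ResourceList{"cpu": 3250},
	}
	pod := ResourceList{"cpu": 3500}

	// A scheduler that is not Allocatable-aware checks against Capacity.
	oldSchedulerFits := fits(node.Capacity, nil, pod)

	fmt.Println("old scheduler places pod:", oldSchedulerFits)
	fmt.Println("stage 1 kubelet admits:", kubeletAdmits(node, nil, pod, false))
	fmt.Println("stage 2 kubelet admits:", kubeletAdmits(node, nil, pod, true))
	fmt.Println("updated scheduler places pod:", schedulerFits(node, nil, pod))
}
```

In this sketch, a Capacity-based scheduler paired with a stage 2 Kubelet would keep placing a pod that the Kubelet rejects, which is the loop the staged rollout is meant to prevent.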

API server expects Allocatable but does not receive it: If the kubelet is older and does not provide Allocatable in the NodeStatus, then Allocatable will be defaulted to Capacity (which will yield today's behavior of scheduling based on capacity).
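A minimal sketch of this defaulting, assuming a simplified `NodeStatus` whose quantities are plain integers (the real ResourceList uses resource.Quantity values) and a hypothetical helper named `defaultAllocatable`. Treating an empty Allocatable the same as a missing one is an assumption of the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourceList is a simplified stand-in for the API type (assumption).
type ResourceList map[string]int64

// NodeStatus is reduced to the two fields relevant here.
type NodeStatus struct {
	Capacity    ResourceList `json:"capacity,omitempty"`
	Allocatable ResourceList `json:"allocatable,omitempty"`
}

// defaultAllocatable copies Capacity into Allocatable when the Kubelet did not
// report Allocatable. A copy is made so later changes to one map do not affect
// the other.
func defaultAllocatable(status *NodeStatus) {
	if len(status.Allocatable) > 0 {
		return
	}
	status.Allocatable = make(ResourceList, len(status.Capacity))
	for name, qty := range status.Capacity {
		status.Allocatable[name] = qty
	}
}

func main() {
	// Status as an older Kubelet might report it: no "allocatable" key.
	fromOldKubelet := []byte(`{"capacity":{"cpu":4000,"memory":8589934592}}`)

	var status NodeStatus
	if err := json.Unmarshal(fromOldKubelet, &status); err != nil {
		panic(err)
	}
	defaultAllocatable(&status)

	fmt.Println("allocatable equals capacity:",
		status.Allocatable["cpu"] == status.Capacity["cpu"] &&
			status.Allocatable["memory"] == status.Capacity["memory"])
}
```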

3rd party schedulers

The community should be notified that an update to schedulers is recommended, but if a scheduler is not updated it falls under the above case of "scheduler is not allocatable-resources aware".

Future work

  1. Convert kubelet flags to Config API - Prerequisite to (2). See #12245.
  2. Set cgroup limits according to KubeReserved - as described in the overview
  3. Report kernel usage to be considered with scheduling decisions.
