From 0d476fc9f87c01fcedc5293c3c3a7a87c1a36237 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 6 Oct 2021 16:45:27 +0300 Subject: [PATCH 01/92] KEP-3008: initial version of class-based resources KEP --- .../3008-cri-class-based-resources/README.md | 1131 +++++++++++++++++ 1 file changed, 1131 insertions(+) create mode 100644 keps/sig-node/3008-cri-class-based-resources/README.md diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md new file mode 100644 index 00000000000..bc3368e9a5d --- /dev/null +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -0,0 +1,1131 @@ + +# KEP-3008: Class-based resources + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Story 4](#story-4) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [CRI protocol](#cri-protocol) + - [Pod Spec](#pod-spec) + - [Container runtimes](#container-runtimes) + - [Open Questions](#open-questions) + - [Pod QoS class](#pod-qos-class) + - [Default class](#default-class) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) + - [Future work](#future-work) + - [Resource status/capacity](#resource-statuscapacity) + - [Resource discovery](#resource-discovery) + - [Access control](#access-control) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback 
Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Pod annotations instead of Pod spec changes](#pod-annotations-instead-of-pod-spec-changes) + - [RDT-only](#rdt-only) + - [Widen the scope](#widen-the-scope) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links 
to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +We would like to add support for class-based resources in Kubernetes. +Class-based resources can be thought of as non-accountable resources, each of +which is presented by a set of classes. Being non-accountable means that +multiple containers can be assigned to the same class. They are also supposed +to be opaque to the CRI client in the sense that the container runtime takes +care of configuration and control of the resources and the classes within. + +A prime example of a class-based resource is Intel RDT (Resource Director +Technology). RDT is a technology for controlling the cache lines and memory +bandwidth available to applications. RDT provides a class-based approach for +QoS control of these shared resources: all processes in the same hardware class +share a portion of cache lines and memory bandwidth. + +We also believe that the Linux Block IO controller (cgroup) should be handled +as a class-based resource on the level of container orchestration. This enables +configuring I/O scheduler priority and throttling I/O bandwidth per workload. +Having the support for class-based resources in place, it will provide a +framework for the future, for instance class-based network or memory type +prioritization. + +## Motivation + + + +RDT implements a class-based mechanism for controlling the cache and memory +bandwidth QoS of applications, providing a tool for mitigating noisy neighbors +and fulfilling SLAs. In Linux control happens via resctrl -- a +pseudo-filesystem provided by the kernel which makes it virtually agnostic of +the hardware architecture. The OCI runtime-spec has supported Intel RDT for a +while already. 
Other hardware vendors have comparable technologies which use +the same resctrl interface. + +The Linux Block IO controller parameters depend very heavily on the underlying +hardware and system configuration (device naming/numbering, IO scheduler +configuration, etc.) which makes it very impractical to control from the Pod spec +level. In order to hide this complexity the concept of blockio classes is being +added to the container runtimes (CRI-O and containerd). A system administrator +is able to configure blockio controller parameters on a per-class basis and the +classes are then made available for CRI clients. + +Currently, there is no mechanism in Kubernetes to use these types of resources. +CRI-O and containerd runtimes have support for RDT and blockio +classes and they provide a stopgap user interface through special pod +annotations. We would like these types of resources to eventually become +first-class citizens that are properly supported in Kubernetes, providing visibility, a +well-defined user interface, and permission controls. + +### Goals + + + +- Make it possible to request class resources + - Support RDT class assignment of containers. This is already supported by + the containerd and CRI-O runtimes and is part of the OCI runtime-spec + - Support blockio class assignment of containers. +- Make the extensions flexible, enabling simple addition of other class-based + resource types in the future. + +### Non-Goals + + + +- Interface for configuring the class-based resources. +- Enumerating possible (class) resource types or their detailed behavior +- Resource status/capacity (will be addressed in a separate KEP) +- Discovery of the class-based resources (will be addressed in a separate KEP) +- Access control (will be addressed in a separate KEP) + +## Proposal + + + +We extend the CRI protocol and Pod spec to contain information about the +class-based resource assignment of containers. 
+ +Currently we identify two types of resources (RDT and blockio) but this will be +a generic mechanism that will serve other similar resources in the future. + +### User Stories (Optional) + + + +#### Story 1 + +As a user I want to minimize the interference from other applications with my +workload by assigning it to a class with exclusive cache allocation. + +#### Story 2 + +As a user I want to make sure my low-priority, I/O-intensive background task +will not disturb more important workloads running on the same node. + +#### Story 3 + +As a cluster administrator I want to throttle I/O bandwidths of certain +DaemonSets, and I want the exact throttling values to depend on the SSD model in +my heterogeneous cluster. + +#### Story 4 + +As a user I want to assign a low-priority task to an (RDT) class that limits +the available memory bandwidth. + +### Notes/Constraints/Caveats (Optional) + + + +This is only the first step in getting class-based resources supported in +Kubernetes. Important pieces like resource status, resource disovery and +permission control are [non-goals](#non-goals) not solved here. These aspects +are briefly discussed in [future work](#future-work). The risk in this sort of +piecemeal approach is finding devil in the details, resulting in inconsistent +and/or crippled and/or cumbersome end result. However, there is a lot of +experience in extending the API and understanding which sort of solutions are +functional and practical. + +### Risks and Mitigations + + + +- A user assigns a container to an “unauthorized” class, causing interference and + access to an unwanted set/amount of resources. This will be addressed in a future + KEP introducing permission controls. +- Confusion: a user tries to assign a container to an RDT class but RDT has not been + enabled on system(s). This will be addressed by future KEP(s) introducing + resource discovery and status. +- Keeping client (kubelet) and runtime in sync with respect to available classes. 
Will + be addressed in a future KEP about resource discovery. + +## Design Details + + + +Configuration and management of the resource classes is fully handled by the +underlying container runtime and is invisible to kubelet. An error is returned to the CRI +client if the specified class is not available. + +### CRI protocol + +The following additions to the CRI protocol are suggested. + +The `ContainerConfig` message will be supplemented with a new `class_resources` +field, providing per-container settings for class resources. + + +```diff + message ContainerConfig { + + ... + // Configuration specific to Linux containers. + LinuxContainerConfig linux = 15; + // Configuration specific to Windows containers. + WindowsContainerConfig windows = 16; ++ ++ // Configuration of class resources. ++ ContainerClassResources class_resources = 17; + } + ++// ContainerClassResources specifies the configuration of class based ++// resources of a container. ++message ContainerClassResources { ++ // Resource classes the container will be assigned to ++ map<string, string> classes = 1; ++} +``` + +The `PodSandboxConfig` will be supplemented with a corresponding +`class_resources` field that will be the Pod-level configuration. Depending on +the resource this might be interpreted as a pod-level default (that is used if +nothing is specified in the `ContainerConfig`) or as a true Pod-level setting; +in the end the detailed behavior will be the responsibility of the container +runtime. + +```diff + message PodSandboxConfig { +@@ -45,5 +45,14 @@ message PodSandboxConfig { + LinuxPodSandboxConfig linux = 8; + // Optional configurations specific to Windows hosts. + WindowsPodSandboxConfig windows = 9; ++ // Configuration of class resources. ++ PodClassResources class_resources = 10; ++ + } + ++// PodClassResources specifies the configuration of class based ++// resources of a pod. 
++message PodClassResources { ++ // Resource classes the pod will be assigned to ++ map<string, string> class = 1; ++} +``` + +Also, define "known" class resource types to more easily align container +runtime implementations: + +``` ++ ++const ( ++ // ClassResourceRdt is the name of the RDT class resource ++ ClassResourceRdt = "rdt" ++ // ClassResourceBlockio is the name of the blockio class resource ++ ClassResourceBlockio = "blockio" ++) +``` + +### Pod Spec + +Introduce a new field (e.g. class) into ResourceRequirements of Container. + +```diff +// ResourceRequirements describes the compute resource requirements. +type ResourceRequirements struct { + // Limits describes the maximum amount of compute resources allowed. + Limits ResourceList `json:"limits,omitempty" + // Requests describes the minimum amount of compute resources required. + Requests ResourceList `json:"requests,omitempty" ++ // Classes specifies the resource classes that the container should be assigned ++ Classes map[ClassResourceName]string +} + ++// ClassResourceName is the name of a class-based resource. ++type ClassResourceName string +``` + +Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing +`ResourceRequirements` type but Limits and Requests must be left empty. Classes +may be set and they represent the Pod-level assignment of class resources, +comparable to the PodClassResources message in PodSandboxConfig in the CRI API. + +```diff + type PodSpec struct { +@@ -224,4 +224,8 @@ type PodSpec struct { + // Default to false. + // +optional + SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` ++ // Pod-level resources. Currently, requests and limits are not allowed ++ // to be specified for pods. ++ // +optional ++ Resources ResourceRequirements + } +``` + +In practice, the class resource information will be directly used in the CRI +ContainerConfig (e.g. CreateContainerRequest message). 
At this point, without +resource discovery or access control kubelet does not do any validity checking +of the values. Invalid class assignments will cause an error in the container +runtime. + +Input validation of classes very similar to labels is implemented: keys +(`ClassResourceName`) and values must be non-empty, less than 64 characters +long, must start and end with an alphanumeric character and may contain only +alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). +Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) +in the key is allowed, similar to labels, e.g. `vendor/resource`. + +### Container runtimes + +We have open PRs to implement class-based RDT and blockio support in CRI-O and +containerd: + +- cri-o: + - [~~Add support for Intel RDT~~](https://github.com/cri-o/cri-o/pull/4830) + - [~~Support for cgroups blockio~~](https://github.com/cri-o/cri-o/pull/4873) +- containerd: + - [~~Support Intel RDT~~](https://github.com/containerd/containerd/pull/5439) + - [Support for cgroups blockio](https://github.com/containerd/containerd/pull/5490) + +The design paradigm here is that the container runtime configures the resource +classes according to a given configuration file. Enforcement on containers is +done via OCI. + +### Open Questions + +#### Pod QoS class + +The Pod QoS class could be communicated to the container runtime as a class +resource, too. This information is currently internal to kubelet. However, +container runtimes (CRI-O, at least) already depend on this information +and currently determine it indirectly by evaluating other CRI parameters. It +would be better to explicitly state the Pod QoS class and class resources would +look like a logical place for that. This would also make it technically possible to +have container-specific QoS classes (as a possible future enhancement of K8s). 
+ +Communicating Pod QoS class via class resources would advocate moving class +resources up to `ContainerConfig`. + +Making this change, it would also be possible to separate `oom_score_adj` from +the pod qos class in the future. The runtime could provide a set of OOM +classes, making it possible for the user to specify a burstable pod with low +oom priority (low chance of being killed). + +### Default class + +A mechanism for indicating that the (runtime) default class should be used. The +default class would/should be a node/runtime specific attribute. How should +this be specified in the CRI protocol/`cri-api` and Pod spec? + +### Test Plan + + + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +### Future work + +These topics were stated in [Non-goals](#non-goals) and thus they are strictly +out of the scope of this KEP. However, the sections below briefly outline some +possible solutions for those, in order to better evaluate this KEP in a broader +context. + +#### Resource status/capacity + +This KEP does not speak out anything about presenting the available resource +types (or classes within) to the users. + +Some alternatives for presenting this information: + +1. Supplement `NodeStatus` + + ```diff + // NodeStatus is information about the current status of a node. + type NodeStatus struct { + // Capacity represents the total resources of a node. + // More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity + // +optional + Capacity ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"` + // Allocatable represents the resources of a node that are available for scheduling. + // Defaults to Capacity. 
+ // +optional + Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` + + // ClassResources lists the available class resources + + ClassResources []ClassResourceList + + + +type ClassResourceList struct { + + // Name of the resource + + Name ClassResourceName + + // Classes available in the resource + + Classes []string + +} + ``` +1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't + necessarily align that neatly with the two-level hierarchy (resource name and a + set of classes within). Also, only best suited to homogeneous clusters. + +#### Resource discovery + +Some possible alternatives. + +1. Reported by the container runtime. Container runtime is (or at least should + be) aware of all resource types and the classes within. It could advertise + the resources e.g. via either: + + 1. A separate gRPC endpoint or an updated `StatusResponse` + 1. A (JSON) file populated in a known location + + As a reference, the API currently allows listing of some objects/resources + (Pods, Containers, Images, etc.) but not some others. + +1. Manual configuration. Would be best suited for the case where resources and + classes would be presented as separate API objects. + +#### Access control + +If class resources were advertised as API objects the natural access +control mechanism would be through RBAC. + +If class resources were advertised in node status (similar to other resources), +access control could be achieved e.g. by extending ResourceQuotaSpec which would implement restrictions based on the namespace. + +```diff + // ResourceQuotaSpec defines the desired hard limits to enforce for Quota. + type ResourceQuotaSpec struct { + // hard is the set of desired hard limits for each named resource. + Hard ResourceList + // A collection of filters that must match each object tracked by a quota. + // If not specified, the quota matches all objects. 
+ Scopes []ResourceQuotaScope + // scopeSelector is also a collection of filters like scopes that must match each + // object tracked by a quota but expressed using ScopeSelectorOperator in combination + // with possible values. + ScopeSelector *ScopeSelector ++ // AllowedClasses specifies the list of allowed classes for each class-based resource ++ AllowedClasses map[ClassResourceName]ResourceClassList +} + ++// ResourceClassList is a list of classes of a specific type of class-based resource. ++type ResourceClassList []string +``` + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? 
+ + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +No, enabling or using the feature does not induce any new API calls in +Kubernetes. + +###### Will enabling / using this feature result in introducing new API types? + + + +Class resources do extend existing API types but do not introduce new types of +objects. However, future work (KEPs) enabling resource discovery and permission +control might change this. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No, enabling or using the feature does not result in any new calls to the cloud +provider. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +A new field in `ResourceRequirements` (of `Container`) will increase the size +of `Pod` objects by a few bytes per class requested. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? 
+ + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +### Pod annotations instead of Pod spec changes + +Instead of updating CRI and Pod spec in lock-step, the API change could be +split into two phases, similar to e.g. how seccomp support was added. Adding +support for Pod annotations would provide an initial user interface (behind a +feature gate) for the feature and enable easier testing/verification. These +would bridge the gap between enabling class-based resources in the CRI protocol +and making them available in the Pod spec. + + +1. In the first phase only update the CRI API use Pod annotations +as an intermediate solution for specifying class resources +2. In the second phase deprecate Pod annotations and update the Pod spec + +A feature gate ResourceClassPodAnnotations would be added kubelet to look for +pod annotations and set the RDT and blockio class of containers via CRI +protocol accordingly: + +- `rdt.resources.beta.kubernetes.io/pod` for setting a Pod-level default RDT + class for all containers +- `rdt.resources.beta.kubernetes.io/container.` for + container-specific RDT class settings +- `blockio.resources.beta.kubernetes.io/pod` for setting a Pod-level default + blockio class for all containers +- `blockio.resources.beta.kubernetes.io/container.` for + container-specific blockio class settings + +### RDT-only + +The scope of the KEP could be narrowed down by concentrating on RDT only, +dropping support for blockio. This would center the focus on RDT only which is +well understood and specified in the OCI runtime specification. 
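To illustrate why an RDT-only scope would be comparatively small, the sketch below shows how a runtime might translate a requested RDT class into the `intelRdt` section of an OCI runtime config. The `closID` field name comes from the OCI runtime-spec; the local `intelRdt` struct, the `rdtSpecForClass` helper, the class name `gold`, and the assumption that the class name is used directly as the CLOS identifier are illustrative only and are not defined by this KEP.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// intelRdt is a local, trimmed-down mirror of the `linux.intelRdt` object in
// the OCI runtime-spec. Only closID is modelled in this sketch.
type intelRdt struct {
	ClosID string `json:"closID,omitempty"`
}

// rdtSpecForClass is a hypothetical helper. It returns the intelRdt section
// for a requested class, or nil when no class was requested.
// Assumption: the class name doubles as the CLOS identifier.
func rdtSpecForClass(class string) *intelRdt {
	if class == "" {
		return nil
	}
	return &intelRdt{ClosID: class}
}

func main() {
	// "gold" is a made-up class name used only for illustration.
	section := map[string]*intelRdt{"intelRdt": rdtSpecForClass("gold")}
	out, err := json.Marshal(section)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A blockio class has no equally direct single-field counterpart in this sketch, which is one way to read the argument that RDT is the better-specified starting point.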
+ +### Widen the scope + +The currently chosen strategy of this KEP is "minimum viable product" with +incremental future steps of improving and supplementing the functionality. This +strategy was chosen in order to make the review easier by handling smaller +digestible (but still coherent and self-contained) chunks at a time. + +An alternative would be to widen the scope of this KEP to include some or all of +the subjects mentioned in [future work](#future-work) (i.e. resource discovery, +status/capacity and access control). + +## Infrastructure Needed (Optional) + + + +For proper end-to-end testing of RDT, a cluster with nodes that have RDT +enabled would be required. Similarly, for end-to-end testing of blockio, nodes +with the blockio cgroup controller and a suitable I/O scheduler enabled would be +required.
From e00244e3b068e3d45e479a44ffe47db3fd1bb734 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 6 Jun 2022 17:27:46 +0300 Subject: [PATCH 02/92] KEP-3008: use pod annotations instead of pod spec Narrow the scope to essentially only cover changes to CRI API. Pod spec changes moved to future work. --- .../3008-cri-class-based-resources/README.md | 166 ++++++++++-------- 1 file changed, 90 insertions(+), 76 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index bc3368e9a5d..696939507a3 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -276,11 +276,17 @@ The "Design Details" section below is for the real nitty-gritty. --> -We extend the CRI protocol and Pod spec to contain information about the -class-based resource assignment of containers. +We extend the CRI protocol to contain information about the class-based +resource assignment of containers. 
Currently we identify two types of +resources (RDT and blockio) but the API changes will be generic so that they +will serve other similar resources in the future. -Currently we identify two types of resources (RDT and blockio) but this will be -a generic mechanism that will serve other similar resources in the future. +We implement pod annotations as the initial mechanism for Kubernetes users to +control class resource assignment. We define two class resources that can be +controlled via annotations, i.e. RDT and blockio. + +We introduce a feature gate that enables kubelet to interpret pod annotations +for controlling the RDT and blockio class of containers. ### User Stories (Optional) @@ -322,8 +328,10 @@ This might be a good place to talk about core concepts and how they relate. --> This is only the first step in getting class-based resources supported in -Kubernetes. Important pieces like resource status, resource disovery and -permission control are [non-goals](#non-goals) not solved here. These aspects +Kubernetes. Important pieces like resource assignment via pod spec, resource +status, resource discovery and permission control are [non-goals](#non-goals) +not solved here. +These aspects are briefly discussed in [future work](#future-work). The risk in this sort of piecemeal approach is finding devil in the details, resulting in inconsistent and/or crippled and/or cumbersome end result. However, there is a lot of @@ -434,55 +442,28 @@ runtime implementations: +) ``` -### Pod Spec +### Pod annotations -Introduce a new field (e.g. class) into ResourceRequirements of Container. +Use Pod annotations as the initial K8s user interface, similar to e.g. how +seccomp support was added. This will bridge the gap between enabling +class-based resources in the CRI protocol and making them available in the Pod +spec. -```diff -// ResourceRequirements describes the compute resource requirements. 
-type ResourceRequirements struct { - // Limits describes the maximum amount of compute resources allowed. - Limits ResourceList `json:"limits,omitempty" - // Requests describes the minimum amount of compute resources required. - Requests ResourceList `json:"requests,omitempty" -+ // Classes specifies the resource classes that the container should be assigned -+ Classes map[ClassResourceName]string -} - -+// ClassResourceName is the name of a class-based resource. -+type ClassResourceName string -``` +A feature gate ClassResourcePodAnnotations enables kubelet to look for pod +annotations and set the class resource assignment via CRI protocol accordingly. -Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing -`ResourceRequirements` type but Limits and Requests must be left empty. Classes -may be set and they represent the Pod-level assignment of class resources, -comparable to the PodClassResources message in PodSandboxConfig in the CRI API. +Specifically, kubelet will support annotations for specifying RDT and blockio +class, the two types of class resources that already have basic support in the +container runtimes. -```diff - type PodSpec struct { -@@ -224,4 +224,8 @@ type PodSpec struct { - // Default to false. - // +optional - SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` -+ // Pod-level resources. Currently, requests and limits are not allowed -+ // to be specified for pods. -+ // +optional -+ Resources ResourceRequirements - } -``` - -In practice, the class resource information will be directly used in the CRI -ContainerConfig (e.g. CreateContainerRequest message). At this point, without -resource discovery or access control kubelet does not do any validity checking -of the values. Invalid class assignments will cause an error in the container -runtime. 
- -Input validation of classes very similar to labels is implemented: keys -(`ClassResourceName`) and values must be non-empty, less than 64 characters -long, must start and end with an alphanumeric character and may contain only -alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). -Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) -in the key is allowed, similar to labels, e.g. `vendor/resource`. +- `rdt.resources.beta.kubernetes.io/pod` for setting a Pod-level default RDT + class for all containers +- `rdt.resources.beta.kubernetes.io/container.` for + container-specific RDT class settings +- `blockio.resources.beta.kubernetes.io/pod` for setting a Pod-level default + blockio class for all containers +- `blockio.resources.beta.kubernetes.io/container.` for + container-specific blockio class settings ### Container runtimes @@ -644,6 +625,59 @@ out of the scope of this KEP. However, the sections below briefly outline some possible solutions for those, in order to better evaluate this KEP in a broader context. +### Pod Spec + +Replace pod annotations with proper user interface via the Pod spec. Below, one +possible option is presented. + +Introduce a new field (e.g. class) into ResourceRequirements of Container. + +```diff +// ResourceRequirements describes the compute resource requirements. +type ResourceRequirements struct { + // Limits describes the maximum amount of compute resources allowed. + Limits ResourceList `json:"limits,omitempty" + // Requests describes the minimum amount of compute resources required. + Requests ResourceList `json:"requests,omitempty" ++ // Classes specifies the resource classes that the container should be assigned ++ Classes map[ClassResourceName]string +} + ++// ClassResourceName is the name of a class-based resource. ++type ClassResourceName string +``` + +Also, we add a `Resources` field to the `PodSpec`. 
We will re-use the existing +`ResourceRequirements` type but Limits and Requests must be left empty. Classes +may be set and they represent the Pod-level assignment of class resources, +comparable to the PodClassResources message in PodSandboxConfig in the CRI API. + +```diff + type PodSpec struct { +@@ -224,4 +224,8 @@ type PodSpec struct { + // Default to false. + // +optional + SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` ++ // Pod-level resources. Currently, requests and limits are not allowed ++ // to be specified for pods. ++ // +optional ++ Resources ResourceRequirements + } +``` + +In practice, the class resource information will be directly used in the CRI +ContainerConfig (e.g. CreateContainerRequest message). At this point, without +resource discovery or access control kubelet does not do any validity checking +of the values. Invalid class assignments will cause an error in the container +runtime. + +Input validation of classes very similar to labels is implemented: keys +(`ClassResourceName`) and values must be non-empty, less than 64 characters +long, must start and end with an alphanumeric character and may contain only +alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). +Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) +in the key is allowed, similar to labels, e.g. `vendor/resource`. + #### Resource status/capacity This KEP does not speak out anything about presenting the available resource @@ -1073,32 +1107,12 @@ not need to be as detailed as the proposal, but should include enough information to express the idea and why it was not acceptable. --> -### Pod annotations instead of Pod spec changes - -Instead of updating CRI and Pod spec in lock-step, the API change could be -split into two phases, similar to e.g. how seccomp support was added. 
Adding -support for Pod annotations would provide an initial user interface (behind a -feature gate) for the feature and enable easier testing/verification. These -would bridge the gap between enabling class-based resources in the CRI protocol -and making them available in the Pod spec. - +### Pod spec -1. In the first phase only update the CRI API use Pod annotations -as an intermediate solution for specifying class resources -2. In the second phase deprecate Pod annotations and update the Pod spec - -A feature gate ResourceClassPodAnnotations would be added kubelet to look for -pod annotations and set the RDT and blockio class of containers via CRI -protocol accordingly: - -- `rdt.resources.beta.kubernetes.io/pod` for setting a Pod-level default RDT - class for all containers -- `rdt.resources.beta.kubernetes.io/container.` for - container-specific RDT class settings -- `blockio.resources.beta.kubernetes.io/pod` for setting a Pod-level default - blockio class for all containers -- `blockio.resources.beta.kubernetes.io/container.` for - container-specific blockio class settings +Instead of introducing Pod annotations as an intermediate solution for +controlling the class resources, the Pod spec could be updated in lock-step +with the CRI api. See the section [(Future work) Pod spec](#pod-spec) for more +details. 
### RDT-only From 23ca44791aa23338acdd59e39ea51e1539d56401 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 6 Jun 2022 17:38:04 +0300 Subject: [PATCH 03/92] KEP-3008: fix wording and terminology --- .../3008-cri-class-based-resources/README.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 696939507a3..0fc204499e6 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -232,11 +232,11 @@ is able to configure blockio controller parameters on per-class basis and the classes are then made available for CRI clients. Currently, there is no mechanism in Kubernetes to use these types of resources -in Kubernetes. CRI-O and containerd runtimes have support for RDT and blockio -classes and they provide an bridge-gap user interface through special pod -annotations. We would like to eventually get these types of resources first -class citizen and properly supported in Kubernetes, providing visibility, a -well-defined user interface, and permission controls. +. CRI-O and containerd runtimes have support for RDT and blockio classes and +they provide a bridge-gap user interface through special pod annotations. We +would like to eventually make these types of resources first-class citizens, +properly supported in Kubernetes, providing visibility, a well-defined user +interface, and permission controls. ### Goals @@ -370,7 +370,7 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> -Configuration and management of the resource classes is fully handled by the +Configuration and management of the class resources is fully handled by the underlying container runtime and is invisible to kubelet. 
An error to the CRI client is returned if the specified class is not available. @@ -467,7 +467,7 @@ container runtimes. ### Container runtimes -We have open PRs to implement class-based RDT and blockio support in CRI-O and +We have implemented class-based RDT and blockio support in CRI-O and containerd: - cri-o: @@ -475,11 +475,11 @@ containerd: - [~~Support for cgroups blockio~~](https://github.com/cri-o/cri-o/pull/4873) - containerd: - [~~Support Intel RDT~~](https://github.com/containerd/containerd/pull/5439) - - [Support for cgroups blockio](https://github.com/containerd/containerd/pull/5490) + - [~~Support for cgroups blockio~~](https://github.com/containerd/containerd/pull/5490) The design paradigm here is that the container runtime configures the resource classes according to a given configuration file. Enforcement on containers is -done via OCI. +done via OCI. User interface is provided through pod and container annotations. ### Open Questions @@ -639,7 +639,7 @@ type ResourceRequirements struct { Limits ResourceList `json:"limits,omitempty" // Requests describes the minimum amount of compute resources required. 
Requests ResourceList `json:"requests,omitempty" -+ // Classes specifies the resource classes that the container should be assigned ++ // Classes specifies the class resources that the container should be assigned + Classes map[ClassResourceName]string } From 272fe11d9784f2113e0396ab1859c3b0fd3048aa Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 8 Jun 2022 11:50:02 +0300 Subject: [PATCH 04/92] Update keps/sig-node/3008-cri-class-based-resources/README.md Co-authored-by: Tyler Stapler --- keps/sig-node/3008-cri-class-based-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 0fc204499e6..69eba51d44c 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -720,7 +720,7 @@ Some possible alternatives. be) aware of all resource types and the classes within. It could advertise the resources e.g. via either: - 1. A separate gRPC endpoint or update `StatusResponse + 1. A separate gRPC endpoint or update `StatusResponse` 1. OR Populate a (json) file in a known location As a reference, the API currently allows listing of some objects/resources From 3b511f439df243989ed59f9c72769906debb98c6 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 8 Jun 2022 14:19:00 +0300 Subject: [PATCH 05/92] KEP-3008: update toc --- keps/sig-node/3008-cri-class-based-resources/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 69eba51d44c..faf6dd529d5 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -92,7 +92,7 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [CRI protocol](#cri-protocol) - - [Pod Spec](#pod-spec) + - [Pod annotations](#pod-annotations) - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - [Pod QoS class](#pod-qos-class) @@ -102,6 +102,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Future work](#future-work) + - [Pod Spec](#pod-spec) - [Resource status/capacity](#resource-statuscapacity) - [Resource discovery](#resource-discovery) - [Access control](#access-control) @@ -115,7 +116,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - - [Pod annotations instead of Pod spec changes](#pod-annotations-instead-of-pod-spec-changes) + - [Pod spec](#pod-spec-1) - [RDT-only](#rdt-only) - [Widen the scope](#widen-the-scope) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) From ec54a5fc53e9cbf1486bc3f150c84c8045b0fdbb Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 13 Jun 2022 17:00:41 +0300 Subject: [PATCH 06/92] KEP-3008: sync with latest KEP template --- .../3008-cri-class-based-resources/README.md | 103 ++++++++++++++++-- 1 file changed, 92 insertions(+), 11 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index faf6dd529d5..8575b281250 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -98,6 +98,10 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Pod QoS class](#pod-qos-class) - [Default class](#default-class) - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) @@ -512,14 +516,7 @@ this be specified in the CRI protocol/`cri-api` and Pod spec? +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + +- : + +##### e2e tests + + + +- : + ### Graduation Criteria - [ ] Feature gate (also fill in values in `kep.yaml`) @@ -817,6 +882,10 @@ automations, so be extremely careful here. Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?). +Feature gates are typically disabled by setting the flag to `false` and +restarting the component. No other changes should be necessary to disable the +feature. + NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. --> @@ -829,6 +898,12 @@ The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified. + +Additionally, for features that are introducing a new API field, unit tests that +are exercising the `switch` of feature gate itself (what happens if I disable a +feature gate after having objects written with the new field) are also critical. 
+You can take a look at one potential example of such test in: +https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 --> ### Rollout, Upgrade and Rollback Planning @@ -874,6 +949,9 @@ Even if applying deprecation policies, they may still surprise some users. ###### How can an operator determine if the feature is in use by workloads? @@ -1057,6 +1135,9 @@ This through this both in small and large cases, again with respect to the +#### Beta + +- Gather feedback from developers and surveys +- In addition to the simple change in CRI API, implement the following + - Pod spec update + - Resource discovery + - Resource status/capacity (with scheduling) + - Permission control +- Well-defined behavior with [In-place pod vertical scaling](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources) +- Additional tests are in Testgrid and linked in KEP + +#### GA + +- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- Allowing time for feedback + + ### Upgrade / Downgrade Strategy -[ ] I/we understand the owners of the involved components may require updates to +[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. @@ -559,7 +559,8 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement. --> -- ``: `` - `` +- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `2022-06-13` - `66.8%` +- `k8s.io/kubernetes/pkg/apis/core/validation/validation.go`: `2022-06-13` - `82.1%` ##### Integration tests @@ -571,6 +572,11 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> +Alpha: no specific integration tests are planned for Alpha. 
+ +Beta: Existing integration tests for affected components (e.g. scheduler, node +status, quota) are extended to cover class resources. + - : ##### e2e tests @@ -585,7 +591,10 @@ https://storage.googleapis.com/k8s-triage/index.html We expect no non-infra related flakes in the last month as a GA graduation criteria. --> -- : +Alpha: no specific e2e-tests are planned. + +In order to be able to run e2e tests, a cluster with nodes having runtime +support for class resources is required. ### Graduation Criteria From e70a27e1cb28033fd271ca9e1e0fe42714967021 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 13 Jun 2022 18:53:46 +0300 Subject: [PATCH 09/92] KEP-3008: add kep.yaml Update feature gate name in KEP. --- .../3008-cri-class-based-resources/README.md | 2 +- .../3008-cri-class-based-resources/kep.yaml | 33 +++++++++++++++++++ 2 files changed, 34 insertions(+), 1 deletion(-) create mode 100644 keps/sig-node/3008-cri-class-based-resources/kep.yaml diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index ca17d1563c3..e3382a5c381 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -456,7 +456,7 @@ seccomp support was added. This will bridge the gap between enabling class-based resources in the CRI protocol and making them available in the Pod spec. -A feature gate ClassResourcePodAnnotations enables kubelet to look for pod +A feature gate ClassResources enables kubelet to look for pod annotations and set the class resource assignment via CRI protocol accordingly. 
Specifically, kubelet will support annotations for specifying RDT and blockio diff --git a/keps/sig-node/3008-cri-class-based-resources/kep.yaml b/keps/sig-node/3008-cri-class-based-resources/kep.yaml new file mode 100644 index 00000000000..3a323874f00 --- /dev/null +++ b/keps/sig-node/3008-cri-class-based-resources/kep.yaml @@ -0,0 +1,33 @@ +title: Class-based resources +kep-number: 3008 +authors: + - "@marquiz" +owning-sig: sig-node +participating-sigs: [] +status: provisional +creation-date: 2021-10-07 +reviewers: [] +approvers: [] + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.25" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.25" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ClassResources + components: + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: [] From 56290f1ae0c9ae2f073d5688711cadb538e19088 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 13 Jun 2022 19:04:04 +0300 Subject: [PATCH 10/92] KEP-3008: add prod-readiness/sig-node/3008.yaml Placeholder. 
--- keps/prod-readiness/sig-node/3008.yaml | 6 ++++++ 1 file changed, 6 insertions(+) create mode 100644 keps/prod-readiness/sig-node/3008.yaml diff --git a/keps/prod-readiness/sig-node/3008.yaml b/keps/prod-readiness/sig-node/3008.yaml new file mode 100644 index 00000000000..f2256689af2 --- /dev/null +++ b/keps/prod-readiness/sig-node/3008.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 3008 +alpha: + approver: "" From 861e0e806530fcc0b4d6f299ba637f838aebd787 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 14 Jun 2022 21:09:56 +0300 Subject: [PATCH 11/92] KEP-3008: move "future work" around Also slight re-wording. --- .../3008-cri-class-based-resources/README.md | 294 +++++++++--------- 1 file changed, 148 insertions(+), 146 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index e3382a5c381..2aac12aaac4 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -96,7 +96,12 @@ tags, and then generate with `hack/update-toc.sh`. - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - [Pod QoS class](#pod-qos-class) - - [Default class](#default-class) + - [Default class](#default-class) + - [Future work](#future-work) + - [Pod Spec](#pod-spec) + - [Resource status/capacity](#resource-statuscapacity) + - [Resource discovery](#resource-discovery) + - [Access control](#access-control) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -107,11 +112,6 @@ tags, and then generate with `hack/update-toc.sh`. 
- [GA](#ga) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - - [Future work](#future-work) - - [Pod Spec](#pod-spec) - - [Resource status/capacity](#resource-statuscapacity) - - [Resource discovery](#resource-discovery) - - [Access control](#access-control) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback) - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) @@ -508,12 +508,153 @@ the pod qos class in the future. The runtime could provide a set of OOM classes, making it possible for the user to specify a burstable pod with low oom priority (low chance of being killed). -### Default class +#### Default class A mechanism for indicating that the (runtime) default class should be used. The default class would/should be a node/runtime specific attribute. How should this be specified in the CRI protocol/`cri-api` and Pod spec? +### Future work + +This section sheds light on the end goal of this work in order to better +evaluate this KEP in a broader context. What a fully working solution would +consists of and what the (next) steps to accomplish that would be. These topics +are currently out of the scope of this KEP and were listed under +[Non-goals](#non-goals). + +#### Pod Spec + +Replace pod annotations with proper user interface via the Pod spec. Below, one +possible option is presented. + +Introduce a new field (e.g. class) into ResourceRequirements of Container. + +```diff +// ResourceRequirements describes the compute resource requirements. +type ResourceRequirements struct { + // Limits describes the maximum amount of compute resources allowed. + Limits ResourceList `json:"limits,omitempty" + // Requests describes the minimum amount of compute resources required. 
+ Requests ResourceList `json:"requests,omitempty" ++ // Classes specifies the class resources that the container should be assigned ++ Classes map[ClassResourceName]string +} + ++// ClassResourceName is the name of a class-based resource. ++type ClassResourceName string +``` + +Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing +`ResourceRequirements` type but Limits and Requests must be left empty. Classes +may be set and they represent the Pod-level assignment of class resources, +comparable to the PodClassResources message in PodSandboxConfig in the CRI API. + +```diff + type PodSpec struct { +@@ -224,4 +224,8 @@ type PodSpec struct { + // Default to false. + // +optional + SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` ++ // Pod-level resources. Currently, requests and limits are not allowed ++ // to be specified for pods. ++ // +optional ++ Resources ResourceRequirements + } +``` + +In practice, the class resource information will be directly used in the CRI +ContainerConfig (e.g. CreateContainerRequest message). At this point, without +resource discovery or access control kubelet does not do any validity checking +of the values. Invalid class assignments will cause an error in the container +runtime. + +Input validation of classes very similar to labels is implemented: keys +(`ClassResourceName`) and values must be non-empty, less than 64 characters +long, must start and end with an alphanumeric character and may contain only +alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). +Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) +in the key is allowed, similar to labels, e.g. `vendor/resource`. + +#### Resource status/capacity + +This KEP does not speak out anything about presenting the available resource +types (or classes within) to the users. + +Some alternatives for presenting this information: + +1. 
Supplement `NodeStatus` + + ```diff + // NodeStatus is information about the current status of a node. + type NodeStatus struct { + // Capacity represents the total resources of a node. + // More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity + // +optional + Capacity ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"` + // Allocatable represents the resources of a node that are available for scheduling. + // Defaults to Capacity. + // +optional + Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` + + // ResourceClasses lists the available + + ClassResourdes []ClassResourceList + + + +type ClassResourceList { + + // Name of the resource + + Name ClassResourceName + + // Classes available in the resource + + Classes []string + +} + ``` +1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't + necessarily that neatly align with two level hierarchy (resource name and a + set of classes within). Also, only best suited to homogenous clusters. + +#### Resource discovery + +Some possible alternatives. + +1. Reported by the container runtime. Container runtime is (or at least should + be) aware of all resource types and the classes within. It could advertise + the resources e.g. via either: + + 1. A separate gRPC endpoint or update `StatusResponse` + 1. OR Populate a (json) file in a known location + + As a reference, the API currently allows listing of some objects/resources + (Pods, Containers, Images etc) but not some others. + +1. Manual configuration. Would be best suited for case where resources and + classes would be presented as separate API objects. + +#### Access control + +If class resources were advertised as API objects the natural access +control mechanism would be through RBAC. 
+ +If class resources were advertised in node status (similar to other resources), +access control could be achieved e.g. by extending ResourceQuotaSpec which would implement restrictions based on the namespace. + +```diff + // ResourceQuotaSpec defines the desired hard limits to enforce for Quota. + type ResourceQuotaSpec struct { + // hard is the set of desired hard limits for each named resource. + Hard ResourceList + // A collection of filters that must match each object tracked by a quota. + // If not specified, the quota matches all objects. + Scopes []ResourceQuotaScope + // scopeSelector is also a collection of filters like scopes that must match each + // object tracked by a quota but expressed using ScopeSelectorOperator in combination + // with possible values. + ScopeSelector *ScopeSelector ++ // AllowedClasses specifies the list of allowed classes for each class-based resource ++ AllowedClasses map[ClassResourceName]ResourceClassList +} + ++// ResourceClassList is a list of classes of a specific type of class-based resource. ++type ResourceClassList []string +``` + + ### Test Plan -### Future work - -These topics were stated in [Non-goals](#non-goals) and thus they are strictly -out of the scope of this KEP. However, the sections below briefly outline some -possible solutions for those, in order to better evaluate this KEP in a broader -context. - -### Pod Spec - -Replace pod annotations with proper user interface via the Pod spec. Below, one -possible option is presented. - -Introduce a new field (e.g. class) into ResourceRequirements of Container. - -```diff -// ResourceRequirements describes the compute resource requirements. -type ResourceRequirements struct { - // Limits describes the maximum amount of compute resources allowed. - Limits ResourceList `json:"limits,omitempty" - // Requests describes the minimum amount of compute resources required. 
- Requests ResourceList `json:"requests,omitempty" -+ // Classes specifies the class resources that the container should be assigned -+ Classes map[ClassResourceName]string -} - -+// ClassResourceName is the name of a class-based resource. -+type ClassResourceName string -``` - -Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing -`ResourceRequirements` type but Limits and Requests must be left empty. Classes -may be set and they represent the Pod-level assignment of class resources, -comparable to the PodClassResources message in PodSandboxConfig in the CRI API. - -```diff - type PodSpec struct { -@@ -224,4 +224,8 @@ type PodSpec struct { - // Default to false. - // +optional - SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` -+ // Pod-level resources. Currently, requests and limits are not allowed -+ // to be specified for pods. -+ // +optional -+ Resources ResourceRequirements - } -``` - -In practice, the class resource information will be directly used in the CRI -ContainerConfig (e.g. CreateContainerRequest message). At this point, without -resource discovery or access control kubelet does not do any validity checking -of the values. Invalid class assignments will cause an error in the container -runtime. - -Input validation of classes very similar to labels is implemented: keys -(`ClassResourceName`) and values must be non-empty, less than 64 characters -long, must start and end with an alphanumeric character and may contain only -alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). -Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) -in the key is allowed, similar to labels, e.g. `vendor/resource`. - -#### Resource status/capacity - -This KEP does not speak out anything about presenting the available resource -types (or classes within) to the users. - -Some alternatives for presenting this information: - -1. 
Supplement `NodeStatus` - - ```diff - // NodeStatus is information about the current status of a node. - type NodeStatus struct { - // Capacity represents the total resources of a node. - // More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity - // +optional - Capacity ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"` - // Allocatable represents the resources of a node that are available for scheduling. - // Defaults to Capacity. - // +optional - Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` - + // ResourceClasses lists the available - + ClassResourdes []ClassResourceList - + - +type ClassResourceList { - + // Name of the resource - + Name ClassResourceName - + // Classes available in the resource - + Classes []string - +} - ``` -1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't - necessarily that neatly align with two level hierarchy (resource name and a - set of classes within). Also, only best suited to homogenous clusters. - -#### Resource discovery - -Some possible alternatives. - -1. Reported by the container runtime. Container runtime is (or at least should - be) aware of all resource types and the classes within. It could advertise - the resources e.g. via either: - - 1. A separate gRPC endpoint or update `StatusResponse` - 1. OR Populate a (json) file in a known location - - As a reference, the API currently allows listing of some objects/resources - (Pods, Containers, Images etc) but not some others. - -1. Manual configuration. Would be best suited for case where resources and - classes would be presented as separate API objects. - -#### Access control - -If class resources were advertised as API objects the natural access -control mechanism would be through RBAC. 
- -If class resources were advertised in node status (similar to other resources), -access control could be achieved e.g. by extending ResourceQuotaSpec which would implement restrictions based on the namespace. - -```diff - // ResourceQuotaSpec defines the desired hard limits to enforce for Quota. - type ResourceQuotaSpec struct { - // hard is the set of desired hard limits for each named resource. - Hard ResourceList - // A collection of filters that must match each object tracked by a quota. - // If not specified, the quota matches all objects. - Scopes []ResourceQuotaScope - // scopeSelector is also a collection of filters like scopes that must match each - // object tracked by a quota but expressed using ScopeSelectorOperator in combination - // with possible values. - ScopeSelector *ScopeSelector -+ // AllowedClasses specifies the list of allowed classes for each class-based resource -+ AllowedClasses map[ClassResourceName]ResourceClassList -} - -+// ResourceClassList is a list of classes of a specific type of class-based resource. -+type ResourceClassList []string -``` - ## Production Readiness Review Questionnaire -- Interface for configuring the class-based resources. +- Interface or mechanism for configuring the class resources (responsibility of + the container runtime). - Enumerating possible (class) resource types or their detailed behavior - Resource status/capacity (will be addressed in a separate KEP) - Discovery of the class-based resources (will be addressed in a separate KEP) @@ -566,7 +567,8 @@ In practice, the class resource information will be directly used in the CRI ContainerConfig (e.g. CreateContainerRequest message). At this point, without resource discovery or access control kubelet does not do any validity checking of the values. Invalid class assignments will cause an error in the container -runtime. +runtime which causes the corresponding CRI RuntimeService request (e.g. +RunPodSandbox or CreateContainer) to fail with an error. 
Input validation of classes very similar to labels is implemented: keys (`ClassResourceName`) and values must be non-empty, less than 64 characters From 1c360df70743134840e17bd6fad55fd1bcd390ab Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 15 Jun 2022 17:45:35 +0300 Subject: [PATCH 13/92] KEP-3008: move "Future work" section Move the Future work section upwards, under a new Implementation phases main section. --- .../3008-cri-class-based-resources/README.md | 313 ++++++++++-------- 1 file changed, 167 insertions(+), 146 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 35ec1de14cb..0d49a95d6ca 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -82,6 +82,13 @@ tags, and then generate with `hack/update-toc.sh`. - [Motivation](#motivation) - [Goals](#goals) - [Non-Goals](#non-goals) +- [Implementation phases](#implementation-phases) + - [Phase 1](#phase-1) + - [Future work](#future-work) + - [Pod Spec](#pod-spec) + - [Resource status/capacity](#resource-statuscapacity) + - [Resource discovery](#resource-discovery) + - [Access control](#access-control) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) @@ -97,11 +104,6 @@ tags, and then generate with `hack/update-toc.sh`. - [Open Questions](#open-questions) - [Pod QoS class](#pod-qos-class) - [Default class](#default-class) - - [Future work](#future-work) - - [Pod Spec](#pod-spec) - - [Resource status/capacity](#resource-statuscapacity) - - [Resource discovery](#resource-discovery) - - [Access control](#access-control) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -273,6 +275,166 @@ and make progress. 
- Discovery of the class-based resources (will be addressed in a separate KEP) - Access control (will be addressed in a separate KEP) +## Implementation phases + +We have split the full implementation of class resources into multiple phases, +building functionality gradually, step-bu-step. The goal is to make the +discussions more focused and easier. We may also learn on the way, insights +from earlier phases affecting design choises made in the later phases, +hopefully resulting in a better overall end result. However, we also outline +all the future steps to not loose the overall big picture. + +### Phase 1 + +This KEP (the [Proposal](#proposal)) implements the first phase. The goal is to +enable a bare minimum for users to leverage class resources and start +experimenting with them in Kubernetes: + +- extend the CRI protocol to allow class resource assignment +- implement pod annotations as an initial user interface +- introduce a feature gate for enabling class resource support in kubelet + +### Future work + +This section sheds light on the end goal of this work in order to better +evaluate this KEP in a broader context. What a fully working solution would +consists of and what the (next) steps to accomplish that would be. These topics +are currently out of the scope of this KEP and were listed under +[Non-goals](#non-goals). + +#### Pod Spec + +Replace pod annotations with proper user interface via the Pod spec. Below, one +possible option is presented. + +Introduce a new field (e.g. class) into ResourceRequirements of Container. + +```diff +// ResourceRequirements describes the compute resource requirements. +type ResourceRequirements struct { + // Limits describes the maximum amount of compute resources allowed. + Limits ResourceList `json:"limits,omitempty" + // Requests describes the minimum amount of compute resources required. 
+ Requests ResourceList `json:"requests,omitempty" ++ // Classes specifies the class resources that the container should be assigned ++ Classes map[ClassResourceName]string +} + ++// ClassResourceName is the name of a class-based resource. ++type ClassResourceName string +``` + +Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing +`ResourceRequirements` type but Limits and Requests must be left empty. Classes +may be set and they represent the Pod-level assignment of class resources, +comparable to the PodClassResources message in PodSandboxConfig in the CRI API. + +```diff + type PodSpec struct { +@@ -224,4 +224,8 @@ type PodSpec struct { + // Default to false. + // +optional + SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` ++ // Pod-level resources. Currently, requests and limits are not allowed ++ // to be specified for pods. ++ // +optional ++ Resources ResourceRequirements + } +``` + +In practice, the class resource information will be directly used in the CRI +ContainerConfig (e.g. CreateContainerRequest message). At this point, without +resource discovery or access control kubelet does not do any validity checking +of the values. Invalid class assignments will cause an error in the container +runtime which causes the corresponding CRI RuntimeService request (e.g. +RunPodSandbox or CreateContainer) to fail with an error. + +Input validation of classes very similar to labels is implemented: keys +(`ClassResourceName`) and values must be non-empty, less than 64 characters +long, must start and end with an alphanumeric character and may contain only +alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). +Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) +in the key is allowed, similar to labels, e.g. `vendor/resource`. 
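As an illustration of the label-like validation rules described above, here is a minimal Go sketch. It is not taken from the KEP or from kubelet: the function names (`validateName`, `validateClassResource`), the regular expressions, and the 253-character limit on the namespace prefix are assumptions made for this example, loosely modelled on how label keys are commonly validated.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// maxNameLen follows the rule that names must be "less than 64 characters long".
const maxNameLen = 63

// nameRe accepts strings that start and end with an alphanumeric character and
// contain only alphanumerics, dashes, underscores or dots in between.
var nameRe = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?$`)

// prefixRe approximates a DNS subdomain. ASSUMPTION: lowercase labels separated
// by dots, borrowed from label-key conventions; the KEP only says "FQDN subdomain".
var prefixRe = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$`)

// validateName checks a single name (the key without prefix, or a class value).
func validateName(s string) error {
	if s == "" {
		return fmt.Errorf("must be non-empty")
	}
	if len(s) > maxNameLen {
		return fmt.Errorf("%q is longer than %d characters", s, maxNameLen)
	}
	if !nameRe.MatchString(s) {
		return fmt.Errorf("%q contains invalid characters or does not start and end with an alphanumeric", s)
	}
	return nil
}

// validateClassResource checks one key/value pair of a Classes map
// (hypothetical helper). The key may carry an optional "prefix/" part.
func validateClassResource(key, class string) error {
	name := key
	if i := strings.Index(key, "/"); i >= 0 {
		prefix := key[:i]
		name = key[i+1:]
		// ASSUMPTION: 253 is the usual DNS subdomain length limit.
		if prefix == "" || len(prefix) > 253 || !prefixRe.MatchString(prefix) {
			return fmt.Errorf("invalid namespace prefix %q", prefix)
		}
	}
	if err := validateName(name); err != nil {
		return fmt.Errorf("invalid resource name: %w", err)
	}
	if err := validateName(class); err != nil {
		return fmt.Errorf("invalid class: %w", err)
	}
	return nil
}
```

Under these assumed rules, a pair such as `vendor/resource` with class `gold` passes, while an empty key, a class starting with a dash, or a 64-character class name is rejected.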
+ +#### Resource status/capacity + +This KEP does not speak out anything about presenting the available resource +types (or classes within) to the users. + +Some alternatives for presenting this information: + +1. Supplement `NodeStatus` + + ```diff + // NodeStatus is information about the current status of a node. + type NodeStatus struct { + // Capacity represents the total resources of a node. + // More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity + // +optional + Capacity ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"` + // Allocatable represents the resources of a node that are available for scheduling. + // Defaults to Capacity. + // +optional + Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` + + // ResourceClasses lists the available + + ClassResourdes []ClassResourceList + + + +type ClassResourceList { + + // Name of the resource + + Name ClassResourceName + + // Classes available in the resource + + Classes []string + +} + ``` +1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't + necessarily that neatly align with two level hierarchy (resource name and a + set of classes within). Also, only best suited to homogenous clusters. + +#### Resource discovery + +Some possible alternatives. + +1. Reported by the container runtime. Container runtime is (or at least should + be) aware of all resource types and the classes within. It could advertise + the resources e.g. via either: + + 1. A separate gRPC endpoint or update `StatusResponse` + 1. OR Populate a (json) file in a known location + + As a reference, the API currently allows listing of some objects/resources + (Pods, Containers, Images etc) but not some others. + +1. Manual configuration. Would be best suited for case where resources and + classes would be presented as separate API objects. 
+ +#### Access control + +If class resources were advertised as API objects the natural access +control mechanism would be through RBAC. + +If class resources were advertised in node status (similar to other resources), +access control could be achieved e.g. by extending ResourceQuotaSpec which +would implement restrictions based on the namespace. + +```diff + // ResourceQuotaSpec defines the desired hard limits to enforce for Quota. + type ResourceQuotaSpec struct { + // hard is the set of desired hard limits for each named resource. + Hard ResourceList + // A collection of filters that must match each object tracked by a quota. + // If not specified, the quota matches all objects. + Scopes []ResourceQuotaScope + // scopeSelector is also a collection of filters like scopes that must match each + // object tracked by a quota but expressed using ScopeSelectorOperator in combination + // with possible values. + ScopeSelector *ScopeSelector ++ // AllowedClasses specifies the list of allowed classes for each class-based resource ++ AllowedClasses map[ClassResourceName]ResourceClassList +} + ++// ResourceClassList is a list of classes of a specific type of class-based resource. ++type ResourceClassList []string +``` ## Proposal -# KEP-3008: Class-based resources +# [KEP-3008](#3008): QoS-class resources -We would like to add support for class-based resources in Kubernetes. -Class-based resources can be thought of as non-accountable resources, each of +We would like to add support for QoS-class resources in Kubernetes. +QoS-class resources can be thought of as non-accountable resources, each of which is presented by a set of classes. Being non-accountable means that multiple containers can be assigned to the same class. They are also supposed to be opaque to the CRI client in the sense that the container runtime takes care of configuration and control of the resources and the classes within. 
-A prime example of a class-based resource is Intel RDT (Resource Director +A prime example of a QoS-class resource is Intel RDT (Resource Director Technology). RDT is a technology for controlling the cache lines and memory bandwidth available to applications. RDT provides a class-based approach for QoS control of these shared resources: all processes in the same hardware class share a portion of cache lines and memory bandwidth. We also believe that the Linux Block IO controller (cgroup) should be handled -as a class-based resource on the level of container orchestration. This enables +as a QoS-class resource on the level of container orchestration. This enables configuring I/O scheduler priority and throttling I/O bandwidth per workload. -Having the support for class-based resources in place, it will provide a +Having the support for QoS-class resources in place, it will provide a framework for the future, for instance class-based network or memory type prioritization. @@ -254,11 +254,11 @@ List the specific goals of the KEP. What is it trying to achieve? How will we know that this has succeeded? --> -- Make it possible to request class resources +- Make it possible to request QoS-class resources - Support RDT class assignment of containers. This is already supported by the containerd and CRI-O runtime and part of the OCI runtime-spec - Support blockio class assignment of containers. -- Make the extensions flexible, enabling simple addition of other class-based +- Make the extensions flexible, enabling simple addition of other QoS-class resource types in the future. ### Non-Goals @@ -268,16 +268,16 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion and make progress. --> -- Interface or mechanism for configuring the class resources (responsibility of +- Interface or mechanism for configuring the QoS-class resources (responsibility of the container runtime). 
-- Enumerating possible (class) resource types or their detailed behavior +- Enumerating possible (QoS-class) resource types or their detailed behavior - Resource status/capacity (will be addressed in a separate KEP) -- Discovery of the class-based resources (will be addressed in a separate KEP) +- Discovery of the QoS-class resources (will be addressed in a separate KEP) - Access control (will be addressed in a separate KEP) ## Implementation phases -We have split the full implementation of class resources into multiple phases, +We have split the full implementation of QoS-class resources into multiple phases, building functionality gradually, step-by-step. The goal is to make the discussions more focused and easier. We may also learn on the way, insights from earlier phases affecting design choises made in the later phases, @@ -287,12 +287,12 @@ all the future steps to not lose the overall big picture. ### Phase 1 This KEP (the [Proposal](#proposal)) implements the first phase. The goal is to -enable a bare minimum for users to leverage class resources and start +enable a bare minimum for users to leverage QoS-class resources and start experimenting with them in Kubernetes: -- extend the CRI protocol to allow class resource assignment +- extend the CRI protocol to allow QoS-class resource assignment - implement pod annotations as an initial user interface -- introduce a feature gate for enabling class resource support in kubelet +- introduce a feature gate for enabling QoS-class resource support in kubelet ### Future work @@ -316,17 +316,17 @@ type ResourceRequirements struct { Limits ResourceList `json:"limits,omitempty" // Requests describes the minimum amount of compute resources required. 
Requests ResourceList `json:"requests,omitempty" -+ // Classes specifies the class resources that the container should be assigned ++ // Classes specifies the QoS-class resources that the container should be assigned + Classes map[ClassResourceName]string } -+// ClassResourceName is the name of a class-based resource. ++// ClassResourceName is the name of a QoS-class resource. +type ClassResourceName string ``` Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing `ResourceRequirements` type but Limits and Requests must be left empty. Classes -may be set and they represent the Pod-level assignment of class resources, +may be set and they represent the Pod-level assignment of QoS-class resources, comparable to the PodClassResources message in PodSandboxConfig in the CRI API. ```diff @@ -342,7 +342,7 @@ comparable to the PodClassResources message in PodSandboxConfig in the CRI API. } ``` -In practice, the class resource information will be directly used in the CRI +In practice, the QoS-class resource information will be directly used in the CRI ContainerConfig (e.g. CreateContainerRequest message). At this point, without resource discovery or access control kubelet does not do any validity checking of the values. Invalid class assignments will cause an error in the container @@ -410,10 +410,10 @@ Some possible alternatives. #### Access control -If class resources were advertised as API objects the natural access +If QoS-class resources were advertised as API objects the natural access control mechanism would be through RBAC. -If class resources were advertised in node status (similar to other resources), +If QoS-class resources were advertised in node status (similar to other resources), access control could be achieved e.g. by extending ResourceQuotaSpec which would implement restrictions based on the namespace. @@ -429,11 +429,11 @@ would implement restrictions based on the namespace. 
// object tracked by a quota but expressed using ScopeSelectorOperator in combination // with possible values. ScopeSelector *ScopeSelector -+ // AllowedClasses specifies the list of allowed classes for each class-based resource ++ // AllowedClasses specifies the list of allowed classes for each QoS-class resource + AllowedClasses map[ClassResourceName]ResourceClassList } -+// ResourceClassList is a list of classes of a specific type of class-based resource. ++// ResourceClassList is a list of classes of a specific type of QoS-class resource. +type ResourceClassList []string ``` ## Proposal @@ -447,13 +447,13 @@ The "Design Details" section below is for the real nitty-gritty. --> -We extend the CRI protocol to contain information about the class-based +We extend the CRI protocol to contain information about the QoS-class resource assignment of containers. Currently we identify two types of resources (RDT and blockio) but the API changes will be generic so that it that will serve other similar resources in the future. We implement pod annotations the initial mechanism for Kubernetes users to -control class resource assignment. We define two class resources that can be +control QoS-class resource assignment. We define two QoS-class resources that can be controlled via annotations, i.e. RDT and blockio. We introduce a feature gate that enables kubelet to interpret pod annotations @@ -498,7 +498,7 @@ Go in to as much detail as necessary here. This might be a good place to talk about core concepts and how they relate. --> -This is only the first step in getting class-based resources supported in +This is only the first step in getting QoS-class resources supported in Kubernetes. Important pieces like resource assignment via pod spec, resource status, resource disovery and permission control are [non-goals](#non-goals) not solved here. @@ -541,7 +541,7 @@ required) or even code snippets. 
If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> -Configuration and management of the class resources is fully handled by the +Configuration and management of the QoS-class resources is fully handled by the underlying container runtime and is invisible to kubelet. An error to the CRI client is returned if the specified class is not available. @@ -550,7 +550,7 @@ client is returned if the specified class is not available. The following additions to the CRI protocol are suggested. The `ContainerConfig` message will be supplemented with new `class_resources` -field, providing per-container setting for class resources. +field, providing per-container setting for QoS-class resources. ```diff @@ -562,7 +562,7 @@ field, providing per-container setting for class resources. // Configuration specific to Windows containers. WindowsContainerConfig windows = 16; + -+ // Configuration of class resources. ++ // Configuration of QoS-class resources. + ContainerClassResources class_resources = 17; } @@ -587,7 +587,7 @@ runtime. LinuxPodSandboxConfig linux = 8; // Optional configurations specific to Windows hosts. WindowsPodSandboxConfig windows = 9; -+ // Configuration of class resources. ++ // Configuration of QoS-class resources. + PodClassResources class_resources = 10; + } @@ -600,15 +600,15 @@ runtime. 
+} ``` -Also, define "known" class resource types to more easily align container +Also, define "known" QoS-class resource types to more easily align container runtime implementations: ``` + +const ( -+ // ClassResourceRdt is the name of the RDT class resource ++ // ClassResourceRdt is the name of the RDT QoS-class resource + ClassResourceRdt = "rdt" -+ // ClassResourceBlockio is the name of the blockio class resource ++ // ClassResourceBlockio is the name of the blockio QoS-class resource + ClassResourceBlockio = "blockio" +) ``` @@ -617,14 +617,14 @@ runtime implementations: Use Pod annotation as the initial K8s user interface, similar to e.g. how seccomp support was added. This will bridge the gap between enabling -class-based resources in the CRI protocol and making them available in the Pod +QoS-class resources in the CRI protocol and making them available in the Pod spec. A feature gate ClassResources enables kubelet to look for pod -annotations and set the class resource assignment via CRI protocol accordingly. +annotations and set the QoS-class resource assignment via CRI protocol accordingly. Specifically, kubelet will support annotations for specifying RDT and blockio -class, the two types of class resources that already have basic support in the +class, the two types of QoS-class resources that already have basic support in the container runtimes. - `rdt.resources.beta.kubernetes.io/pod` for setting a Pod-level default RDT @@ -638,7 +638,7 @@ container runtimes. 
### Container runtimes -We have implemented class-based RDT and blockio support in CRI-O and +We have implemented QoS-class RDT and blockio support in CRI-O and containerd: - cri-o: @@ -648,24 +648,24 @@ containerd: - [~~Support Intel RDT~~](https://github.com/containerd/containerd/pull/5439) - [~~Support for cgroups blockio~~](https://github.com/containerd/containerd/pull/5490) -The design paradigm here is that the container runtime configures the resource -classes according to a given configuration file. Enforcement on containers is +The design paradigm here is that the container runtime configures the QoS-class +resources according to a given configuration file. Enforcement on containers is done via OCI. User interface is provided through pod and container annotations. ### Open Questions #### Pod QoS class -The Pod QoS class could be communicated to the container runtime as a class +The Pod QoS class could be communicated to the container runtime as a QoS-class resource, too. This information is currently internal to kubelet. However, container runtimes (CRI-O, at least) are already depending on this information and currently determining it indirectly by evaluating other CRI parameters. It -would be better to explicitly state the Pod QoS class and class resources would +would be better to explicitly state the Pod QoS class and QoS-class resources would look like a logical place for that. This also makes it techically possible to have container-specific QoS classes (as a possible future enhancement of K8s). -Communicating Pod QoS class via class resources would advocate moving class -resources up to `ContainerConfig`. +Communicating Pod QoS class via QoS-class resources would advocate moving +QoS-class resources up to `ContainerConfig`. Making this change, it would also be possible to separate `oom_score_adj` from the pod qos class in the future. 
The runtime could provide a set of OOM @@ -740,7 +740,7 @@ https://storage.googleapis.com/k8s-triage/index.html Alpha: no specific integration tests are planned for Alpha. Beta: Existing integration tests for affected components (e.g. scheduler, node -status, quota) are extended to cover class resources. +status, quota) are extended to cover QoS-class resources. - : @@ -759,7 +759,7 @@ We expect no non-infra related flakes in the last month as a GA graduation crite Alpha: no specific e2e-tests are planned. In order to be able to run e2e tests, a cluster with nodes having runtime -support for class resources is required. +support for QoS-class resources is required. ### Graduation Criteria @@ -1134,7 +1134,7 @@ Describe them, providing: - Supported number of objects per namespace (for namespace-scoped objects) --> -Class resources do extend existing API types but not introduce new types of +QoS-class resources do extend existing API types but not introduce new types of objects. However, future work (KEPs) enabling resource discovery and permission control might change this. @@ -1246,7 +1246,7 @@ information to express the idea and why it was not acceptable. ### Pod spec Instead of introducing Pod annotations as an intermediate solution for -controlling the class resources, the Pod spec could be updated in lock-step +controlling the QoS-class resources, the Pod spec could be updated in lock-step with the CRI api. See the section [(Future work) Pod spec](#pod-spec) for more details. From 88d4ae227d9f68eb8639f41834d11ebef144ac37 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 17 Jun 2022 22:25:09 +0300 Subject: [PATCH 16/92] KEP-3008: add resource updates As a response to review comments from SergeyKanzhelev, add support for resource updates of running containers. Also, discuss immutable resources. 
--- .../3008-cri-class-based-resources/README.md | 70 ++++++++++++++++++- 1 file changed, 69 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 4441629d89d..e33e2e9bcf7 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -258,6 +258,7 @@ know that this has succeeded? - Support RDT class assignment of containers. This is already supported by the containerd and CRI-O runtime and part of the OCI runtime-spec - Support blockio class assignment of containers. +- Make the API to support updating QoS-class resource assignment of running containers - Make the extensions flexible, enabling simple addition of other QoS-class resource types in the future. @@ -290,7 +291,7 @@ This KEP (the [Proposal](#proposal)) implements the first phase. The goal is to enable a bare minimum for users to leverage QoS-class resources and start experimenting with them in Kubernetes: -- extend the CRI protocol to allow QoS-class resource assignment +- extend the CRI protocol to allow QoS-class resource assignment and updates - implement pod annotations as an initial user interface - introduce a feature gate for enabling QoS-class resource support in kubelet @@ -349,6 +350,9 @@ of the values. Invalid class assignments will cause an error in the container runtime which causes the corresponding CRI RuntimeService request (e.g. RunPodSandbox or CreateContainer) to fail with an error. +This phase would likely also wire QoS-class resources to +[In-place pod vertical scaling](#1287), allowing updates of running containers. 
+ Input validation of classes very similar to labels is implemented: keys (`ClassResourceName`) and values must be non-empty, less than 64 characters long, must start and end with an alphanumeric character and may contain only @@ -384,6 +388,8 @@ Some alternatives for presenting this information: + Name ClassResourceName + // Classes available in the resource + Classes []string + + // Immutable is set to true if the resource type does not support in-place updates + + Immutable bool +} ``` 1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't @@ -392,6 +398,19 @@ Some alternatives for presenting this information: #### Resource discovery +Resource discovery together with resource status/capacity information (above) +enables scheduler support for QoS-class resources. This would also make it +possible to delete/evict pods from nodes when requested QoS-class resource +types (or classes within) are no longer availab.e + +The discovery needs to be able to carry the following information: + +- Available QoS-class resource types. +- Available classes withing each resource type. +- Whether the resource type is immutable or if it supports in-place updates. + In-place updates of resoures might not be possible because of runtime + limitations or the underlying technology, for example. + Some possible alternatives. 1. Reported by the container runtime. Container runtime is (or at least should @@ -399,6 +418,29 @@ Some possible alternatives. the resources e.g. via either: 1. A separate gRPC endpoint or update `StatusResponse` + + ```diff + message RuntimeStatus { + // List of current observed runtime conditions. + repeated RuntimeCondition conditions = 1; + + // Information about the discovered resources + + ResourcesInfo resources = 2; + +} + + + +// ResourcesInfo contains information about the resources discovered by the + +// runtime. 
+ +message ResourcesInfo { + + repeated ClassResourceInfo class_resources = 1; + +} + + + +// ClassResourceInfo contains information about one type of class resource. + +message ClassResourceInfo { + + string Name = 1; + + repeated string classes = 2; + + bool immutable = 3; + } + ``` + 1. OR Populate a (json) file in a known location Of these, the first option is more idiomatic for how cri behaves today. @@ -452,6 +494,13 @@ resource assignment of containers. Currently we identify two types of resources (RDT and blockio) but the API changes will be generic so that it that will serve other similar resources in the future. +We also extend the CRI protocol to support updates of QoS-class resource +assignment of running containers. We recognize that currently container +runtimes lack the capability to update either of the two types of QoS-class +resources we have identified (RDT and blockio). However, there is no technical +limitation in that and we are planning to implement update support for them +in the future. + We implement pod annotations the initial mechanism for Kubernetes users to control QoS-class resource assignment. We define two QoS-class resources that can be controlled via annotations, i.e. RDT and blockio. @@ -574,6 +623,25 @@ field, providing per-container setting for QoS-class resources. +} ``` +The `UpdateContainerResourcesRequest` message will be similarly extended to +allow updating of QoS-class resource configuration of a running container. +Depending on runtime-level support of a particular resource (and possibly the +type of resource) UpdateContainerResourcesRequest might fail. Later phases +(with resource discovery/status) adds the ability to distinguish immutable +resource types. Note that neither of the existing QoS-class resource types (RDT +or blockio) support updates because of runtime limitations, yet. + +```diff + message UpdateContainerResourcesRequest { + + ... 
+ // resources to update or other options to use when updating the container. + map annotations = 4; ++ // Configuration of class resources. ++ ContainerClassResources class_resources = 5; +} +``` + The `PodSandboxConfig` will be supplemented with a corresponding `class_resources` field that will be the Pod level configuration. Depending on the resource this might be interpreted as a pod-level default (that is used if From d295052797ee09c5a9270963d70f5a67e5312ad5 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 17 Jun 2022 22:37:09 +0300 Subject: [PATCH 17/92] KEP-3008: fix typo --- keps/sig-node/3008-cri-class-based-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index e33e2e9bcf7..85ce6520763 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -406,7 +406,7 @@ types (or classes within) are no longer availab.e The discovery needs to be able to carry the following information: - Available QoS-class resource types. -- Available classes withing each resource type. +- Available classes within each resource type. - Whether the resource type is immutable or if it supports in-place updates. In-place updates of resoures might not be possible because of runtime limitations or the underlying technology, for example. 
From fcc5ed130f2ce9a2d0cf43c9b63fb4914e06ae6c Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 21 Jun 2022 17:06:14 +0300 Subject: [PATCH 18/92] KEP-3008: address review feedback from klueska --- .../3008-cri-class-based-resources/README.md | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 85ce6520763..264cdba3c86 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -343,6 +343,9 @@ comparable to the PodClassResources message in PodSandboxConfig in the CRI API. } ``` +There is already an ongoing effort to add [Pod level resource limits](#1592) +that aims at adding a pod level `Resources` field in a similar fashion. + In practice, the QoS-class resource information will be directly used in the CRI ContainerConfig (e.g. CreateContainerRequest message). At this point, without resource discovery or access control kubelet does not do any validity checking @@ -381,7 +384,7 @@ Some alternatives for presenting this information: // +optional Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` + // ResourceClasses lists the available - + ClassResourdes []ClassResourceList + + ClassResources []ClassResourceList + +type ClassResourceList { + // Name of the resource @@ -443,7 +446,7 @@ Some possible alternatives. 1. OR Populate a (json) file in a known location - Of these, the first option is more idiomatic for how cri behaves today. + Of these, the first option is more idiomatic for how CRI behaves today. As a reference, the API currently allows listing of some objects/resources (Pods, Containers, Images etc) but not some others. @@ -491,7 +494,7 @@ nitty-gritty. 
We extend the CRI protocol to contain information about the QoS-class resource assignment of containers. Currently we identify two types of -resources (RDT and blockio) but the API changes will be generic so that it that +resources (RDT and blockio) but the API changes will be generic so that it will serve other similar resources in the future. We also extend the CRI protocol to support updates of QoS-class resource @@ -615,10 +618,11 @@ field, providing per-container setting for QoS-class resources. + ContainerClassResources class_resources = 17; } -+// ContainerClassResources specifies the configuration of class based ++// ContainerClassResources specifies the configuration of QoS-class resources +// resources of a container. +message ContainerClassResources { -+ // Resource classes of the container will be assigned to ++ // QoS-class resource assignment of the container. ++ // Key is the resource type and values is the class name within the resource type. + map<string, string> classes = 1; +} ``` @@ -660,10 +664,11 @@ runtime. + } -+// PodClassResources specifies the configuration of class based ++// PodClassResources specifies the configuration of QoS-class resources +// resources of a pod. +message PodClassResources { -+ // Resource classes of the pod will be assigned to ++ // QoS-class resource assignment of the pod. ++ // Key is the resource type and values is the class name within the resource type.
+ map<string, string> class = 1; +} ``` From d2356e5d1c4d71017a50e18f25e723904ea8cf0c Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 21 Jun 2022 17:12:01 +0300 Subject: [PATCH 19/92] KEP-3008: correct PRR approver --- keps/prod-readiness/sig-node/3008.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/prod-readiness/sig-node/3008.yaml b/keps/prod-readiness/sig-node/3008.yaml index 2a9e151f6d9..fb59682aa20 100644 --- a/keps/prod-readiness/sig-node/3008.yaml +++ b/keps/prod-readiness/sig-node/3008.yaml @@ -3,4 +3,4 @@ # of http://git.k8s.io/enhancements/OWNERS_ALIASES kep-number: 3008 alpha: - approver: "@dchen1107" + approver: "@johnbelamaric" From 4e41cbf553010e6f8ec1565f434d2bf7d86514a2 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 21 Jun 2022 21:51:40 +0300 Subject: [PATCH 20/92] KEP-3008: fill in PRR questionnaire --- .../3008-cri-class-based-resources/README.md | 123 ++++++++++++++++-- 1 file changed, 114 insertions(+), 9 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 264cdba3c86..bf5d2e5612f 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -968,6 +968,10 @@ team. Please reach out on the you need any help or guidance. --> +In this section we refer to different +[implementation phases](#implementation-phases). In this KEP we're now +targeting phase 1.
+ ### Feature Enablement and Rollback -- [ ] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: ClassResources - Components depending on the feature gate: + - Implementation Phase 1: + - kubelet + - Future phases (with updated pod spec and scheduler and quota support): + - kubelet + - kube-apiserver + - kube-scheduler + - kube-controller-manager - [ ] Other - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control @@ -1003,6 +1014,8 @@ Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here. --> +No. + ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes it can. + +Implementation Phase 1: In this phase pod annotations are used as the user +interface for assigning QoS-class resources to workloads. Existing +running workloads continue to work without any changes as their QoS-class +resource assignment in the runtime is not changed. +Restarting or re-deploying a workload causes it to lose its QoS-class resource +assignment as the annotation parsing in kubelet is disabled. In other words, +the workload is able to run but the QoS-class resource assignment request from +the user (via pod annotations) is effectively ignored. + +Future implementation phases: running workloads continue to work without any +changes. Restarting or re-deploying a workload causes it to fail as the +requested QoS-class resources are not available. + ###### What happens if we reenable the feature if it was previously rolled back? +Implementation Phase 1: workloads need to be restarted to re-evaluate the pod +annotations to correctly communicate QoS-class resource assignments to the +container runtime.
+ +Future implementation phases: workloads might have failed because of +unsupported fields in the pod spec resource requirements and need to be +restarted. + ###### Are there any tests for feature enablement/disablement? +Implementation phase 1: No. + +Future implementation phases: unit tests for handling the changes in pod spec +are implemented. + ### Rollout, Upgrade and Rollback Planning +Implementation Phase 1: we rely on inspection of pod annotations inside kubelet +which should make rollout/rollback failure-safe. Already running workloads are +not affected. + +Future implementation phases: TBD. + ###### What specific metrics should inform a rollback? +Implementation Phase 1: watch for non-ready pods with CreateContainerError +status. The error message will indicate if the failure is related to +QoS-class resources. + +Future implementation phases: TBD. + ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? +TBD in future implementation phases. + ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? +Implementation Phase 1: No. + +Future implementation phases: TBD but should be no. Disabling the feature +should preserve the data of new fields (e.g. in pod spec) even if they are +disabled. + ### Monitoring Requirements +Implementation Phase 1: by examining pod annotations. + +Future implementation phases: by examining the new fields in pod spec. + ###### How can someone using this feature know that it is working for their instance? -- [ ] Events - - Event Reason: +- [x] Events + - Event Reason: Failed (CreateContainerError) + + + +To be defined in more detail in future implementation phases and for beta. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? @@ -1125,18 +1195,25 @@ These goals will help you determine what you need to measure (SLIs) in the next question.
--> +TBD in future implementation phases but basically the existing SLOs for Pods +should be adequate. + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + +TBD in future implementation phases and for beta. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? @@ -1145,6 +1222,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co implementation difficulties, etc.). --> +TBD in future implementation phases. + ### Dependencies +Implementation Phase 1: A container runtime with support for the new CRI API +fields is required. + ### Scalability -QoS-class resources do extend existing API types but not introduce new types of -objects. However, future work (KEPs) enabling resource discovery and permission -control might change this. +Implementation Phase 1: No. + +Future implementation phases: QoS-class resources do extend existing API types +but presumably not introduce new types of objects. However, the design for +resource discovery and permission control is not ready which might change this. ###### Will enabling / using this feature result in any new calls to the cloud provider? @@ -1231,8 +1315,17 @@ Describe them, providing: - Estimated amount of new objects: (e.g., new Object X for every existing Pod) --> -A new field in `ResourceRequirements` (of `Container`) will increase the size -of `Pod` objects by a bytes per class requested. +Implementation Phase 1: [pod annotations](#pod-annotations) are used as the +initial user interface to assign QoS-class resources to containers. Exact size +of each annotation varies (depending on the type of resource and whether it +is pod-level or container-specific) but the annotation key is expected to be +a few tens of bytes. The value part is the name of the class expected to be a few +bytes long.
+ +Future implementations: New fields in the pod spec will increase the size of +`Pod` objects by a few bytes per class requested. New fields will be added to +NodeStatus which will increase its size. New field will be added to +ResourceQuotaSpec increasing its size. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? @@ -1245,6 +1338,8 @@ Think about adding additional work or introducing new steps in between [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos --> +No, this is not expected. + ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? +No, this is not expected. + ### Troubleshooting +TBD. + ###### What steps should be taken if SLOs are not being met to determine the problem? +TBD. + ## Implementation History -Implementation phase 1: No. +Implementation phase 1: Unit test will be added to kubelet to test that +inspection of [pod annotations](#pod-annotations) is correctly disabled/enabled +with the feature gate. Future implementation phases: unit tests for handling the changes in pod spec are implemented. From 9965a87e28e38432728e032d1d48004b60204634 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 8 Jul 2022 17:41:34 +0300 Subject: [PATCH 22/92] KEP-3008: rethink pod-level resources Make the pod-level QoS-class resources totally independent of container-level QoS-class resources. Also, rename the annotations accordingly. 
--- .../3008-cri-class-based-resources/README.md | 61 ++++++++++++++++--- 1 file changed, 52 insertions(+), 9 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 47ab8c7252e..a1bd8ff82f5 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -86,6 +86,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Phase 1](#phase-1) - [Future work](#future-work) - [Pod Spec](#pod-spec) + - [Update sandbox-level QoS-class resources](#update-sandbox-level-qos-class-resources) - [Resource status/capacity](#resource-statuscapacity) - [Resource discovery](#resource-discovery) - [Access control](#access-control) @@ -213,6 +214,15 @@ Having the support for QoS-class resources in place, it will provide a framework for the future, for instance class-based network or memory type prioritization. +We have identified the need for both container-level and pod-level QoS-class +resources as independent concepts. Intel RDT (above) is per-container by +design because of the hardware implementation (the control/class hiearchy is +flat). Also, the current support for blockio is container-level only (it is not +possible to configure pod sandbox-level cgroup parameters). However, we have +plans for implementing configuration of sandbox-level blockio parameters. Other +usage for pod sandbox-level QoS-class resources would be communicating the +Kubernetes Pod QoS class from kubelet to the container runtime. + ## Motivation We extend the CRI protocol to contain information about the QoS-class -resource assignment of containers. Currently we identify two types of -resources (RDT and blockio) but the API changes will be generic so that it -will serve other similar resources in the future. +resource assignment of containers and pods. + +Pod-level and container-level QoS-class resources are completely independent +resource types. 
E.g. specifying something in the pod-level request does not +mean specifying a pod-level default for all containers of the pod. + +Currently we identify two types of container-level QoS-class resources (RDT and +blockio) but the API changes will be generic so that it will serve other +similar resources in the future. Currently there are no immediately enabled +pod-level QoS-class resources but we see usage scenarios for those in the +future (communicating the pod QoS class to the runtime and enabling pod-level +cgroup controls for blockio). We also extend the CRI protocol to support updates of QoS-class resource assignment of running containers. We recognize that currently container @@ -505,8 +546,8 @@ limitation in that and we are planning to implement update support for them in the future. We implement pod annotations the initial mechanism for Kubernetes users to -control QoS-class resource assignment. We define two QoS-class resources that can be -controlled via annotations, i.e. RDT and blockio. +control QoS-class resource assignment. We define two container-level QoS-class +resources that can be controlled via annotations, i.e. RDT and blockio. We introduce a feature gate that enables kubelet to interpret pod annotations for controlling the RDT and blockio class of containers. @@ -700,19 +741,21 @@ Specifically, kubelet will support annotations for specifying RDT and blockio class, the two types of QoS-class resources that already have basic support in the container runtimes. 
-- `rdt.resources.beta.kubernetes.io/pod` for setting a Pod-level default RDT + class for all containers +- `rdt.resources.beta.kubernetes.io/default` for setting a Pod-level default RDT class for all containers - `rdt.resources.beta.kubernetes.io/container.` for container-specific RDT class settings -- `blockio.resources.beta.kubernetes.io/pod` for setting a Pod-level default + blockio class for all containers +- `blockio.resources.beta.kubernetes.io/default` for setting a Pod-level default blockio class for all containers - `blockio.resources.beta.kubernetes.io/container.` for container-specific blockio class settings ### Container runtimes -We have implemented QoS-class RDT and blockio support in CRI-O and -containerd: +We have implemented support (container-level QoS-class resources) for Intel RDT +and blockio in CRI-O and containerd: - cri-o: - [~~Add support for Intel RDT~~](https://github.com/cri-o/cri-o/pull/4830) From 9785c03b1aafec4ca4c24fc5221f07f49b2cb462 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 8 Jul 2022 18:37:25 +0300 Subject: [PATCH 23/92] KEP-3008: rename annotations --- keps/sig-node/3008-cri-class-based-resources/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index a1bd8ff82f5..f8bd1bfd255 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -742,14 +742,14 @@ class, the two types of QoS-class resources that already have basic support in t container runtimes. 
class for all containers -- `rdt.resources.beta.kubernetes.io/default` for setting a Pod-level default RDT +- `rdt.resources.alpha.kubernetes.io/default` for setting a Pod-level default RDT class for all containers -- `rdt.resources.beta.kubernetes.io/container.` for +- `rdt.resources.alpha.kubernetes.io/container.` for container-specific RDT class settings blockio class for all containers -- `blockio.resources.beta.kubernetes.io/default` for setting a Pod-level default +- `blockio.resources.alpha.kubernetes.io/default` for setting a Pod-level default blockio class for all containers -- `blockio.resources.beta.kubernetes.io/container.` for +- `blockio.resources.alpha.kubernetes.io/container.` for container-specific blockio class settings ### Container runtimes From c860ffed204c2879ed9fa785577f7ad4c9175ffc Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 12 Sep 2022 18:23:09 +0300 Subject: [PATCH 24/92] KEP-3008: update future phases resources Reflect the separation of pod vs container level qos-class resources in the future work section. --- .../3008-cri-class-based-resources/README.md | 36 ++++++++++++++----- 1 file changed, 27 insertions(+), 9 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index f8bd1bfd255..581fd0fb2c1 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -415,8 +415,10 @@ Some alternatives for presenting this information: // Defaults to Capacity. // +optional Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` - + // ResourceClasses lists the available - + ClassResources []ClassResourceList + + // PodClassResrouces lists the available class resources available for pod sandboxes. 
+ + PodClassResources []ClassResourceList + + // ContainerClassResrouces lists the available class resources available for containers. + + ContainerClassResources []ClassResourceList + +type ClassResourceList { + // Name of the resource @@ -465,7 +467,10 @@ Some possible alternatives. +// ResourcesInfo contains information about the resources discovered by the +// runtime. +message ResourcesInfo { - + repeated ClassResourceInfo class_resources = 1; + + // Pod-level class resources available. + + repeated ClassResourceInfo pod_class_resources = 1; + + // Container-level class resources available. + + repeated ClassResourceInfo container_class_resources = 2; +} + +// ClassResourceInfo contains information about one type of class resource. @@ -474,7 +479,7 @@ Some possible alternatives. + repeated string classes = 2; + bool immutable = 3; } - ``` + ``` 1. OR Populate a (json) file in a known location @@ -506,12 +511,25 @@ would implement restrictions based on the namespace. // object tracked by a quota but expressed using ScopeSelectorOperator in combination // with possible values. ScopeSelector *ScopeSelector -+ // AllowedClasses specifies the list of allowed classes for each QoS-class resource -+ AllowedClasses map[ClassResourceName]ResourceClassList -} ++ // PodClassResources contains the allowed pod-level class resources. ++ PodClassResources []ClassResourceInfo ++ // ContainerClassResources contains the allowed container-level class resources. ++ ContainerClassResources []ClassResourceInfo + } + + // ResourceQuotaStatus defines the enforced hard limits and observed use. + type ResourceQuotaStatus struct { + ... + // Used is the current observed total usage of the resource in the namespace + // +optional + Used ResourceList ++ // PodClassResources contains the enforced set of pod-level class resources available. ++ PodClassResources []ClassResourceInfo ++ // ContainerClassResources contains the enforced set of container class resources available. 
++ ContainerClassResources []ClassResourceInfo + } + -+// ResourceClassList is a list of classes of a specific type of QoS-class resource. -+type ResourceClassList []string ``` ## Proposal From a466f6d17c7c310627e29531e9e039c65bdc00b4 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 13 Sep 2022 13:55:54 +0300 Subject: [PATCH 25/92] KEP-3008: respond to review feedback from sftim - major rewrite of the summary: make it shorter and try to make it more understandable, incorporating text suggested by sftim - add reference/links to intel rdt and linux resctrlfs - limit the usage of word "we" --- .../3008-cri-class-based-resources/README.md | 118 +++++++++++------- 1 file changed, 73 insertions(+), 45 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 581fd0fb2c1..e69de6706c8 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -194,34 +194,31 @@ updates. [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md --> -We would like to add support for QoS-class resources in Kubernetes. -QoS-class resources can be thought of as non-accountable resources, each of -which is presented by a set of classes. Being non-accountable means that -multiple containers can be assigned to the same class. They are also supposed -to be opaque to the CRI client in the sense that the container runtime takes -care of configuration and control of the resources and the classes within. +Add support to Kubernetes for declaring _quality-of-service_ resources, and +assigning these to Pods. A quality-of-service (QoS-class) resource is similar +to other Kubernetes resource types (i.e. native resources such as `cpu` and +`memory` or extended resources) because you can assign that resource to a +particular container. 
However, QoS-class resources are also different from +those other resources because they are used to assign a _class identifier_, +rather than to declare a specific amount of capacity that is allocated. + +Main characteristics of the new resource type (and the technologies they are +aimed at enabling) are: + +- multiple containers can be assigned to the same class of a certain type of + resource +- resources are represented by a limited set of class identifiers +- each type of resource has its own set of class identifiers + +With QoS-class resources, Pods and their containers can request +opaque QoS-class identifiers (classes) for some particular mechanism +(QoS-class resource type), such as block I/O bandwidth. Kubelet relays this +information to the container runtime which is responsible for enforcing the +request in the underlying system. A prime example of a QoS-class resource is Intel RDT (Resource Director Technology). RDT is a technology for controlling the cache lines and memory -bandwidth available to applications. RDT provides a class-based approach for -QoS control of these shared resources: all processes in the same hardware class -share a portion of cache lines and memory bandwidth. - -We also believe that the Linux Block IO controller (cgroup) should be handled -as a QoS-class resource on the level of container orchestration. This enables -configuring I/O scheduler priority and throttling I/O bandwidth per workload. -Having the support for QoS-class resources in place, it will provide a -framework for the future, for instance class-based network or memory type -prioritization. - -We have identified the need for both container-level and pod-level QoS-class -resources as independent concepts. Intel RDT (above) is per-container by -design because of the hardware implementation (the control/class hiearchy is -flat). Also, the current support for blockio is container-level only (it is not -possible to configure pod sandbox-level cgroup parameters). 
However, we have -plans for implementing configuration of sandbox-level blockio parameters. Other -usage for pod sandbox-level QoS-class resources would be communicating the -Kubernetes Pod QoS class from kubelet to the container runtime. +bandwidth available to applications. ## Motivation @@ -234,29 +231,51 @@ demonstrate the interest in a KEP within the wider Kubernetes community. [experience reports]: https://github.com/golang/go/wiki/ExperienceReports --> -RDT implements a class-based mechanism for controlling the cache and memory -bandwidth QoS of applications, providing a tool for mitigating noisy neighbors -and fulfilling SLAs. In Linux control happens via resctrl -- a -pseudo-filesystem provided by the kernel which makes it virtually agnostic of -the hardware architecture. The OCI runtime-spec has supported Intel RDT for a -while already. Other hardware vendors have comparable technologies which use -the same resctrl interface. +This enhancement proposal aims at improving the quality of service of +applications in Kubernetes by introducing a new type of resource control +mechanism. Certain types of resources are inherently shared by application (e.g. +cache, memory bandwidth and disk I/O) and while there are technologies for +controlling these, there is currently no meaningful way in Kubernetes to +support those tehcnologies. This proposal suggests to address the issue above +in a generalized way by extending the Kubernetes resource model with a new type +of resources, i.e. QoS-class resources. + +[Intel RDT][intel-rdt] implements a class-based mechanism for controlling the +cache and memory bandwidth QoS of applications. All processes in the same +hardware class share a portion of cache lines and memory bandwidth. RDT +proveides a way for mitigating noisy neighbors and fulfilling SLAs. In Linux +control happens via resctrl -- a pseudo-filesystem provided by the kernel which +makes it virtually agnostic of the hardware architecture. 
The OCI runtime-spec +has supported Intel RDT for a while already. Other hardware vendors have +comparable technologies which use the same [resctrl interface][linux-resctrl]. The Linux Block IO controller parameters depend very heavily on the underlying hardware and system configuration (device naming/numbering, IO scheduler configuration etc) which makes it very impractical to control from the Pod spec -level. In order to hide this complexity the concept of blockio classes is being +level. In order to hide this complexity the concept of blockio classes has been added to the container runtimes (CRI-O and containerd). A system administrator is able to configure blockio controller parameters on per-class basis and the -classes are then made available for CRI clients. - -Currently, there is no mechanism in Kubernetes to use these types of resources -. CRI-O and containerd runtimes have support for RDT and blockio classes and -they provide an bridge-gap user interface through special pod annotations. We -would like to eventually get these types of resources first class citizen and +classes are then made available for CRI clients. Following this model also +provies a possible framework for the future improvements, for instance enabling +class-based network or memory type prioritization of applications. + +Currently, there is no mechanism in Kubernetes to use these types of resources. +CRI-O and containerd runtimes have support for RDT and blockio classes and they +provide an bridge-gap user interface through special pod annotations. We would +like to eventually get these types of resources first class citizen and properly supported in Kubernetes, providing visibility, a well-defined user interface, and permission controls. +It seems necessary to support both container-level and pod-level QoS-class +resources as independent concepts. Intel RDT (above) is per-container by +design because of the hardware implementation (the control/class hiearchy is +flat). 
Also, the current support for blockio is container-level only (it is not +possible to configure pod sandbox-level cgroup parameters). However, having +pod-level QoS-class resources makes it possible to implement support for +sandbox-level blockio parameters. Other usage for pod sandbox-level QoS-class +resources would be communicating the Kubernetes Pod QoS class from +kubelet to the container runtime. + ### Goals +[intel-rdt]: https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html +[linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html From 0ac20f4e076f7106c67e075e596c24d69ace5887 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 13 Sep 2022 17:12:17 +0300 Subject: [PATCH 26/92] KEP-3008: fix formatting of one code block --- .../3008-cri-class-based-resources/README.md | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index e69de6706c8..1db8ddca89e 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -481,28 +481,28 @@ Some possible alternatives. 1. A separate gRPC endpoint or update `StatusResponse` ```diff - message RuntimeStatus { - // List of current observed runtime conditions. - repeated RuntimeCondition conditions = 1; - + // Information about the discovered resources - + ResourcesInfo resources = 2; - +} - + - +// ResourcesInfo contains information about the resources discovered by the - +// runtime. - +message ResourcesInfo { - + // Pod-level class resources available. - + repeated ClassResourceInfo pod_class_resources = 1; - + // Container-level class resources available. - + repeated ClassResourceInfo container_class_resources = 2; - +} - + - +// ClassResourceInfo contains information about one type of class resource. 
- +message ClassResourceInfo { - + string Name = 1; - + repeated string classes = 2; - + bool immutable = 3; - } + message RuntimeStatus { + // List of current observed runtime conditions. + repeated RuntimeCondition conditions = 1; + + // Information about the discovered resources + + ResourcesInfo resources = 2; + +} + + + +// ResourcesInfo contains information about the resources discovered by the + +// runtime. + +message ResourcesInfo { + + // Pod-level class resources available. + + repeated ClassResourceInfo pod_class_resources = 1; + + // Container-level class resources available. + + repeated ClassResourceInfo container_class_resources = 2; + +} + + + +// ClassResourceInfo contains information about one type of class resource. + +message ClassResourceInfo { + + string Name = 1; + + repeated string classes = 2; + + bool immutable = 3; + } ``` 1. OR Populate a (json) file in a known location From 434a21c1dfd06d2d445a0531f8ea87f65d583a7b Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 13 Sep 2022 17:21:39 +0300 Subject: [PATCH 27/92] KEP-3008: add diff specifier to one code block --- keps/sig-node/3008-cri-class-based-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 1db8ddca89e..d4e77ee0079 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -759,7 +759,7 @@ runtime. 
Also, define "known" QoS-class resource types to more easily align container runtime implementations: -``` +```diff + +const ( + // ClassResourceRdt is the name of the RDT QoS-class resource From 3e36f9541b9556e3c9729ae919506f4bbd5cc88f Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 19 Sep 2022 13:52:05 +0300 Subject: [PATCH 28/92] KEP-3008: fix reference to KEP-2837 --- keps/sig-node/3008-cri-class-based-resources/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index d4e77ee0079..87390bf6367 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -380,7 +380,7 @@ comparable to the PodClassResources message in PodSandboxConfig in the CRI API. } ``` -There is already an ongoing effort to add [Pod level resource limits](#1592) +There is already an ongoing effort to add [Pod level resource limits][kep-2837] that aims at adding a pod level `Resources` field in a similar fashion. In practice, the QoS-class resource information will be directly used in the CRI @@ -1548,3 +1548,4 @@ required. 
[intel-rdt]: https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html [linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html +[kep-2837]: https://github.com/kubernetes/enhancements/pull/1592 From bff4a93a5aca217b52b0d6733a7e7515f4751508 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 19 Sep 2022 13:52:35 +0300 Subject: [PATCH 29/92] KEP-3008: address review feedback from haircommander --- .../3008-cri-class-based-resources/README.md | 58 +++++++++++-------- 1 file changed, 34 insertions(+), 24 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 87390bf6367..355842b9382 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -243,7 +243,7 @@ of resources, i.e. QoS-class resources. [Intel RDT][intel-rdt] implements a class-based mechanism for controlling the cache and memory bandwidth QoS of applications. All processes in the same hardware class share a portion of cache lines and memory bandwidth. RDT -proveides a way for mitigating noisy neighbors and fulfilling SLAs. In Linux +provides a way for mitigating noisy neighbors and fulfilling SLAs. In Linux control happens via resctrl -- a pseudo-filesystem provided by the kernel which makes it virtually agnostic of the hardware architecture. The OCI runtime-spec has supported Intel RDT for a while already. Other hardware vendors have @@ -292,6 +292,11 @@ know that this has succeeded? - Make the extensions flexible, enabling simple addition of other QoS-class resource types in the future. 
- Make QoS-class resources opqaue (as possible) to the CRI client +- API changes to support updating Pod-level (sandbox-level) QoS-class resource + assignment of running pods ([future work](#future-work)) +- Resource status/capacity ([future work](#future-work)) +- Discovery of the QoS-class resources ([future work](#future-work)) +- Access control ([future work](#future-work)) ### Non-Goals @@ -303,11 +308,6 @@ and make progress. - Interface or mechanism for configuring the QoS-class resources (responsibility of the container runtime). - Enumerating possible (QoS-class) resource types or their detailed behavior -- API changes to support updating Pod-level (sandbox-level) QoS-class resource - assignment of running pods (will be addressed in a separate KEP) -- Resource status/capacity (will be addressed in a separate KEP) -- Discovery of the QoS-class resources (will be addressed in a separate KEP) -- Access control (will be addressed in a separate KEP) ## Implementation phases @@ -342,8 +342,8 @@ are currently out of the scope of this KEP and were listed under #### Pod Spec -Replace pod annotations with proper user interface via the Pod spec. Below, one -possible option is presented. +This future step will replace pod annotations with proper user interface via +the Pod spec. Below, one possible option is presented. Introduce a new field (e.g. class) into ResourceRequirements of Container. @@ -402,6 +402,8 @@ in the key is allowed, similar to labels, e.g. `vendor/resource`. #### Update sandbox-level QoS-class resources +This future step would be a second extesion to the CRI API. + Currently there is no endpoint in the CRI API to update the configuration of pod sandboxes. In contrast, container-level resources can be updated with the UpdateContainerResources API endpoint. 
In order to make container and pod @@ -421,8 +423,10 @@ This will likely required a new API endpoint in CRI: #### Resource status/capacity -This KEP does not speak out anything about presenting the available resource -types (or classes within) to the users. +This future step will add support for representing information about the +available QoS-class resource types (and the classes within each resource type). +This is important for the end users (to see what is available for the pods and +containers to consume) and also an enabler for scheduler support. Some alternatives for presenting this information: @@ -459,10 +463,13 @@ Some alternatives for presenting this information: #### Resource discovery +This future step will add support for discovery of available QoS-class resource +types (and the classes withing each type) on each node. + Resource discovery together with resource status/capacity information (above) enables scheduler support for QoS-class resources. This would also make it possible to delete/evict pods from nodes when requested QoS-class resource -types (or classes within) are no longer availab.e +types (or classes within) are no longer available. The discovery needs to be able to carry the following information: @@ -516,6 +523,9 @@ Some possible alternatives. #### Access control +This future step adds support for controlling the access to available QoS-class +resources. + If QoS-class resources were advertised as API objects the natural access control mechanism would be through RBAC. @@ -633,16 +643,14 @@ Go in to as much detail as necessary here. This might be a good place to talk about core concepts and how they relate. --> -This is only the first step in getting QoS-class resources supported in -Kubernetes. Important pieces like resource assignment via pod spec, resource -status, resource disovery and permission control are [non-goals](#non-goals) -not solved here. -These aspects -are briefly discussed in [future work](#future-work). 
The risk in this sort of -piecemeal approach is finding devil in the details, resulting in inconsistent -and/or crippled and/or cumbersome end result. However, there is a lot of -experience in extending the API and understanding which sort of solutions are -functional and practical. +Implementation Phase 1 is only the first step in getting QoS-class resources +supported in Kubernetes. Important pieces like resource assignment via pod +spec, resource status, resource disovery and permission control are [future +work](#future-work) not fully solved here. The risk in this sort of piecemeal +approach is finding devil in the details, resulting in inconsistent and/or +crippled and/or cumbersome end result. However, there is a lot of experience in +extending the API and understanding which sort of solutions are functional and +practical. ### Risks and Mitigations @@ -772,9 +780,11 @@ runtime implementations: ### Pod annotations Use Pod annotation as the initial K8s user interface, similar to e.g. how -seccomp support was added. This will bridge the gap between enabling -QoS-class resources in the CRI protocol and making them available in the Pod -spec. +seccomp support was added. This will bridge the gap between the first +implementation phase, i.e. enabling QoS-class resources in the CRI protocol, +and the future work which makes them available in the Pod spec. Kubelet will +read the specific Pod annotations and translate them into corresponding fields +in the CRI messages. A feature gate ClassResources enables kubelet to look for pod annotations and set the QoS-class resource assignment via CRI protocol accordingly. 
From 3db41ce32e3d51481a80b484b023a5d7253d8d58 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 19 Sep 2022 20:56:45 +0300 Subject: [PATCH 30/92] KEP-3008: fix spelling mistake --- keps/sig-node/3008-cri-class-based-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 355842b9382..93861f9c769 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -464,7 +464,7 @@ Some alternatives for presenting this information: #### Resource discovery This future step will add support for discovery of available QoS-class resource -types (and the classes withing each type) on each node. +types (and the classes within each type) on each node. Resource discovery together with resource status/capacity information (above) enables scheduler support for QoS-class resources. This would also make it From dbeecfaee87963ed05aa85827e2ad12f975a28c1 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 20 Sep 2022 18:40:51 +0300 Subject: [PATCH 31/92] KEP-3008: address review comments from mikebrow --- .../3008-cri-class-based-resources/README.md | 12 +++++++----- .../sig-node/3008-cri-class-based-resources/kep.yaml | 4 ++-- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 93861f9c769..130ec3ba88b 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -1132,12 +1132,15 @@ running workloads continue to work without any changes as their QoS-class resource assigment in the runtime is not changed. Restarting or re-deploying a workload causes it to lose its QoS-class resource assignment as the annotation parsing in kubelet is disabled. 
In other words, -the workload is able to run but the QoS-class resource assignment request from -the user (via pod annotations) is effectively ignored. +the workload is able to run but the QoS-class resource assignment requests from +the user, i.e. via pod annotations, are ignored by kubelet. Future implementation phases: running workloads continue to work without any changes. Restarting or re-deploying a workload causes it to fail as the -requested QoS-class resources are not available. +requested QoS-class resources are not available in Kubernetes anymore. The +resources are still supported by the underlying runtime but disabling the +feature in Kubernetes makes them unavailable and the related PodSpec fields are +not accepted in validation. ###### What happens if we reenable the feature if it was previously rolled back? @@ -1146,8 +1149,7 @@ annotations to correctly communicate QoS-class resource assignments to the container runtime. Future implementation phases: workloads might have failed because of -unsupported fields in the pod spec resource requirements and need to be -restarted. +unsupported fields in the pod spec and need to be restarted. ###### Are there any tests for feature enablement/disablement? diff --git a/keps/sig-node/3008-cri-class-based-resources/kep.yaml b/keps/sig-node/3008-cri-class-based-resources/kep.yaml index 3a323874f00..936a10989d9 100644 --- a/keps/sig-node/3008-cri-class-based-resources/kep.yaml +++ b/keps/sig-node/3008-cri-class-based-resources/kep.yaml @@ -15,11 +15,11 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.25" +latest-milestone: "v1.26" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: - alpha: "v1.25" + alpha: "v1.26" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From ff412733040f1c6a433f36ddf3eb46f78264a46a Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 22 Sep 2022 12:29:09 +0300 Subject: [PATCH 32/92] KEP-3008: address review comments from gjkim42 --- keps/sig-node/3008-cri-class-based-resources/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 130ec3ba88b..b196626fcce 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -256,8 +256,8 @@ level. In order to hide this complexity the concept of blockio classes has been added to the container runtimes (CRI-O and containerd). A system administrator is able to configure blockio controller parameters on per-class basis and the classes are then made available for CRI clients. Following this model also -provies a possible framework for the future improvements, for instance enabling -class-based network or memory type prioritization of applications. +provides a possible framework for the future improvements, for instance +enabling class-based network or memory type prioritization of applications. Currently, there is no mechanism in Kubernetes to use these types of resources. CRI-O and containerd runtimes have support for RDT and blockio classes and they @@ -410,7 +410,7 @@ UpdateContainerResources API endpoint. In order to make container and pod (sandbox) level QoS-class resources symmetric we want to make it possible to update of pod-level resource assignments, too. 
-This will likely required a new API endpoint in CRI: +This will likely require a new API endpoint in CRI: ```diff @@ -38,6 +38,8 @@ service RuntimeService { From 475ce50b0690db01845f98d92155e4020d01f90f Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 27 Sep 2022 15:17:18 +0300 Subject: [PATCH 33/92] KEP-3008: add a figure illustrating the overall design --- .../3008-cri-class-based-resources/README.md | 12 +++++++++--- .../3008-cri-class-based-resources/design.svg | 1 + 2 files changed, 10 insertions(+), 3 deletions(-) create mode 100644 keps/sig-node/3008-cri-class-based-resources/design.svg diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index b196626fcce..164605ae9a7 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -318,15 +318,21 @@ insights from earlier phases affecting design choises made in the later phases, hopefully resulting in a better overall end result. However, we also outline all the future steps to not lose the overall big picture. +The figure below illustrates the design of the full implementation (less quota) +and the part the first implementation phase covers. This KEP (the +[Proposal](#proposal)) in its current form implements this first phase – the +KEP will evolve and be supplemented with future phases getting implemented. + +![design](./design.svg) + In the current design QoS-class resources are designed to be opaque to the CRI client in the sense that the container runtime takes care of configuration and control of the resources and the classes within. ### Phase 1 -This KEP (the [Proposal](#proposal)) implements the first phase. 
The goal is to -enable a bare minimum for users to leverage QoS-class resources and start -experimenting with them in Kubernetes: +The goal is to enable a bare minimum for users to leverage QoS-class resources +and start experimenting with them in Kubernetes: - extend the CRI protocol to allow QoS-class resource assignment and updates - implement pod annotations as an initial user interface diff --git a/keps/sig-node/3008-cri-class-based-resources/design.svg b/keps/sig-node/3008-cri-class-based-resources/design.svg new file mode 100644 index 00000000000..bf989f43057 --- /dev/null +++ b/keps/sig-node/3008-cri-class-based-resources/design.svg @@ -0,0 +1 @@ +PodSpec resources: classes: rdt: gold blockio: hi-prio | Node X capacity: rdt: [gold, silver, bronze] blockio: [hi-prio, lo-prio] | API server | Scheduler | runtime | system | kubelet | node X | enforce resource assignment | discover resources | announce resources | pod is created | find suitable node | pod is scheduled | update node status | container created | 4 5 6 3 7 2 1 8 | out-of-scope | phase 1 \ No newline at end of file From 3bf47d229f34c6c66af9e09229ef266e484ceab8 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 27 Sep 2022 17:51:15 +0300 Subject: [PATCH 34/92] KEP-3008: mention validation of annotations --- keps/sig-node/3008-cri-class-based-resources/README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 164605ae9a7..2bd16a513d1 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -810,6 +810,12 @@ container runtimes. - `blockio.resources.alpha.kubernetes.io/container.` for container-specific blockio class settings +A validation check (core api validation) is added in the API server to reject +changes to these annotations after a Pod has been created. This ensures that +the annotations always reflect the actual assignment of QoS-class resources of +a Pod.
It also serves as part of the UX to indicate the in-place updates of the +resources via annotations is not supported. + ### Container runtimes Currently, there is support (container-level QoS-class resources) for Intel RDT From 25e8b0afbde9d1d6c75cb0a50e43fa36a21c9114 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 27 Sep 2022 18:39:17 +0300 Subject: [PATCH 35/92] KEP-3008: fixup: rethink pod-level resources Fix some omissions of the edit where pod-level resources were separated into an independent concept. --- .../3008-cri-class-based-resources/README.md | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 2bd16a513d1..79c6303af0e 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -744,11 +744,8 @@ or blockio) support updates because of runtime limitations, yet. ``` The `PodSandboxConfig` will be supplemented with a corresponding -`class_resources` field that will be the Pod level configuration. Depending on -the resource this might be interpreted as a pod-level default (that is used if -nothing is specified in the `ContainerConfig`) or as a true Pod-level setting - -in the end the detailed behavior will be responsibility of the container -runtime. +`class_resources` field that specifies the assignment of pod-level QoS-class +resources. ```diff message PodSandboxConfig { @@ -1428,10 +1425,9 @@ Describe them, providing: Implementation Phase 1: [pod annotations](#pod-annotations) are used as the initial user interface so assign QoS-class resources to containers. Exact size -of each annotation varies (depending on the type of resource and whether it -is pod-level of container-specific) but the annotation key is expected to be -few tens of bytes. The value part is the name of the class expected to be a few -bytes long. 
+of each annotation varies (depending on the type of resource) but the +annotation key is expected to be few tens of bytes. The value part is the name +of the class expected to be a few bytes long. Future implementations: New fields in the pod spec will increase the size of `Pod` objects by a few bytes per class requested. New fields will be added to From c5246f63cbab5f5efd8dceaa2582bd8e5f70ba10 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 27 Sep 2022 19:30:27 +0300 Subject: [PATCH 36/92] KEP-3008: add resource discovery to phase 1 Propose to extend the CRI API for QoS-class resource discovery in implementation phase 1 of the KEP. --- .../3008-cri-class-based-resources/README.md | 221 ++++++++++-------- 1 file changed, 121 insertions(+), 100 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 79c6303af0e..613dc08ef81 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -88,7 +88,6 @@ tags, and then generate with `hack/update-toc.sh`. - [Pod Spec](#pod-spec) - [Update sandbox-level QoS-class resources](#update-sandbox-level-qos-class-resources) - [Resource status/capacity](#resource-statuscapacity) - - [Resource discovery](#resource-discovery) - [Access control](#access-control) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) @@ -100,7 +99,14 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [CRI protocol](#cri-protocol) + - [ContainerConfig](#containerconfig) + - [UpdateContainerResourcesRequest](#updatecontainerresourcesrequest) + - [PodSandboxConfig](#podsandboxconfig) + - [RuntimeStatus](#runtimestatus) + - [Consts](#consts) - [Pod annotations](#pod-annotations) + - [Kubelet](#kubelet) + - [API server](#api-server) - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - [Pod QoS class](#pod-qos-class) @@ -292,10 +298,10 @@ know that this has succeeded? - Make the extensions flexible, enabling simple addition of other QoS-class resource types in the future. - Make QoS-class resources opqaue (as possible) to the CRI client +- Discovery of the available QoS-class resources - API changes to support updating Pod-level (sandbox-level) QoS-class resource assignment of running pods ([future work](#future-work)) - Resource status/capacity ([future work](#future-work)) -- Discovery of the QoS-class resources ([future work](#future-work)) - Access control ([future work](#future-work)) ### Non-Goals @@ -334,7 +340,10 @@ control of the resources and the classes within. The goal is to enable a bare minimum for users to leverage QoS-class resources and start experimenting with them in Kubernetes: -- extend the CRI protocol to allow QoS-class resource assignment and updates +- extend the CRI protocol to allow QoS-class resource assignment and updates to + be communicated from kubelet to the runtime +- extend the CRI protocol to allow runtime to communicate available QoS-class + resources (the types of resources and the classes within) to kubelet - implement pod annotations as an initial user interface - introduce a feature gate for enabling QoS-class resource support in kubelet @@ -467,66 +476,6 @@ Some alternatives for presenting this information: necessarily that neatly align with two level hierarchy (resource name and a set of classes within). 
Also, only best suited to homogenous clusters. -#### Resource discovery - -This future step will add support for discovery of available QoS-class resource -types (and the classes within each type) on each node. - -Resource discovery together with resource status/capacity information (above) -enables scheduler support for QoS-class resources. This would also make it -possible to delete/evict pods from nodes when requested QoS-class resource -types (or classes within) are no longer available. - -The discovery needs to be able to carry the following information: - -- Available QoS-class resource types. -- Available classes within each resource type. -- Whether the resource type is immutable or if it supports in-place updates. - In-place updates of resoures might not be possible because of runtime - limitations or the underlying technology, for example. - -Some possible alternatives. - -1. Reported by the container runtime. Container runtime is (or at least should - be) aware of all resource types and the classes within. It could advertise - the resources e.g. via either: - - 1. A separate gRPC endpoint or update `StatusResponse` - - ```diff - message RuntimeStatus { - // List of current observed runtime conditions. - repeated RuntimeCondition conditions = 1; - + // Information about the discovered resources - + ResourcesInfo resources = 2; - +} - + - +// ResourcesInfo contains information about the resources discovered by the - +// runtime. - +message ResourcesInfo { - + // Pod-level class resources available. - + repeated ClassResourceInfo pod_class_resources = 1; - + // Container-level class resources available. - + repeated ClassResourceInfo container_class_resources = 2; - +} - + - +// ClassResourceInfo contains information about one type of class resource. - +message ClassResourceInfo { - + string Name = 1; - + repeated string classes = 2; - + bool immutable = 3; - } - ``` - - 1. 
OR Populate a (json) file in a known location - - Of these, the first option is more idiomatic for how CRI behaves today. - As a reference, the API currently allows listing of some objects/resources - (Pods, Containers, Images etc) but not some others. - -1. Manual configuration. Would be best suited for case where resources and - classes would be presented as separate API objects. - #### Access control This future step adds support for controlling the access to available QoS-class @@ -583,7 +532,17 @@ nitty-gritty. --> We extend the CRI protocol to contain information about the QoS-class -resource assignment of containers and pods. +resource assignment of containers and pods. Resource assignment requests will +be simple key-value pairs (*resource-type=class-name*) + +Container runtime is expected to be aware of all resource types and the classes +within. The CRI protocol is extended to be able to communicate the available +QoS-class resources from the runtime to the client. This information includes: +- Available QoS-class resource types. +- Available classes within each resource type. +- Whether the resource type is immutable or if it supports in-place updates. + In-place updates of resoures might not be possible because of runtime + limitations or the underlying technology, for example. Pod-level and container-level QoS-class resources are completely independent resource types. E.g. specifying something in the pod-level request does not @@ -651,8 +610,8 @@ This might be a good place to talk about core concepts and how they relate. Implementation Phase 1 is only the first step in getting QoS-class resources supported in Kubernetes. Important pieces like resource assignment via pod -spec, resource status, resource disovery and permission control are [future -work](#future-work) not fully solved here. The risk in this sort of piecemeal +spec, resource status and permission control are [future work](#future-work) +not fully solved here. 
The risk in this sort of piecemeal approach is finding devil in the details, resulting in inconsistent and/or crippled and/or cumbersome end result. However, there is a lot of experience in extending the API and understanding which sort of solutions are functional and @@ -677,9 +636,7 @@ Consider including folks who also work outside the SIG or subproject. KEP introducing permission controls. - Confusion: user tries to assign container to RDT class but RDT has not been enabled on system(s). This will be addressed by future KEP(s) introducing - resource discovery and status. -- Keeping client (kubelet) and runtime in sync wrt to available classes. Will - be addressed in future KEP about resource discovery. + resource availability status. ## Design Details @@ -698,9 +655,12 @@ client is returned if the specified class is not available. The following additions to the CRI protocol are suggested. -The `ContainerConfig` message will be supplemented with new `class_resources` -field, providing per-container setting for QoS-class resources. +#### ContainerConfig +The `ContainerConfig` message will be supplemented with new `class_resources` +field, providing per-container setting for QoS-class resources. This will be +used in `CreateContainerRequest` to communicate the container-level QoS-class +resource assignments to the runtime. ```diff message ContainerConfig { @@ -724,13 +684,17 @@ field, providing per-container setting for QoS-class resources. +} ``` -The `UpdateContainerResourcesRequest` message will be similarly extended to -allow updating of QoS-class resource configuration of a running container. -Depending on runtime-level support of a particular resource (and possibly the -type of resource) UpdateContainerResourcesRequest might fail. Later phases -(with resource discovery/status) adds the ability to distinguish immutable -resource types. 
Note that neither of the existing QoS-class resource types (RDT -or blockio) support updates because of runtime limitations, yet. +#### UpdateContainerResourcesRequest + +Similar to `CreateContainerRequest`, the `UpdateContainerResourcesRequest` +message will extended to allow updating of QoS-class resource configuration of +a running container. Depending on runtime-level support of a particular +resource (and possibly the type of resource) UpdateContainerResourcesRequest +might fail. Resource discovery (see [Runtime status](#runtime-status) the has +the capability to distinguish immutable resource types. + +Note that neither of the existing QoS-class resource types (RDT or blockio) +support updates because of runtime limitations, yet. ```diff message UpdateContainerResourcesRequest { @@ -743,9 +707,12 @@ or blockio) support updates because of runtime limitations, yet. } ``` -The `PodSandboxConfig` will be supplemented with a corresponding -`class_resources` field that specifies the assignment of pod-level QoS-class -resources. +#### PodSandboxConfig + +The `PodSandboxConfig` will be supplemented with a new `class_resources` field +that specifies the assignment of pod-level QoS-class resources. The intended +use for this would be to be able to communicate pod-level QoS-class resource +assignments at sandbox creation time (`RunPodSandboxRequest`). ```diff message PodSandboxConfig { @@ -755,7 +722,6 @@ resources. WindowsPodSandboxConfig windows = 9; + // Configuration of QoS-class resources. + PodClassResources class_resources = 10; -+ } +// PodClassResources specifies the configuration of QoS-class resources @@ -767,11 +733,51 @@ resources. +} ``` +#### RuntimeStatus + +Extend the `RuntimeStatus` message with new `resources` field that is used to +communicate the available QoS-class resources from the runtime to the client. + +This information can be used by the client (kubelet) to validate QoS-class +resource assignments before starting a pod. 
In future steps kubelet will patch +this information into node status. + +```diff + message RuntimeStatus { + // List of current observed runtime conditions. + repeated RuntimeCondition conditions = 1; ++ // Information about the discovered resources ++ ResourcesInfo resources = 2; ++} + ++// ResourcesInfo contains information about the resources discovered by the ++// runtime. ++message ResourcesInfo { ++ // Pod-level QoS-class resources available. ++ repeated ClassResourceInfo pod_class_resources = 1; ++ // Container-level QoS-class resources available. ++ repeated ClassResourceInfo container_class_resources = 2; ++} + ++// ClassResourceInfo contains information about one type of QoS-class resource. ++message ClassResourceInfo { ++ string Name = 1; ++ repeated ClassResourceClassInfo classes = 2; ++} + ++// ClassResourceClassInfo contains information about single class of one ++// QoS-class resource type. ++message ClassResourceClassInfo { ++ string Name = 1; + } +``` + +#### Consts + Also, define "known" QoS-class resource types to more easily align container runtime implementations: ```diff -+ +const ( + // ClassResourceRdt is the name of the RDT QoS-class resource + ClassResourceRdt = "rdt" @@ -785,18 +791,12 @@ runtime implementations: Use Pod annotation as the initial K8s user interface, similar to e.g. how seccomp support was added. This will bridge the gap between the first implementation phase, i.e. enabling QoS-class resources in the CRI protocol, -and the future work which makes them available in the Pod spec. Kubelet will -read the specific Pod annotations and translate them into corresponding fields -in the CRI messages. +and the future work which makes them available in the Pod spec. -A feature gate ClassResources enables kubelet to look for pod -annotations and set the QoS-class resource assignment via CRI protocol accordingly. +Specifically, annotations for specifying RDT and blockio class will be +supported. 
These are the two types of QoS-class resources that already have +basic support in the container runtimes. -Specifically, kubelet will support annotations for specifying RDT and blockio -class, the two types of QoS-class resources that already have basic support in the -container runtimes. - - class for all containers - `rdt.resources.alpha.kubernetes.io/default` for setting a Pod-level default RDT class for all containers - `rdt.resources.alpha.kubernetes.io/container.` for @@ -807,11 +807,32 @@ container runtimes. - `blockio.resources.alpha.kubernetes.io/container.` for container-specific blockio class settings +### Kubelet + +Kubelet will interpret the specific [pod annotations](#pod-annotations) and +translate them into corresponding `ClassResources` data in the CRI +ContainerConfig message at container creation time (CreateContainerRequest). +Pod-level QoS-class resources are not supported at this point (via pod +annotations). + +Kubelet will receive the information about available QoS-class resources (the +types of reqources and their classes) from the runtime over the CRI API (new +Resources field in RuntimeStatus message). An admission handler is added to +kubelet to validate the QoS-class resource request against the resource +availability on the node. Pod is rejected if sufficient resources do not exist. + +A feature gate ClassResources enables kubelet to interpretthe specific pod +annotations. If the feature gate is disabled the annotations are simply ignored +by kubelet. + +### API server + A validation check (core api validation) is added in the API server to reject -changes to these annotations after a Pod has been created. This ensures that -the annotations always reflect the actual assignment of QoS-class resources of -a Pod. It also serves as part of the UX to indicate the in-place updates of the -resources via annotations is not supported. 
+changes to the QoS-class resource specific [pod annotations](#pod-annotations) +after a Pod has been created. This ensures that the annotations always reflect +the actual assignment of QoS-class resources of a Pod. It also serves as part +of the UX to indicate the in-place updates of the resources via annotations is +not supported. ### Container runtimes @@ -829,6 +850,9 @@ The design paradigm here is that the container runtime configures the QoS-class resources according to a given configuration file. Enforcement on containers is done via OCI. User interface is provided through pod and container annotations. +Container runtimes will be updated to support the +[CRI API extensions](#cri-api) + ### Open Questions #### Pod QoS class @@ -841,9 +865,6 @@ would be better to explicitly state the Pod QoS class and QoS-class resources wo look like a logical place for that. This also makes it techically possible to have container-specific QoS classes (as a possible future enhancement of K8s). -Communicating Pod QoS class via QoS-class resources would advocate moving -QoS-class resources up to `ContainerConfig`. - Making this change, it would also be possible to separate `oom_score_adj` from the pod qos class in the future. The runtime could provide a set of OOM classes, making it possible for the user to specify a burstable pod with low From fae82bae54436271780744c7d9ff7a218680106e Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 27 Sep 2022 21:29:41 +0300 Subject: [PATCH 37/92] KEP-3008: update graduation criteria --- keps/sig-node/3008-cri-class-based-resources/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 613dc08ef81..d9bc5f0705a 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -1026,13 +1026,13 @@ in back-to-back releases. 
#### Beta - Gather feedback from developers and surveys -- In addition to the simple change in CRI API, implement the following +- In addition to the changes in CRI API, implement the following - Pod spec update - - Resource discovery - Resource status/capacity (with scheduling) - Parmission control - Well-defined behavior with [In-place pod vertical scaling](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources) - Additional tests are in Testgrid and linked in KEP +- User documentation is available #### GA From e4491cf6e68114297ddad14e6e746e3e5b9fe6bc Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 29 Sep 2022 11:31:41 +0300 Subject: [PATCH 38/92] KEP-3008: update figure Based on feedback, slightly update the figure explaining the design. Also, small update to the Motivation section. --- .../3008-cri-class-based-resources/README.md | 14 ++++++++++---- .../3008-cri-class-based-resources/design.svg | 2 +- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index d9bc5f0705a..043655c8eb1 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -246,6 +246,12 @@ support those tehcnologies. This proposal suggests to address the issue above in a generalized way by extending the Kubernetes resource model with a new type of resources, i.e. QoS-class resources. +This KEP identifies two technologies that can immediately be enabled with +QoS-class resources. However, these are just two examples and the proposed +changes are generic (and not tied to these two QoS-class resource types in any +way), making it easy to implement new QoS-class resource types in the runtimes +without any changes in Kubernetes. 
+ [Intel RDT][intel-rdt] implements a class-based mechanism for controlling the cache and memory bandwidth QoS of applications. All processes in the same hardware class share a portion of cache lines and memory bandwidth. RDT @@ -267,10 +273,10 @@ enabling class-based network or memory type prioritization of applications. Currently, there is no mechanism in Kubernetes to use these types of resources. CRI-O and containerd runtimes have support for RDT and blockio classes and they -provide an bridge-gap user interface through special pod annotations. We would -like to eventually get these types of resources first class citizen and -properly supported in Kubernetes, providing visibility, a well-defined user -interface, and permission controls. +provide an bridge-gap user interface through special pod annotations. The goal +is to get these types of resources first class citizens and properly supported +in Kubernetes, providing visibility, a well-defined user interface, and +permission controls. It seems necessary to support both container-level and pod-level QoS-class resources as independent concepts. 
Intel RDT (above) is per-container by diff --git a/keps/sig-node/3008-cri-class-based-resources/design.svg b/keps/sig-node/3008-cri-class-based-resources/design.svg index bf989f43057..f632363e200 100644 --- a/keps/sig-node/3008-cri-class-based-resources/design.svg +++ b/keps/sig-node/3008-cri-class-based-resources/design.svg @@ -1 +1 @@ -PodSpecresources:classes:rdt: goldblockio: hi-prioNode Xcapacity:rdt: [gold, silver, bronze]blockio: [hi-prio, lo-prio]API serverSchedulerruntimesystemkubeletnode Xenforce resource assignmentdiscoverresourcesannounceresourcespod is createdfind suitable nodepod is scheduledupdate node statuscontainercreated45637218out-of-scopephase 1 \ No newline at end of file +PodSpecNode XAPI serverSchedulerruntimesystemkubeletnode Xenforce QoS-class resource assignmentinitialize QoS-classresourcesannounceQoS-class resourcespod is createdfind suitable nodepod is scheduledupdate node statuspod & containerscreated45637218implementation detailsphase 1future phasesresources:classes:qos-res-a: goldqos-res-c: hi-priocapacity:qos-res-a:[gold, silver, bronze]qos-res-b:[red, yellow, green]qos-res-c:[hi-prio, lo-prio] \ No newline at end of file From 062d6ffede94778ee586c6e89fcc58c846ea58d8 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 3 Oct 2022 16:04:38 +0300 Subject: [PATCH 39/92] KEP-3008: address review feedback from rphillips State the current scope (implementation phase 1) in the Proposal and Design Detail sections. --- keps/sig-node/3008-cri-class-based-resources/README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 043655c8eb1..b4791bd5496 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -537,6 +537,10 @@ The "Design Details" section below is for the real nitty-gritty. 
--> +This section currently covers [implementation phase 1](#phase-1) (see +[implementation phases](#implementation-phases) for an outline of the complete +implementation). + We extend the CRI protocol to contain information about the QoS-class resource assignment of containers and pods. Resource assignment requests will be simple key-value pairs (*resource-type=class-name*) @@ -653,6 +657,10 @@ required) or even code snippets. If there's any ambiguity about HOW your proposal will be implemented, this is the place to discuss them. --> +The detailed design presented here covers [implementation phase 1](#phase-1) +(see [implementation phases](#implementation-phases) for an outline of all +planned changes). + Configuration and management of the QoS-class resources is fully handled by the underlying container runtime and is invisible to kubelet. An error to the CRI client is returned if the specified class is not available. From b0e23af3da66b4e2894959de55b7372a5e541846 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 3 Oct 2022 19:23:30 +0300 Subject: [PATCH 40/92] KEP-3008: changes based on comments from haircommander State more clearly that implementation phase 1 is largely about CRI API and the future phases are about K8s API. --- .../3008-cri-class-based-resources/README.md | 20 ++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index b4791bd5496..06febd1e0c2 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -331,9 +331,13 @@ hopefully resulting in a better overall end result. However, we also outline all the future steps to not lose the overall big picture. The figure below illustrates the design of the full implementation (less quota) -and the part the first implementation phase covers. 
This KEP (the -[Proposal](#proposal)) in its current form implements this first phase – the -KEP will evolve and be supplemented with future phases getting implemented. +and the division of implementation phases. The first implementation phase +basically covers the communication between kubelet and the container runtime +(i.e. CRI API). All changes to the Kubernetes API and its control plane +components are left to future work. This KEP (the [Proposal](#proposal) and +[Design Details](#design-details)) in its current form implements this first +phase – the KEP will evolve and be supplemented with future phases getting +implemented. ![design](./design.svg) @@ -358,13 +362,15 @@ and start experimenting with them in Kubernetes: This section sheds light on the end goal of this work in order to better evaluate this KEP in a broader context. What a fully working solution would consists of and what the (next) steps to accomplish that would be. These topics -are currently out of the scope of this KEP and were listed under -[Non-goals](#non-goals). +are currently listed as "future work" in [Goals](#goals). + +In practice, the future work mostly consists of changes to the Kubernetes API +and control plane components. #### Pod Spec -This future step will replace pod annotations with proper user interface via -the Pod spec. Below, one possible option is presented. +This future step will replace pod annotations with proper user interface in the +Kubernetes API, i.e. PodSpec. Below, one possible option is presented. Introduce a new field (e.g. class) into ResourceRequirements of Container. 
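As an illustration of the phase 1 flow described above (kubelet reads pod annotations and passes per-container class assignments to the runtime over CRI), the lookup could be sketched roughly as follows. This is a minimal sketch, not kubelet code: `classesForContainer` and `keyInfix` are hypothetical names, the key suffix after `container.` is assumed to be the container name, and the rule that a container-specific annotation overrides the pod-level `default` is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// keyInfix is the part shared by the annotation keys quoted in the KEP,
// e.g. "rdt.resources.alpha.kubernetes.io/default".
const keyInfix = ".resources.alpha.kubernetes.io/"

// classesForContainer is a hypothetical helper, not actual kubelet code.
// It resolves the QoS-class resource assignment (resource type -> class name)
// of one container from the pod annotations. A container-specific annotation
// takes precedence over the pod-level default (an assumption of this sketch).
func classesForContainer(annotations map[string]string, container string) map[string]string {
	classes := map[string]string{}
	override := "container." + container // assumed key suffix
	for key, class := range annotations {
		resType, rest, ok := strings.Cut(key, keyInfix)
		if !ok || resType == "" {
			continue // not a QoS-class resource annotation
		}
		switch rest {
		case override:
			classes[resType] = class
		case "default":
			// Apply the default only if no container-specific key exists.
			if _, found := annotations[resType+keyInfix+override]; !found {
				classes[resType] = class
			}
		}
	}
	return classes
}

func main() {
	annotations := map[string]string{
		"rdt.resources.alpha.kubernetes.io/default":           "bronze",
		"rdt.resources.alpha.kubernetes.io/container.app":     "gold",
		"blockio.resources.alpha.kubernetes.io/container.app": "hi-prio",
	}
	fmt.Println(classesForContainer(annotations, "app"))     // map[blockio:hi-prio rdt:gold]
	fmt.Println(classesForContainer(annotations, "sidecar")) // map[rdt:bronze]
}
```

The resulting map has the same *resource-type=class-name* shape that the proposal describes for resource assignment requests, so the resource types stay opaque to the caller.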
From c92d10463997f2eaf31ebd68fa5ccdcb92faaee4 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 3 Oct 2022 20:55:32 +0300 Subject: [PATCH 41/92] KEP-3008: address review comments from mikebrow --- .../3008-cri-class-based-resources/README.md | 22 +++++++------------ 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md index 06febd1e0c2..bb9a3f17ff4 100644 --- a/keps/sig-node/3008-cri-class-based-resources/README.md +++ b/keps/sig-node/3008-cri-class-based-resources/README.md @@ -212,19 +212,14 @@ Main characteristics of the new resource type (and the technologies they are aimed at enabling) are: - multiple containers can be assigned to the same class of a certain type of - resource -- resources are represented by a limited set of class identifiers -- each type of resource has its own set of class identifiers + QoS-class resource +- QoS-class resources are represented by an enumerable set of class identifiers +- each type of QoS-class resource has an independent set of class identifiers -With QoS-class resources, Pods and their containers can request -opaque QoS-class identifiers (classes) for some particular mechanism -(QoS-class resource type), such as block I/O bandwidth. Kubelet relays this -information to the container runtime which is responsible for enforcing the -request in the underlying system. - -A prime example of a QoS-class resource is Intel RDT (Resource Director -Technology). RDT is a technology for controlling the cache lines and memory -bandwidth available to applications. +With QoS-class resources, Pods and their containers can request opaque +QoS-class identifiers (classes) of certain QoS mechanism (QoS-class resource +type). Kubelet relays this information to the container runtime which is +responsible for enforcing the request in the underlying system. ## Motivation @@ -249,8 +244,7 @@ of resources, i.e. 
QoS-class resources. This KEP identifies two technologies that can immediately be enabled with QoS-class resources. However, these are just two examples and the proposed changes are generic (and not tied to these two QoS-class resource types in any -way), making it easy to implement new QoS-class resource types in the runtimes -without any changes in Kubernetes. +way), making it easier to implement new QoS-class resource types. [Intel RDT][intel-rdt] implements a class-based mechanism for controlling the cache and memory bandwidth QoS of applications. All processes in the same From b3bfda4a6192625e87a4576f635ef0b04405f431 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 4 Oct 2022 13:57:33 +0300 Subject: [PATCH 42/92] KEP-3008: rename kep subdir Thanks to sftim --- .../README.md | 0 .../design.svg | 0 .../kep.yaml | 0 3 files changed, 0 insertions(+), 0 deletions(-) rename keps/sig-node/{3008-cri-class-based-resources => 3008-qos-class-resources}/README.md (100%) rename keps/sig-node/{3008-cri-class-based-resources => 3008-qos-class-resources}/design.svg (100%) rename keps/sig-node/{3008-cri-class-based-resources => 3008-qos-class-resources}/kep.yaml (100%) diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md similarity index 100% rename from keps/sig-node/3008-cri-class-based-resources/README.md rename to keps/sig-node/3008-qos-class-resources/README.md diff --git a/keps/sig-node/3008-cri-class-based-resources/design.svg b/keps/sig-node/3008-qos-class-resources/design.svg similarity index 100% rename from keps/sig-node/3008-cri-class-based-resources/design.svg rename to keps/sig-node/3008-qos-class-resources/design.svg diff --git a/keps/sig-node/3008-cri-class-based-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml similarity index 100% rename from keps/sig-node/3008-cri-class-based-resources/kep.yaml rename to keps/sig-node/3008-qos-class-resources/kep.yaml From 
70d588c7ba46f73391e6c8f587d7e7d56bd2fc0e Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 6 Oct 2022 09:53:05 +0300 Subject: [PATCH 43/92] KEP-3008: address review comments from johnbelamaric --- keps/sig-node/3008-qos-class-resources/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index bb9a3f17ff4..16d91bf6066 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -237,7 +237,7 @@ applications in Kubernetes by introducing a new type of resource control mechanism. Certain types of resources are inherently shared by application (e.g. cache, memory bandwidth and disk I/O) and while there are technologies for controlling these, there is currently no meaningful way in Kubernetes to -support those tehcnologies. This proposal suggests to address the issue above +support those technologies. This proposal suggests to address the issue above in a generalized way by extending the Kubernetes resource model with a new type of resources, i.e. QoS-class resources. @@ -320,7 +320,7 @@ and make progress. This proposal splits the full implementation of QoS-class resources into multiple phases, building functionality gradually, step-by-step. The goal is to make the discussions more focused and easier. We may also learn on the way, -insights from earlier phases affecting design choises made in the later phases, +insights from earlier phases affecting design choices made in the later phases, hopefully resulting in a better overall end result. However, we also outline all the future steps to not lose the overall big picture. @@ -423,7 +423,7 @@ in the key is allowed, similar to labels, e.g. `vendor/resource`. #### Update sandbox-level QoS-class resources -This future step would be a second extesion to the CRI API. +This future step would be a second extension to the CRI API. 
Currently there is no endpoint in the CRI API to update the configuration of pod sandboxes. In contrast, container-level resources can be updated with the @@ -1134,6 +1134,7 @@ well as the [existing list] of feature gates. - Components depending on the feature gate: - Implementation Phase 1: - kubelet + - kube-apiserver (validation of annotations) - Future phases (with updated pod spec and scheduler and quota support): - kubelet - kube-apiserver From 1dad80003fd83d09a28249286b2c999d93856ffd Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 6 Oct 2022 19:04:00 +0300 Subject: [PATCH 44/92] KEP-3008: update kep.yaml Thanks to johnbelamaric. --- keps/sig-node/3008-qos-class-resources/kep.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index 936a10989d9..081d97533f0 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -27,6 +27,7 @@ feature-gates: - name: ClassResources components: - kubelet + - kube-apiserver disable-supported: true # The following PRR answers are required at beta release From dc336bdad68bf1135fe31170e7ebf4eea2521f97 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 7 Oct 2022 15:31:24 +0300 Subject: [PATCH 45/92] KEP-3008: address review comments from mikebrow and derekwaynecarr --- .../3008-qos-class-resources/README.md | 27 ++++++++++++------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 16d91bf6066..665af7a7d63 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -218,8 +218,8 @@ aimed at enabling) are: With QoS-class resources, Pods and their containers can request opaque QoS-class identifiers (classes) of certain QoS mechanism (QoS-class resource -type). 
Kubelet relays this information to the container runtime which is -responsible for enforcing the request in the underlying system. +type). Kubelet relays this information to the container runtime, without +directing how the request is enforced in the underlying system. ## Motivation @@ -233,13 +233,14 @@ demonstrate the interest in a KEP within the wider Kubernetes community. --> This enhancement proposal aims at improving the quality of service of -applications in Kubernetes by introducing a new type of resource control -mechanism. Certain types of resources are inherently shared by application (e.g. -cache, memory bandwidth and disk I/O) and while there are technologies for -controlling these, there is currently no meaningful way in Kubernetes to -support those technologies. This proposal suggests to address the issue above -in a generalized way by extending the Kubernetes resource model with a new type -of resources, i.e. QoS-class resources. +applications by implementing a new type of resource control mechanism in +Kubernetes. Certain types of resources are inherently shared by applications, +e.g. cache, memory bandwidth and disk I/O. While there are technologies for +controlling how these resources are shared between applications, there is +currently no meaningful way to support these technologies in Kubernetes. This +proposal suggests to address the issue above in a generalized way by extending +the Kubernetes resource model with a new type of resources, i.e. QoS-class +resources. This KEP identifies two technologies that can immediately be enabled with QoS-class resources. However, these are just two examples and the proposed @@ -554,6 +555,14 @@ QoS-class resources from the runtime to the client. This information includes: In-place updates of resoures might not be possible because of runtime limitations or the underlying technology, for example. 
+QoS-class resources are unbounded in the way that any number of applications +can request the same class (of the same QoS-class resource). QoS-class +resources typically represent some throttling mechanism - in this case for +example, if all (or most of) applications request the highest-tier class all of +these applications get the same level of service. +Future work on [access control(#access-control) will provide mechanisms to +limit what QoS-class resources (or classes) are available to users. + Pod-level and container-level QoS-class resources are completely independent resource types. E.g. specifying something in the pod-level request does not mean specifying a pod-level default for all containers of the pod. From 0d981ad15d915aec707a517472c0c7a7acf3cb43 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 10 Oct 2022 11:26:24 +0300 Subject: [PATCH 46/92] KEP-3008: address one review comment from mikebrow --- keps/sig-node/3008-qos-class-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 665af7a7d63..6bf2b25db73 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -233,7 +233,7 @@ demonstrate the interest in a KEP within the wider Kubernetes community. --> This enhancement proposal aims at improving the quality of service of -applications by implementing a new type of resource control mechanism in +applications by making available a new type of resource control mechanism in Kubernetes. Certain types of resources are inherently shared by applications, e.g. cache, memory bandwidth and disk I/O. 
While there are technologies for controlling how these resources are shared between applications, there is From 2300fab7040b489a3a007f4e2e89294bac0e045d Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 10 Oct 2022 14:04:12 +0300 Subject: [PATCH 47/92] KEP-3008: fix inconsistencies in the APIs --- .../3008-qos-class-resources/README.md | 85 ++++++++++++++----- 1 file changed, 63 insertions(+), 22 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 6bf2b25db73..a9da8efc184 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -465,18 +465,36 @@ Some alternatives for presenting this information: // Defaults to Capacity. // +optional Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` - + // PodClassResrouces lists the available class resources available for pod sandboxes. - + PodClassResources []ClassResourceList - + // ContainerClassResrouces lists the available class resources available for containers. - + ContainerClassResources []ClassResourceList - + - +type ClassResourceList { - + // Name of the resource + + // ClassResources contains information about the class resources that are + + // available on the node. + + // +optional + + ClassResources ClassResourceStatus + ... + + +// ClassResourceStatus describes class resources available on the node. + +type ClassResourceStatus struct { + + // PodClassResources contains the class resources that are available for + + // pods to be assigned to. + + PodClassResources []ClassResourceInfo + + // ContainerClassResources contains the class resources that are available + + // for containers to be assigned to. + + ContainerClassResources []ClassResourceInfo + +} + + +// ClassResourceInfo contains information about one class resource type. 
+ +type ClassResourceInfo struct { + + // Name of the resource. + Name ClassResourceName - + // Classes available in the resource - + Classes []string - + // Immutable is set to true if the resource type does not support in-place updates - + Immutable bool + + // Mutable is set to true if the resource supports in-place updates. + + Mutable bool + + // Classes available for assignment. + + Classes []ClassResourceClassInfo + +} + + +// ClassResourceClassInfo contains information about single class of one + +// QoS-class resource. + +type ClassResourceClassInfo struct { + + Name string +} ``` 1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't @@ -507,22 +525,41 @@ would implement restrictions based on the namespace. // object tracked by a quota but expressed using ScopeSelectorOperator in combination // with possible values. ScopeSelector *ScopeSelector -+ // PodClassResources contains the allowed pod-level class resources. -+ PodClassResources []ClassResourceInfo -+ // ContainerClassResources contains the allowed container-level class resources. -+ ContainerClassResources []ClassResourceInfo ++ // ClassResources contains the desired set of allowed class resources. ++ // +featureGate=ClassResources ++ // +optional ++ ClassResources ClassResourceQuota } ++// ClassResourceQuota contains the allowed class resources. ++type ClassResourceQuota struct { ++ // Pod contains the allowed class resources for pods. ++ // +optional ++ Pod []AllowedClassResource ++ // Container contains the allowed class resources for pods. ++ // +optional ++ Container []AllowedClassResource ++} + ++// AllowedClassResource specifies access to one QoS-class resources type. ++type AllowedClassResource struct { ++ // Name of the resource. ++ Name ClassResourceName ++ // Allowed classes. ++ Classes []string ++} + + // ResourceQuotaStatus defines the enforced hard limits and observed use. type ResourceQuotaStatus struct { ... 
// Used is the current observed total usage of the resource in the namespace // +optional Used ResourceList -+ // PodClassResources contains the enforced set of pod-level class resources available. -+ PodClassResources []ClassResourceInfo -+ // ContainerClassResources contains the enforced set of container class resources available. -+ ContainerClassResources []ClassResourceInfo ++ // ClassResources contains the enforced set of available class resources. ++ // +featureGate=ClassResources ++ // +optional ++ ClassResources ClassResourceQuota } @@ -713,7 +750,7 @@ Similar to `CreateContainerRequest`, the `UpdateContainerResourcesRequest` message will extended to allow updating of QoS-class resource configuration of a running container. Depending on runtime-level support of a particular resource (and possibly the type of resource) UpdateContainerResourcesRequest -might fail. Resource discovery (see [Runtime status](#runtime-status) the has +might fail. Resource discovery (see [Runtime status](#runtime-status)) has the capability to distinguish immutable resource types. Note that neither of the existing QoS-class resource types (RDT or blockio) @@ -782,10 +819,14 @@ this information into node status. + repeated ClassResourceInfo container_class_resources = 2; +} -+// ClassResourceInfo contains information about one type of QoS-class resource. ++// ClassResourceInfo contains information about one type of class resource. +message ClassResourceInfo { ++ // Name of the QoS-class resources. + string Name = 1; -+ repeated ClassResourceClassInfo classes = 2; ++ // Mutable is set to true if this resource sypports dynamic updates. ++ bool Mutable = 2; ++ // List of classes of this resource. 
++ repeated ClassResourceClassInfo classes = 3; +} +// ClassResourceClassInfo contains information about single class of one From 19889afa203746f2e193f156d4be115cbdfe0ff5 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 30 Nov 2022 19:11:19 +0200 Subject: [PATCH 48/92] KEP-3008: add "class capacity" to open questions --- .../3008-qos-class-resources/README.md | 125 ++++++++++++++++++ 1 file changed, 125 insertions(+) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index a9da8efc184..fc8d034a253 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -109,6 +109,9 @@ tags, and then generate with `hack/update-toc.sh`. - [API server](#api-server) - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) + - [Class capacity](#class-capacity) + - [Class capacity with extended resources](#class-capacity-with-extended-resources) + - [Class capacity in the API](#class-capacity-in-the-api) - [Pod QoS class](#pod-qos-class) - [Default class](#default-class) - [Test Plan](#test-plan) @@ -919,6 +922,128 @@ Container runtimes will be updated to support the ### Open Questions +#### Class capacity + +Resource quota provides a way to restrict access to certain types of QoS-class +resources and their classes. However, the current design does not contain any +mechanism for controlling the number of assigments into a certain class on +per-node or per-namespace basis. That is. say that "only 2 containers can be +assigned to RDT class *gold* on this node". + +This concept was deliberately left out of the proposal in order to reduce +confusion with other countable resources (native resources, extended +resources). Also, capacity does not often fully fit together with QoS +prioritization: generally there needs to be at least one "unlimited" class as +each container or pod needs to belong to some class. 
+ +However, there are at least two mechamisms how class capacity can be +implemented. The first is by leveraging extended resources and an admission +webhook to limit the usage of classes. The other alternative is to implement +class capacity in the API directly. + +##### Class capacity with extended resources + +It would be possible to implement "class capacity" by leveraging extended +resources and mutating admission webhooks: + +1. An extended resource with the desited capacity is created for each class + which needs to be controlled. A possible example: + ```plaintext + Allocatable: + qos/container/resource-X/class-A: 2 + qos/container/resource-X/class-B: 4 + qos/container/resource-Y/class-foo: 1 + ``` + + In this example *class-A* and *class-B* of QoS-class resource *resource-X* + would be limited to 2 and 4 assignments, respectively (on this node). + Similarly, class *class-foo* of QoS-class resource *resource-Y* would be + limited to only one assignment. + +2. A specific mutating admission webhook is deployed which syncs the specific + extended resources. That is, if a specific QoS-class resource assignment is + requested in the pod spec, a request for the corresponding extended resource + is automatically added. And vice versa: if a specific extended resource is + requested a QoS-class resource request is automatically put in place. + +However, this approach is quite involved, has shortcomings and caveats and thus +it is probably suitable only for targeted usage scenarios, not as a general +solution. Downsides include: + +- requires implementation of "side channel" control mechanisms, e.g. 
admission + webhook and some solution for capacity management (extended resources) +- deployment of admission webhooks is cumbersome +- management of capacity is limited and cumbersome + - management of extended resources needs a separate mechanism + - ugly to handle a scenario where *class-A* on *node-1* would be unlimited + but on *node-2* it would have some capacity limit (e.g. use a very high + integer capacity of the extended resource on *node-2*) + - dynamic changes are harder +- not well-suited for managing pod-level QoS-class resources as pod-level + resource requests (for the extended resource) are not supported in Kubernetes +- possible confusion for users regarding the double accounting (QoS-class + resources and extended resources) + +##### Class capacity in the API + +An alternative is to include the concept of "class capacity" in the APIs +directly. In this solution the per-node capacity would be controlled by the +runtime and visible in node status. Also resource quota could be extended to +limit and track capacity, giving per-namespace capacity control. + +Adding "class capacity" to the APIs could be a separete future implementation +phase if it is not desired in the first round of API changes. + +###### CRI + +Add capacity both to resource discovery. + +```diff + message ClassResourceClassInfo { + // Name of the class + string name = 1; ++ // Capacity is the number of maximum allowed simultaneous assignments into this class ++ // Zero means "infinite" capacity i.e. the usage is not restricted. ++ uint64 capacity = 2; ++ + } +``` + +###### NodeStatus + +Corresponding to the CRI changes, class capacity would be added to the class +info in NodeStatus. + +```diff + type ClassResourceClassInfo struct { + // Name of the class. + Name string ++ // Capacity is the number of maximum allowed simultaneous assignments into this class ++ // Zero means "infinite" capacity i.e. 
the usage is not restricted ++ // +optional ++ Capacity int64 + } +``` + +###### ResourceQuotaSpec + +ResourceQuota would be supplemented similarly. + +```diff + type AllowedClass struct { + // Name of the class. + Name string ++ // Capacity is the hard limit for usage of the class. ++ // +optional ++ Capacity int64 + } +``` + +It is worth noting that this change in ResourceQuota is independent from the +other. That is, resource quota could have "class capacity" even though there +wouldn't be such concept on the node status level. And vice versa: node status +could have "class capacity" without resource quota having this concept. + #### Pod QoS class The Pod QoS class could be communicated to the container runtime as a QoS-class From dfd9180831aeb299c58f5c8053026b948ed99b79 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 30 Nov 2022 19:41:23 +0200 Subject: [PATCH 49/92] KEP-3008: fix typo --- keps/sig-node/3008-qos-class-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index fc8d034a253..55015d66808 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -991,7 +991,7 @@ directly. In this solution the per-node capacity would be controlled by the runtime and visible in node status. Also resource quota could be extended to limit and track capacity, giving per-namespace capacity control. -Adding "class capacity" to the APIs could be a separete future implementation +Adding "class capacity" to the APIs could be a separate future implementation phase if it is not desired in the first round of API changes. 
###### CRI From 6a16e48b3dfe99305a99c26c3af747f8e5c9d7e5 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 30 Nov 2022 19:44:02 +0200 Subject: [PATCH 50/92] KEP-3008: fix typo --- keps/sig-node/3008-qos-class-resources/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 55015d66808..576ee9a16d0 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -936,7 +936,7 @@ resources). Also, capacity does not often fully fit together with QoS prioritization: generally there needs to be at least one "unlimited" class as each container or pod needs to belong to some class. -However, there are at least two mechamisms how class capacity can be +However, there are at least two mechanisms how class capacity can be implemented. The first is by leveraging extended resources and an admission webhook to limit the usage of classes. The other alternative is to implement class capacity in the API directly. 
From e98dbd8211edc3a3a3a34ba337b684fd5e0c6c37 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 12 Dec 2022 20:43:33 +0200 Subject: [PATCH 51/92] KEP-3008: fix title in kep.yaml --- keps/sig-node/3008-qos-class-resources/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index 081d97533f0..749d2965f56 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -1,4 +1,4 @@ -title: Support User Namespaces +title: QoS-class resources kep-number: 3008 authors: - "@marquiz" From da65495329b0055020adb22e10009d605a4151fe Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 4 Jan 2023 12:19:29 +0200 Subject: [PATCH 52/92] KEP-3008: API renaming ClassResource -> QoSResource Rename API types and fields from ClassResource* to QoSResource*. Also renames the feature gate. --- .../3008-qos-class-resources/README.md | 144 +++++++++--------- 1 file changed, 71 insertions(+), 73 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 576ee9a16d0..e626a9bf070 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -379,18 +379,18 @@ type ResourceRequirements struct { Limits ResourceList `json:"limits,omitempty" // Requests describes the minimum amount of compute resources required. Requests ResourceList `json:"requests,omitempty" -+ // Classes specifies the QoS-class resources that the container should be assigned -+ Classes map[ClassResourceName]string ++ // QoSResources specifies the QoS-class resources that the container should be assigned ++ QoSResources map[QoSResourceName]string } -+// ClassResourceName is the name of a QoS-class resource. -+type ClassResourceName string ++// QoSResourceName is the name of a QoS-class resource. 
++type QoSResourceName string ``` Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing -`ResourceRequirements` type but Limits and Requests must be left empty. Classes +`ResourceRequirements` type but Limits and Requests must be left empty. QoSResources may be set and they represent the Pod-level assignment of QoS-class resources, -comparable to the PodClassResources message in PodSandboxConfig in the CRI API. +comparable to the PodQoSResources message in PodSandboxConfig in the CRI API. ```diff type PodSpec struct { @@ -419,7 +419,7 @@ This phase would likely also wire QoS-class resources to [In-place pod vertical scaling](#1287), allowing updates of running containers. Input validation of classes very similar to labels is implemented: keys -(`ClassResourceName`) and values must be non-empty, less than 64 characters +(`QoSResourceName`) and values must be non-empty, less than 64 characters long, must start and end with an alphanumeric character and may contain only alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) @@ -468,35 +468,35 @@ Some alternatives for presenting this information: // Defaults to Capacity. // +optional Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` - + // ClassResources contains information about the class resources that are + + // QoSResources contains information about the class resources that are + // available on the node. + // +optional - + ClassResources ClassResourceStatus + + QoSResources QoSResourceStatus ... - +// ClassResourceStatus describes class resources available on the node. - +type ClassResourceStatus struct { - + // PodClassResources contains the class resources that are available for + +// QoSResourceStatus describes class resources available on the node. 
+ +type QoSResourceStatus struct { + + // PodQoSResources contains the class resources that are available for + // pods to be assigned to. - + PodClassResources []ClassResourceInfo - + // ContainerClassResources contains the class resources that are available + + PodQoSResources []QoSResourceInfo + + // ContainerQoSResources contains the class resources that are available + // for containers to be assigned to. - + ContainerClassResources []ClassResourceInfo + + ContainerQoSResources []QoSResourceInfo +} - +// ClassResourceInfo contains information about one class resource type. - +type ClassResourceInfo struct { + +// QoSResourceInfo contains information about one class resource type. + +type QoSResourceInfo struct { + // Name of the resource. - + Name ClassResourceName + + Name QoSResourceName + // Mutable is set to true if the resource supports in-place updates. + Mutable bool + // Classes available for assignment. - + Classes []ClassResourceClassInfo + + Classes []QoSResourceClassInfo +} - +// ClassResourceClassInfo contains information about single class of one + +// QoSResourceClassInfo contains information about single class of one +// QoS-class resource. - +type ClassResourceClassInfo struct { + +type QoSResourceClassInfo struct { + Name string +} ``` @@ -528,26 +528,26 @@ would implement restrictions based on the namespace. // object tracked by a quota but expressed using ScopeSelectorOperator in combination // with possible values. ScopeSelector *ScopeSelector -+ // ClassResources contains the desired set of allowed class resources. -+ // +featureGate=ClassResources ++ // QoSResources contains the desired set of allowed class resources. ++ // +featureGate=QoSResources + // +optional -+ ClassResources ClassResourceQuota ++ QoSResources QoSResourceQuota } -+// ClassResourceQuota contains the allowed class resources. -+type ClassResourceQuota struct { ++// QoSResourceQuota contains the allowed class resources. 
++type QoSResourceQuota struct { + // Pod contains the allowed class resources for pods. + // +optional -+ Pod []AllowedClassResource ++ Pod []AllowedQoSResource + // Container contains the allowed class resources for pods. + // +optional -+ Container []AllowedClassResource ++ Container []AllowedQoSResource +} -+// AllowedClassResource specifies access to one QoS-class resources type. -+type AllowedClassResource struct { ++// AllowedQoSResource specifies access to one QoS-class resources type. ++type AllowedQoSResource struct { + // Name of the resource. -+ Name ClassResourceName ++ Name QoSResourceName + // Allowed classes. + Classes []string +} @@ -559,10 +559,10 @@ would implement restrictions based on the namespace. // Used is the current observed total usage of the resource in the namespace // +optional Used ResourceList -+ // ClassResources contains the enforced set of available class resources. -+ // +featureGate=ClassResources ++ // QoSResources contains the enforced set of available class resources. ++ // +featureGate=QoSResources + // +optional -+ ClassResources ClassResourceQuota ++ QoSResources QoSResourceQuota } @@ -734,15 +734,14 @@ resource assignments to the runtime. // Configuration specific to Windows containers. WindowsContainerConfig windows = 16; + -+ // Configuration of QoS-class resources. -+ ContainerClassResources class_resources = 17; ++ // Configuration of QoS resources. ++ ContainerQoSResources qos_resources = 17; } -+// ContainerClassResources specifies the configuration of QoS-class resources -+// resources of a container. -+message ContainerClassResources { -+ // QoS-class resource assignment of the container. -+ // Key is the resource type and values is the class name within the resource type. ++// ContainerQoSResources specifies the configuration of QoS resources of a ++// container. ++message ContainerQoSResources { ++ // QoS resources the container will be assigned to. 
+ map classes = 1; +} ``` @@ -765,8 +764,8 @@ support updates because of runtime limitations, yet. ... // resources to update or other options to use when updating the container. map annotations = 4; -+ // Configuration of class resources. -+ ContainerClassResources class_resources = 5; ++ // Configuration of QoS resources. ++ ContainerQoSResources qos_resources = 5; } ``` @@ -783,16 +782,14 @@ assignments at sandbox creation time (`RunPodSandboxRequest`). LinuxPodSandboxConfig linux = 8; // Optional configurations specific to Windows hosts. WindowsPodSandboxConfig windows = 9; -+ // Configuration of QoS-class resources. -+ PodClassResources class_resources = 10; ++ // Configuration of QoS resources. ++ PodQoSResources qos_resources = 10; } -+// PodClassResources specifies the configuration of QoS-class resources -+// resources of a pod. -+message PodClassResources { -+ // QoS-class resource assignment of the pod. -+ // Key is the resource type and values is the class name within the resource type. -+ map class = 1; ++// PodQoSResources specifies the configuration of QoS resources of a pod. ++message PodQoSResources { ++ // QoS resources the pod will be assigned to. ++ map classes = 1; +} ``` @@ -816,26 +813,27 @@ this information into node status. +// ResourcesInfo contains information about the resources discovered by the +// runtime. +message ResourcesInfo { -+ // Pod-level QoS-class resources available. -+ repeated ClassResourceInfo pod_class_resources = 1; -+ // Container-level QoS-class resources available. -+ repeated ClassResourceInfo container_class_resources = 2; ++ // Pod-level QoS resources available. ++ repeated QoSResourceInfo pod_qos_resources = 1; ++ // Container-level QoS resources available. ++ repeated QoSResourceInfo container_qos_resources = 2; +} -+// ClassResourceInfo contains information about one type of class resource. -+message ClassResourceInfo { -+ // Name of the QoS-class resources. 
++// QoSResourceInfo contains information about one type of QoS resource. ++message QoSResourceInfo { ++ // Name of the QoS resources. + string Name = 1; + // Mutable is set to true if this resource sypports dynamic updates. + bool Mutable = 2; -+ // List of classes of this resource. -+ repeated ClassResourceClassInfo classes = 3; ++ // List of classes of this QoS resource. ++ repeated QoSResourceClassInfo classes = 3; +} -+// ClassResourceClassInfo contains information about single class of one -+// QoS-class resource type. -+message ClassResourceClassInfo { -+ string Name = 1; ++// QoSResourceClassInfo contains information about one class of certain ++// QoS resource. ++message QoSResourceClassInfo { ++ // Name of the class ++ string name = 1; } ``` @@ -846,10 +844,10 @@ runtime implementations: ```diff +const ( -+ // ClassResourceRdt is the name of the RDT QoS-class resource -+ ClassResourceRdt = "rdt" -+ // ClassResourceBlockio is the name of the blockio QoS-class resource -+ ClassResourceBlockio = "blockio" ++ // QoSResourceRdt is the name of the RDT QoS-class resource ++ QoSResourceRdt = "rdt" ++ // QoSResourceBlockio is the name of the blockio QoS-class resource ++ QoSResourceBlockio = "blockio" +) ``` @@ -877,7 +875,7 @@ basic support in the container runtimes. ### Kubelet Kubelet will interpret the specific [pod annotations](#pod-annotations) and -translate them into corresponding `ClassResources` data in the CRI +translate them into corresponding `QoSResources` data in the CRI ContainerConfig message at container creation time (CreateContainerRequest). Pod-level QoS-class resources are not supported at this point (via pod annotations). @@ -888,7 +886,7 @@ Resources field in RuntimeStatus message). An admission handler is added to kubelet to validate the QoS-class resource request against the resource availability on the node. Pod is rejected if sufficient resources do not exist. 
-A feature gate ClassResources enables kubelet to interpretthe specific pod +A feature gate QoSResources enables kubelet to interpretthe specific pod annotations. If the feature gate is disabled the annotations are simply ignored by kubelet. @@ -999,7 +997,7 @@ phase if it is not desired in the first round of API changes. Add capacity both to resource discovery. ```diff - message ClassResourceClassInfo { + message QoSResourceClassInfo { // Name of the class string name = 1; + // Capacity is the number of maximum allowed simultaneous assignments into this class @@ -1015,7 +1013,7 @@ Corresponding to the CRI changes, class capacity would be added to the class info in NodeStatus. ```diff - type ClassResourceClassInfo struct { + type QoSResourceClassInfo struct { // Name of the class. Name string + // Capacity is the number of maximum allowed simultaneous assignments into this class @@ -1305,7 +1303,7 @@ well as the [existing list] of feature gates. --> - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: ClassResources + - Feature gate name: QoSResources - Components depending on the feature gate: - Implementation Phase 1: - kubelet From 5890598c902d95620ca1ca2b0daaa62838a47fde Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 6 Jan 2023 12:43:02 +0200 Subject: [PATCH 53/92] KEP-3008: extend CRI ContainerStatus --- keps/sig-node/3008-qos-class-resources/README.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index e626a9bf070..e145fc92379 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -102,6 +102,7 @@ tags, and then generate with `hack/update-toc.sh`. 
- [ContainerConfig](#containerconfig) - [UpdateContainerResourcesRequest](#updatecontainerresourcesrequest) - [PodSandboxConfig](#podsandboxconfig) + - [ContainerStatus](#containerstatus) - [RuntimeStatus](#runtimestatus) - [Consts](#consts) - [Pod annotations](#pod-annotations) @@ -793,6 +794,21 @@ assignments at sandbox creation time (`RunPodSandboxRequest`). +} ``` +#### ContainerStatus + +The `ContainerResources` message (part of `ContainerStatus`) will be extended +to report back QoS-class resource assignments of a container, similar to other +resources. + +```diff +@@ -1251,6 +1269,8 @@ message ContainerResources { + LinuxContainerResources linux = 1; + // Resource limits configuration specific to Windows container. + WindowsContainerResources windows = 2; ++ // Configuration of QoS resources. ++ ContainerQoSResources qos_resources = 3; +``` + #### RuntimeStatus Extend the `RuntimeStatus` message with new `resources` field that is used to From 4fbce4e42aaf8639f4eaa7978b312b3bda379129 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 16 Jan 2023 14:28:57 +0200 Subject: [PATCH 54/92] KEP-3008: ditch Pod annotations as an initial UX Take K8s API in the first implementation phase. --- .../3008-qos-class-resources/README.md | 441 ++++++++++-------- .../3008-qos-class-resources/design.svg | 2 +- 2 files changed, 239 insertions(+), 204 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index e145fc92379..37347360173 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -85,9 +85,8 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Implementation phases](#implementation-phases) - [Phase 1](#phase-1) - [Future work](#future-work) - - [Pod Spec](#pod-spec) - [Update sandbox-level QoS-class resources](#update-sandbox-level-qos-class-resources) - - [Resource status/capacity](#resource-statuscapacity) + - [In-place pod vertical scaling](#in-place-pod-vertical-scaling) - [Access control](#access-control) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) @@ -98,16 +97,20 @@ tags, and then generate with `hack/update-toc.sh`. - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - - [CRI protocol](#cri-protocol) + - [CRI API](#cri-api) - [ContainerConfig](#containerconfig) - [UpdateContainerResourcesRequest](#updatecontainerresourcesrequest) - [PodSandboxConfig](#podsandboxconfig) - [ContainerStatus](#containerstatus) - [RuntimeStatus](#runtimestatus) - [Consts](#consts) - - [Pod annotations](#pod-annotations) + - [Kubernetes API](#kubernetes-api) + - [PodSpec](#podspec) + - [NodeStatus](#nodestatus) - [Kubelet](#kubelet) - [API server](#api-server) + - [Scheduler](#scheduler) + - [Kubectl](#kubectl) - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - [Class capacity](#class-capacity) @@ -135,7 +138,9 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) - - [Pod spec](#pod-spec-1) + - [Pod annotations](#pod-annotations) + - [Kubelet](#kubelet-1) + - [API server](#api-server-1) - [RDT-only](#rdt-only) - [Widen the scope](#widen-the-scope) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -305,8 +310,8 @@ know that this has succeeded? 
- Make QoS-class resources opqaue (as possible) to the CRI client - Discovery of the available QoS-class resources - API changes to support updating Pod-level (sandbox-level) QoS-class resource - assignment of running pods ([future work](#future-work)) -- Resource status/capacity ([future work](#future-work)) + assignment of running pods +- Resource status/capacity - Access control ([future work](#future-work)) ### Non-Goals @@ -323,31 +328,30 @@ and make progress. ## Implementation phases This proposal splits the full implementation of QoS-class resources into -multiple phases, building functionality gradually, step-by-step. The goal is to +multiple phases, building functionality gradually. The goal is to make the discussions more focused and easier. We may also learn on the way, insights from earlier phases affecting design choices made in the later phases, hopefully resulting in a better overall end result. However, we also outline all the future steps to not lose the overall big picture. -The figure below illustrates the design of the full implementation (less quota) -and the division of implementation phases. The first implementation phase -basically covers the communication between kubelet and the container runtime -(i.e. CRI API). All changes to the Kubernetes API and its control plane -components are left to future work. This KEP (the [Proposal](#proposal) and +The figure below illustrates the design of implementation phase 1. The first +implementation phase covers the communication between kubelet and the container +runtime (i.e. CRI API) and Kubernetes API for NodeStatus and PodSpec and their +respective implementations (kubelet and kube-scheduler). Access control (e.g. +resource quota) is left to future work. This KEP (the [Proposal](#proposal) and [Design Details](#design-details)) in its current form implements this first -phase – the KEP will evolve and be supplemented with future phases getting -implemented. +phase.
![design](./design.svg) -In the current design QoS-class resources are designed to be opaque to the CRI -client in the sense that the container runtime takes care of configuration and -control of the resources and the classes within. +QoS-class resources are designed to be opaque to the CRI client in the sense +that the container runtime takes care of configuration and control of the +resources and the classes within. ### Phase 1 -The goal is to enable a bare minimum for users to leverage QoS-class resources -and start experimenting with them in Kubernetes: +The goal is to enable a functional base for users to leverage QoS-class +resources and start experimenting with them in Kubernetes: - extend the CRI protocol to allow QoS-class resource assignment and updates to be communicated from kubelet to the runtime @@ -355,6 +359,9 @@ and start experimenting with them in Kubernetes: resources (the types of resources and the classes within) to kubelet - implement pod annotations as an initial user interface - introduce a feature gate for enabling QoS-class resource support in kubelet +- extend PodSpec to support assignment of QoS-class resources +- extend NodeStatus to show availability/capacity of QoS-class resources on a + node ### Future work @@ -366,66 +373,6 @@ are currently listed as "future work" in [Goals](#goals). In practice, the future work mostly consists of changes to the Kubernetes API and control plane components. -#### Pod Spec - -This future step will replace pod annotations with proper user interface in the -Kubernetes API, i.e. PodSpec. Below, one possible option is presented. - -Introduce a new field (e.g. class) into ResourceRequirements of Container. - -```diff -// ResourceRequirements describes the compute resource requirements. -type ResourceRequirements struct { - // Limits describes the maximum amount of compute resources allowed.
- Limits ResourceList `json:"limits,omitempty" - // Requests describes the minimum amount of compute resources required. - Requests ResourceList `json:"requests,omitempty" -+ // QoSResources specifies the QoS-class resources that the container should be assigned -+ QoSResources map[QoSResourceName]string -} - -+// QoSResourceName is the name of a QoS-class resource. -+type QoSResourceName string -``` - -Also, we add a `Resources` field to the `PodSpec`. We will re-use the existing -`ResourceRequirements` type but Limits and Requests must be left empty. QoSResources -may be set and they represent the Pod-level assignment of QoS-class resources, -comparable to the PodQoSResources message in PodSandboxConfig in the CRI API. - -```diff - type PodSpec struct { -@@ -224,4 +224,8 @@ type PodSpec struct { - // Default to false. - // +optional - SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"` -+ // Pod-level resources. Currently, requests and limits are not allowed -+ // to be specified for pods. -+ // +optional -+ Resources ResourceRequirements - } -``` - -There is already an ongoing effort to add [Pod level resource limits][kep-2837] -that aims at adding a pod level `Resources` field in a similar fashion. - -In practice, the QoS-class resource information will be directly used in the CRI -ContainerConfig (e.g. CreateContainerRequest message). At this point, without -resource discovery or access control kubelet does not do any validity checking -of the values. Invalid class assignments will cause an error in the container -runtime which causes the corresponding CRI RuntimeService request (e.g. -RunPodSandbox or CreateContainer) to fail with an error. - -This phase would likely also wire QoS-class resources to -[In-place pod vertical scaling](#1287), allowing updates of running containers. 
- -Input validation of classes very similar to labels is implemented: keys -(`QoSResourceName`) and values must be non-empty, less than 64 characters -long, must start and end with an alphanumeric character and may contain only -alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). -Similar to labels, a namespace prefix (FQDN subdomain separated with a slash) -in the key is allowed, similar to labels, e.g. `vendor/resource`. - #### Update sandbox-level QoS-class resources This future step would be a second extension to the CRI API. @@ -447,63 +394,12 @@ This will likely require a new API endpoint in CRI: + rpc UpdatePodSandboxConfig(UpdatePodSandboxConfigŔequest) returns (UpdatePodSandboxConfigŔesponse) {} ``` -#### Resource status/capacity - -This future step will add support for representing information about the -available QoS-class resource types (and the classes within each resource type). -This is important for the end users (to see what is available for the pods and -containers to consume) and also an enabler for scheduler support. - -Some alternatives for presenting this information: - -1. Supplement `NodeStatus` - - ```diff - // NodeStatus is information about the current status of a node. - type NodeStatus struct { - // Capacity represents the total resources of a node. - // More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity - // +optional - Capacity ResourceList `json:"capacity,omitempty" protobuf:"bytes,1,rep,name=capacity,casttype=ResourceList,castkey=ResourceName"` - // Allocatable represents the resources of a node that are available for scheduling. - // Defaults to Capacity. - // +optional - Allocatable ResourceList `json:"allocatable,omitempty" protobuf:"bytes,2,rep,name=allocatable,casttype=ResourceList,castkey=ResourceName"` - + // QoSResources contains information about the class resources that are - + // available on the node. - + // +optional - + QoSResources QoSResourceStatus - ... 
- - +// QoSResourceStatus describes class resources available on the node. - +type QoSResourceStatus struct { - + // PodQoSResources contains the class resources that are available for - + // pods to be assigned to. - + PodQoSResources []QoSResourceInfo - + // ContainerQoSResources contains the class resources that are available - + // for containers to be assigned to. - + ContainerQoSResources []QoSResourceInfo - +} - - +// QoSResourceInfo contains information about one class resource type. - +type QoSResourceInfo struct { - + // Name of the resource. - + Name QoSResourceName - + // Mutable is set to true if the resource supports in-place updates. - + Mutable bool - + // Classes available for assignment. - + Classes []QoSResourceClassInfo - +} - - +// QoSResourceClassInfo contains information about single class of one - +// QoS-class resource. - +type QoSResourceClassInfo struct { - + Name string - +} - ``` -1. Separate API objects (e.g. something like `RuntimeClass`). Doesn't - necessarily that neatly align with two level hierarchy (resource name and a - set of classes within). Also, only best suited to homogenous clusters. +#### In-place pod vertical scaling + +Properly integrating QoS-class resources to +[In-place pod vertical scaling](#1287) will allow update of running containers. +However, some QoS-class resources may be immutable, meaning the in-place +updates are not possible. #### Access control @@ -587,6 +483,17 @@ We extend the CRI protocol to contain information about the QoS-class resource assignment of containers and pods. Resource assignment requests will be simple key-value pairs (*resource-type=class-name*) +We also extend the CRI protocol to support updates of QoS-class resource +assignment of running containers. We recognize that currently container +runtimes lack the capability to update either of the two types of QoS-class +resources we have identified (RDT and blockio). 
However, there is no technical +limitation in that and we are planning to implement update support for them +in the future. + +We also extend the Kubernetes API to support the assignment of QoS-class +resources to applications via PodSpec, and to users and the kube-scheduler to +see availability of QoS-class resources on a per-node basis via NodeStatus. + Container runtime is expected to be aware of all resource types and the classes within. The CRI protocol is extended to be able to communicate the available QoS-class resources from the runtime to the client. This information includes: @@ -615,19 +522,8 @@ pod-level QoS-class resources but we see usage scenarios for those in the future (communicating the pod QoS class to the runtime and enabling pod-level cgroup controls for blockio). -We also extend the CRI protocol to support updates of QoS-class resource -assignment of running containers. We recognize that currently container -runtimes lack the capability to update either of the two types of QoS-class -resources we have identified (RDT and blockio). However, there is no technical -limitation in that and we are planning to implement update support for them -in the future. - -We implement pod annotations the initial mechanism for Kubernetes users to -control QoS-class resource assignment. We define two container-level QoS-class -resources that can be controlled via annotations, i.e. RDT and blockio. - -We introduce a feature gate that enables kubelet to interpret pod annotations -for controlling the RDT and blockio class of containers. +We introduce a feature gate that enables Kubernetes components (kubelet, +kube-apiserver, kube-scheduler) to support QoS-class resources. ### User Stories (Optional) @@ -668,14 +564,13 @@ Go in to as much detail as necessary here. This might be a good place to talk about core concepts and how they relate. --> -Implementation Phase 1 is only the first step in getting QoS-class resources -supported in Kubernetes. 
Important pieces like resource assignment via pod -spec, resource status and permission control are [future work](#future-work) -not fully solved here. The risk in this sort of piecemeal -approach is finding devil in the details, resulting in inconsistent and/or -crippled and/or cumbersome end result. However, there is a lot of experience in -extending the API and understanding which sort of solutions are functional and -practical. +The proposal only describes the details of implementation Phase 1 and some +parts like permission control are not covered. The +risk in this piecemeal approach is finding devil in the details, resulting in +inconsistent and/or crippled and/or cumbersome end result. However, the missing +parts are already outlined in [future work](#future-work) and there is a lot +of experience in extending the API and understanding which sort of solutions +are functional and practical, mitigating the problem. ### Risks and Mitigations @@ -694,9 +589,6 @@ Consider including folks who also work outside the SIG or subproject. - User assigning container to “unauthorized” class, causing interference and access to unwanted set/amount of resources. This will be addressed in future KEP introducing permission controls. -- Confusion: user tries to assign container to RDT class but RDT has not been - enabled on system(s). This will be addressed by future KEP(s) introducing - resource availability status. ## Design Details @@ -715,7 +607,7 @@ Configuration and management of the QoS-class resources is fully handled by the underlying container runtime and is invisible to kubelet. An error to the CRI client is returned if the specified class is not available. -### CRI protocol +### CRI API The following additions to the CRI protocol are suggested. @@ -867,40 +759,130 @@ runtime implementations: +) ``` -### Pod annotations +### Kubernetes API -Use Pod annotation as the initial K8s user interface, similar to e.g. how -seccomp support was added.
This will bridge the gap between the first -implementation phase, i.e. enabling QoS-class resources in the CRI protocol, -and the future work which makes them available in the Pod spec. +The PodSpec will be extended to support assignment of pod-level and +container-level QoS-class resources. NodeStatus will be extended to include +information about the available QoS-class resources on a node. -Specifically, annotations for specifying RDT and blockio class will be -supported. These are the two types of QoS-class resources that already have -basic support in the container runtimes. +#### PodSpec -- `rdt.resources.alpha.kubernetes.io/default` for setting a Pod-level default RDT - class for all containers -- `rdt.resources.alpha.kubernetes.io/container.` for - container-specific RDT class settings - blockio class for all containers -- `blockio.resources.alpha.kubernetes.io/default` for setting a Pod-level default - blockio class for all containers -- `blockio.resources.alpha.kubernetes.io/container.` for - container-specific blockio class settings +We introduce a new field, QoSResources into the existing ResourceRequirements +struct. This will enable the assignment of QoS resources for containers. + +```diff + type ResourceRequirements struct { + // Limits describes the maximum amount of compute resources allowed. +@@ -2195,6 +2198,10 @@ type ResourceRequirements struct { + // +featureGate=DynamicResourceAllocation + // +optional + Claims []ResourceClaim ++ // QoSResources specifies the QoS resources. ++ // +optional ++ QoSResources map[QoSResourceName]string + } + ++// QoSResourceName is the name of a QoS resource. ++type QoSResourceName string +``` + +Also, we add a Resources field to the PodSpec, to enable assignment of +pod-level QoS resources. We will re-use the existing ResourceRequirements type +but Limits and Requests and Claims must be left empty.
QoSResources may be set +and they represent the Pod-level assignment of QoS-class resources, +corresponding the PodQoSResources message in PodSandboxConfig in the CRI +API. + +```diff + type PodSpec struct { +@@ -3062,6 +3069,10 @@ type PodSpec struct { + ResourceClaims []PodResourceClaim ++ // Pod-level resources. Claims, requests and limits are not allowed ++ // to be specified for pods. ++ // +optional ++ Resources ResourceRequirements + } +``` + +There is already an ongoing effort to add [Pod level resource limits][kep-2837] +that aims at adding a pod level `Resources` field in a similar fashion. Thus, +we opt for adding ResourceRequirements insted of QoSResources directly into the +PodSpec. + +#### NodeStatus + +We extend NodeStatus to list available QoS-class resources on a node, This +consists of the list of the available QoS-class resource types and the classes +available within each of these resource types. + + + +```diff + type NodeStatus struct { +@@ -4444,6 +4482,11 @@ type NodeStatus struct { + // Status of the config assigned to the node via the dynamic Kubelet config feature. + Config *NodeConfigStatus ++ // QoSResources contains information about the QoS resources that are ++ // available on the node. ++ // +featureGate=QoSResources ++ // +optional ++ QoSResources QoSResourceStatus + } + +... + ++// QoSResourceStatus describes QoS resources available on the node. ++type QoSResourceStatus struct { ++ // PodQoSResources contains the QoS resources that are available for pods ++ // to be assigned to. ++ PodQoSResources []QoSResourceInfo ++ // ContainerQoSResources contains the QoS resources that are available for ++ // containers to be assigned to. ++ ContainerQoSResources []QoSResourceInfo ++} + ++// QoSResourceInfo contains information about one QoS resource type. ++type QoSResourceInfo struct { ++ // Name of the resource. ++ Name QoSResourceName ++ // Mutable is set to true if the resource supports in-place updates. 
++ Mutable bool ++ // Classes available for assignment. ++ Classes []QoSResourceClassInfo ++} + ++// QoSResourceClassInfo contains information about single class of one QoS ++// resource. ++type QoSResourceClassInfo struct { ++ // Name of the class. ++ Name string ++} +``` ### Kubelet -Kubelet will interpret the specific [pod annotations](#pod-annotations) and -translate them into corresponding `QoSResources` data in the CRI -ContainerConfig message at container creation time (CreateContainerRequest). -Pod-level QoS-class resources are not supported at this point (via pod -annotations). +Kubelet gets QoS-class resource assignment from the [PodSpec](#podspec) and +translates these into corresponding `QoSResources` data in the CRI API. This is the +ContainerConfig message at container creation time (CreateContainerRequest) and +PodSandboxConfig at sandbox creation time (RunPodSandboxRequest). In practice, +there is no translation, just copying key-value pairs. Kubelet will receive the information about available QoS-class resources (the types of reqources and their classes) from the runtime over the CRI API (new -Resources field in RuntimeStatus message). An admission handler is added to -kubelet to validate the QoS-class resource request against the resource -availability on the node. Pod is rejected if sufficient resources do not exist. +Resources field in [RuntimeStatus](#runtimestatus) message). The kubelet +updates the new QoSResources field in [NodeStatus](#nodestatus) accordingly, +making QoS-class resources on the node visible to users and the kube-scheduler. + +An admission handler is added into kubelet to validate the QoS-class resource +request against the resource availability on the node. Pod is rejected if +sufficient resources do not exist. + +Invalid class assignments will cause an error in the container runtime which +causes the corresponding CRI RuntimeService request (e.g. RunPodSandbox or +CreateContainer) to fail with an error.
This can happen, despite the admission +check, e.g. when the requested QoS-class resource is not available on the +runtime (e.g. kubelet sends the pod or container start request before getting +an update of changed QoS-class resource availability from the runtime). A feature gate QoSResources enables kubelet to interpretthe specific pod annotations. If the feature gate is disabled the annotations are simply ignored @@ -908,12 +890,30 @@ by kubelet. ### API server -A validation check (core api validation) is added in the API server to reject -changes to the QoS-class resource specific [pod annotations](#pod-annotations) -after a Pod has been created. This ensures that the annotations always reflect -the actual assignment of QoS-class resources of a Pod. It also serves as part -of the UX to indicate the in-place updates of the resources via annotations is -not supported. +Input validation of QoS-class resource names and class names, very similar to +labels is implemented: keys (`QoSResourceName`) and values must be non-empty, +less than 64 characters long, must start and end with an alphanumeric character +and may contain only alphanumeric characters, dashes, underscores or dots (`-`, +`_` or `.`). Also similar to labels, a namespace prefix (FQDN subdomain separated +with a slash) in the key is allowed (e.g. `vendor/qos-resource`). + +### Scheduler + +The kube-scheduler will be extended to do node pre-filtering based on the +QoS-class resource requests of a Pod. If any of the requested QoS-class +resources are unavailable on a node, that node is not fit. If no nodes can +satisfy all of the requests the Pod is marked as unschedulable and will be +scheduled in the future when/if requests can be satisfied by some node. + +In principle, scheduling will behave similarly to native and extended +resources. 
+ +### Kubectl + +The kubectl describe command will be extended to: + +- display available QoS-class resources of nodes +- display QoS-class resource requests of pods ### Container runtimes @@ -1127,6 +1127,7 @@ extending the production code to implement this enhancement. - `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `2022-06-13` - `66.8%` - `k8s.io/kubernetes/pkg/apis/core/validation/validation.go`: `2022-06-13` - `82.1%` +- `k8s.io/kubernetes/pkg/scheduler` ##### Integration tests @@ -1138,7 +1139,8 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> -Alpha: no specific integration tests are planned for Alpha. +Alpha: Existing integration tests for kubelet and kube-scheduler will be +extended to cover QoS-class resources. Beta: Existing integration tests for affected components (e.g. scheduler, node status, quota) are extended to cover QoS-class resources. @@ -1229,10 +1231,9 @@ in back-to-back releases. #### Beta - Gather feedback from developers and surveys -- In addition to the changes in CRI API, implement the following - - Pod spec update - - Resource status/capacity (with scheduling) - - Parmission control +- In addition to the initial changes in CRI API, implement the following + - Extend CRI API to support updating sandbox-level QoS-class resources + - Permission control (resource quota) - Well-defined behavior with [In-place pod vertical scaling](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources) - Additional tests are in Testgrid and linked in KEP - User documentation is available @@ -1753,12 +1754,46 @@ not need to be as detailed as the proposal, but should include enough information to express the idea and why it was not acceptable. --> -### Pod spec +### Pod annotations + +Instead of updating CRI and Kubernetes API in lock-step, the API changes could +be split into two phases. Similar to e.g.
how seccomp support was added, Pod +annotations could be used as an initial user interface before the Kubernetes +API changes are merged. + +Specifically, annotations for specifying RDT and blockio class would be +supported. These are the two types of QoS-class resources that already have +basic support in the container runtimes. + +- `rdt.resources.alpha.kubernetes.io/default` for setting a Pod-level default RDT + class for all containers +- `rdt.resources.alpha.kubernetes.io/container.` for + container-specific RDT class settings + blockio class for all containers +- `blockio.resources.alpha.kubernetes.io/default` for setting a Pod-level default + blockio class for all containers +- `blockio.resources.alpha.kubernetes.io/container.` for + container-specific blockio class settings + +#### Kubelet + +Kubelet would interpret the specific [pod annotations](#pod-annotations) and +translate them into corresponding `QoSResources` data in the CRI +ContainerConfig message at container creation time (CreateContainerRequest). +Pod-level QoS-class resources would not be supported at this point (via pod annotations). + +A feature gate QoSResources would enable kubelet to interpret the specific pod +annotations. If the feature gate is disabled the annotations would simply be +ignored by kubelet. + +#### API server -Instead of introducing Pod annotations as an intermediate solution for -controlling the QoS-class resources, the Pod spec could be updated in lock-step -with the CRI api. See the section [(Future work) Pod spec](#pod-spec) for more -details. +A validation check (core api validation) would be added in the API server to +reject changes to the QoS-class resource specific +[pod annotations](#pod-annotations) after a Pod has been created. This would +ensure that the annotations always reflect the actual assignment of QoS-class +resources of a Pod. It also would serve as part of the UX to indicate that +in-place updates of the resources via annotations are not supported.
### RDT-only diff --git a/keps/sig-node/3008-qos-class-resources/design.svg b/keps/sig-node/3008-qos-class-resources/design.svg index f632363e200..309a257965f 100644 --- a/keps/sig-node/3008-qos-class-resources/design.svg +++ b/keps/sig-node/3008-qos-class-resources/design.svg @@ -1 +1 @@ -PodSpecNode XAPI serverSchedulerruntimesystemkubeletnode Xenforce QoS-class resource assignmentinitialize QoS-classresourcesannounceQoS-class resourcespod is createdfind suitable nodepod is scheduledupdate node statuspod & containerscreated45637218implementation detailsphase 1future phasesresources:classes:qos-res-a: goldqos-res-c: hi-priocapacity:qos-res-a:[gold, silver, bronze]qos-res-b:[red, yellow, green]qos-res-c:[hi-prio, lo-prio] \ No newline at end of file +PodSpecNode XAPI serverSchedulerruntimesystemkubeletnode Xenforce QoS-class resource assignmentinitialize QoS-classresourcesannounceQoS-class resourcespod is createdfind suitable nodepod is scheduledupdate node statuspod & containerscreated45637218implementation detailsresources:classes:qos-res-a: goldqos-res-c: hi-priocapacity:qos-res-a:[gold, silver, bronze]qos-res-b:[red, yellow, green]qos-res-c:[hi-prio, lo-prio] \ No newline at end of file From b89c14b07add31d4a9d88580bba538bbdf490da9 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 16 Jan 2023 19:56:44 +0200 Subject: [PATCH 55/92] KEP-3008: add class capacity to the proposal Support limiting the allowed simultaneous assignments of classes on a node. --- .../3008-qos-class-resources/README.md | 192 ++++++------------ 1 file changed, 59 insertions(+), 133 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 37347360173..990b1267e23 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -113,9 +113,6 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Kubectl](#kubectl) - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - - [Class capacity](#class-capacity) - - [Class capacity with extended resources](#class-capacity-with-extended-resources) - - [Class capacity in the API](#class-capacity-in-the-api) - [Pod QoS class](#pod-qos-class) - [Default class](#default-class) - [Test Plan](#test-plan) @@ -141,6 +138,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Pod annotations](#pod-annotations) - [Kubelet](#kubelet-1) - [API server](#api-server-1) + - [Class capacity with extended resources](#class-capacity-with-extended-resources) - [RDT-only](#rdt-only) - [Widen the scope](#widen-the-scope) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -503,13 +501,10 @@ QoS-class resources from the runtime to the client. This information includes: In-place updates of resoures might not be possible because of runtime limitations or the underlying technology, for example. -QoS-class resources are unbounded in the way that any number of applications -can request the same class (of the same QoS-class resource). QoS-class -resources typically represent some throttling mechanism - in this case for -example, if all (or most of) applications request the highest-tier class all of -these applications get the same level of service. -Future work on [access control(#access-control) will provide mechanisms to -limit what QoS-class resources (or classes) are available to users. +QoS-class resources may be bounded in the way that the number of applications +that can be assigned to a specific class (of one QoS-class resource) on a node +can be limited. This limit is configurable on a per-class (and per-node) +basis. This can be used to e.g. limit access to a high-tier class. Pod-level and container-level QoS-class resources are completely independent resource types. E.g. specifying something in the pod-level request does not @@ -742,6 +737,9 @@ this information into node status.
+message QoSResourceClassInfo { + // Name of the class + string name = 1; ++ // Capacity is the number of maximum allowed simultaneous assignments into this class ++ // Zero means "infinite" capacity i.e. the usage is not restricted. ++ uint64 capacity = 2; } ``` @@ -856,6 +854,10 @@ available within each of these resource types. +type QoSResourceClassInfo struct { + // Name of the class. + Name string ++ // Capacity is the number of maximum allowed simultaneous assignments into this class ++ // Zero means "infinite" capacity i.e. the usage is not restricted ++ // +optional ++ Capacity int64 +} ``` @@ -906,7 +908,8 @@ satisfy all of the requests the Pod is marked as unschedulable and will be scheduled in the future when/if requests can be satisfied by some node. In principle, scheduling will behave similarly to native and extended -resources. +resources. The kube-scheduler will also take into account the per-class capacity +on the nodes if that is provided. ### Kubectl @@ -936,128 +939,6 @@ Container runtimes will be updated to support the ### Open Questions -#### Class capacity - -Resource quota provides a way to restrict access to certain types of QoS-class -resources and their classes. However, the current design does not contain any -mechanism for controlling the number of assigments into a certain class on -per-node or per-namespace basis. That is. say that "only 2 containers can be -assigned to RDT class *gold* on this node". - -This concept was deliberately left out of the proposal in order to reduce -confusion with other countable resources (native resources, extended -resources). Also, capacity does not often fully fit together with QoS -prioritization: generally there needs to be at least one "unlimited" class as -each container or pod needs to belong to some class. - -However, there are at least two mechanisms how class capacity can be -implemented. The first is by leveraging extended resources and an admission -webhook to limit the usage of classes.
The other alternative is to implement -class capacity in the API directly. - -##### Class capacity with extended resources - -It would be possible to implement "class capacity" by leveraging extended -resources and mutating admission webhooks: - -1. An extended resource with the desited capacity is created for each class - which needs to be controlled. A possible example: - ```plaintext - Allocatable: - qos/container/resource-X/class-A: 2 - qos/container/resource-X/class-B: 4 - qos/container/resource-Y/class-foo: 1 - ``` - - In this example *class-A* and *class-B* of QoS-class resource *resource-X* - would be limited to 2 and 4 assignments, respectively (on this node). - Similarly, class *class-foo* of QoS-class resource *resource-Y* would be - limited to only one assignment. - -2. A specific mutating admission webhook is deployed which syncs the specific - extended resources. That is, if a specific QoS-class resource assignment is - requested in the pod spec, a request for the corresponding extended resource - is automatically added. And vice versa: if a specific extended resource is - requested a QoS-class resource request is automatically put in place. - -However, this approach is quite involved, has shortcomings and caveats and thus -it is probably suitable only for targeted usage scenarios, not as a general -solution. Downsides include: - -- requires implementation of "side channel" control mechanisms, e.g. admission - webhook and some solution for capacity management (extended resources) -- deployment of admission webhooks is cumbersome -- management of capacity is limited and cumbersome - - management of extended resources needs a separate mechanism - - ugly to handle a scenario where *class-A* on *node-1* would be unlimited - but on *node-2* it would have some capacity limit (e.g. 
use a very high - integer capacity of the extended resource on *node-2*) - - dynamic changes are harder -- not well-suited for managing pod-level QoS-class resources as pod-level - resource requests (for the extended resource) are not supported in Kubernetes -- possible confusion for users regarding the double accounting (QoS-class - resources and extended resources) - -##### Class capacity in the API - -An alternative is to include the concept of "class capacity" in the APIs -directly. In this solution the per-node capacity would be controlled by the -runtime and visible in node status. Also resource quota could be extended to -limit and track capacity, giving per-namespace capacity control. - -Adding "class capacity" to the APIs could be a separate future implementation -phase if it is not desired in the first round of API changes. - -###### CRI - -Add capacity both to resource discovery. - -```diff - message QoSResourceClassInfo { - // Name of the class - string name = 1; -+ // Capacity is the number of maximum allowed simultaneous assignments into this class -+ // Zero means "infinite" capacity i.e. the usage is not restricted. -+ uint64 capacity = 2; -+ - } -``` - -###### NodeStatus - -Corresponding to the CRI changes, class capacity would be added to the class -info in NodeStatus. - -```diff - type QoSResourceClassInfo struct { - // Name of the class. - Name string -+ // Capacity is the number of maximum allowed simultaneous assignments into this class -+ // Zero means "infinite" capacity i.e. the usage is not restricted -+ // +optional -+ Capacity int64 - } -``` - -###### ResourceQuotaSpec - -ResourceQuota would be supplemented similarly. - -```diff - type AllowedClass struct { - // Name of the class. - Name string -+ // Capacity is the hard limit for usage of the class. -+ // +optional -+ Capacity int64 - } -``` - -It is worth noting that this change in ResourceQuota is independent from the -other. 
That is, resource quota could have "class capacity" even though there -wouldn't be such concept on the node status level. And vice versa: node status -could have "class capacity" without resource quota having this concept. - #### Pod QoS class The Pod QoS class could be communicated to the container runtime as a QoS-class @@ -1795,6 +1676,51 @@ ensure that the annotations always reflect the actual assignment of QoS-class resources of a Pod. It also would serve as part of the UX to indicate the in-place updates of the resources via annotations is not supported. +#### Class capacity with extended resources + +Support for class capacity could be left out of the proposal to simplify the +concept. It would be possible to implement "class capacity" by leveraging +extended resources and mutating admission webhooks: + +1. An extended resource with the desited capacity is created for each class + which needs to be controlled. A possible example: + ```plaintext + Allocatable: + qos/container/resource-X/class-A: 2 + qos/container/resource-X/class-B: 4 + qos/container/resource-Y/class-foo: 1 + ``` + + In this example *class-A* and *class-B* of QoS-class resource *resource-X* + would be limited to 2 and 4 assignments, respectively (on this node). + Similarly, class *class-foo* of QoS-class resource *resource-Y* would be + limited to only one assignment. + +2. A specific mutating admission webhook is deployed which syncs the specific + extended resources. That is, if a specific QoS-class resource assignment is + requested in the pod spec, a request for the corresponding extended resource + is automatically added. And vice versa: if a specific extended resource is + requested a QoS-class resource request is automatically put in place. + +However, this approach is quite involved, has shortcomings and caveats and thus +it is probably suitable only for targeted usage scenarios, not as a general +solution. 
Downsides include: + +- requires implementation of "side channel" control mechanisms, e.g. admission + webhook and some solution for capacity management (extended resources) +- deployment of admission webhooks is cumbersome +- management of capacity is limited and cumbersome + - management of extended resources needs a separate mechanism + - ugly to handle a scenario where *class-A* on *node-1* would be unlimited + but on *node-2* it would have some capacity limit (e.g. use a very high + integer capacity of the extended resource on *node-2*) + - dynamic changes are harder +- not well-suited for managing pod-level QoS-class resources as pod-level + resource requests (for the extended resource) are not supported in Kubernetes +- possible confusion for users regarding the double accounting (QoS-class + resources and extended resources) + + ### RDT-only The scope of the KEP could be narrowed down by concentrating on RDT only, From 9bba6a506386827ac6ba95b316259869cc4eb35b Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 16 Jan 2023 21:00:56 +0200 Subject: [PATCH 56/92] KEP-3008: update future work and clarify "Mutable" Update the section pondering on access control. Drop speculation on separate API objects. Add "class capacity" support to quota. Also clarify the Mutable field of QoSResourceInfo in the CRI API. --- keps/sig-node/3008-qos-class-resources/README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 990b1267e23..0b3992605e8 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -404,11 +404,7 @@ updates are not possible. This future step adds support for controlling the access to available QoS-class resources. -If QoS-class resources were advertised as API objects the natural access -control mechanism would be through RBAC. 
- -If QoS-class resources were advertised in node status (similar to other resources), -access control could be achieved e.g. by extending ResourceQuotaSpec which +Access control could be achieved e.g. by extending ResourceQuotaSpec which would implement restrictions based on the namespace. ```diff @@ -445,6 +441,9 @@ would implement restrictions based on the namespace. + Name QoSResourceName + // Allowed classes. + Classes []string ++ // Capacity is the hard limit for usage of the class. ++ // +optional ++ Capacity int64 +} @@ -726,7 +725,8 @@ this information into node status. +message QoSResourceInfo { + // Name of the QoS resources. + string Name = 1; -+ // Mutable is set to true if this resource sypports dynamic updates. ++ // Mutable is set to true if this resource supports in-place updates i.e. ++ // the class of a running container or sandbox can be changed. + bool Mutable = 2; + // List of classes of this QoS resource. + repeated QoSResourceClassInfo classes = 3; From c0b288989baaf704092f3f6b2aacd9e1031728e0 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 31 Jan 2023 21:28:02 +0200 Subject: [PATCH 57/92] KEP-3008: address review feedback from msau42 --- .../3008-qos-class-resources/README.md | 30 +++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 0b3992605e8..aa28c8be205 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -226,7 +226,9 @@ aimed at enabling) are: With QoS-class resources, Pods and their containers can request opaque QoS-class identifiers (classes) of certain QoS mechanism (QoS-class resource type). Kubelet relays this information to the container runtime, without -directing how the request is enforced in the underlying system. +directing how the request is enforced in the underlying system. 
Being opaque to +Kubernetes means that QoS-class resources have to be supported by the container +runtime as it is responsible for the actual low-level management of them. ## Motivation @@ -355,7 +357,6 @@ resources and start experimenting with them in Kubernetes: be communicated from kubelet to the runtime - extend the CRI protocol to allow runtime to communicate available QoS-class resources (the types of resources and the classes within) to kubelet -- implement pod annotations as an initial user interface - introduce a feature gate for enabling QoS-class resource support in kubelet - extend PodSpec to support assignment of QoS-clsss resources - extend NodeStatus to show availability/capacity of QoS-clsss resources on a @@ -629,6 +630,8 @@ resource assignments to the runtime. +// container. +message ContainerQoSResources { + // QoS resources the container will be assigned to. ++ // Key-value pairs where key is name of the QoS resource and value is the ++ // name of the class. + map classes = 1; +} ``` @@ -676,6 +679,8 @@ assignments at sandbox creation time (`RunPodSandboxRequest`). +// PodQoSResources specifies the configuration of QoS resources of a pod. +message PodQoSResources { + // QoS resources the pod will be assigned to. ++ // Key-value pairs where key is name of the QoS resource and value is the ++ // name of the class. + map classes = 1; +} ``` @@ -807,6 +812,27 @@ that aims at adding a pod level `Resources` field in a similar fashion. Thus, we opt for adding ResourceRequirements insted of QoSResources directly into the PodSpec. 
+As an example, a Pod requesting class "fast" of a (exemplary) pod-level QoS +resource named "network", with one container requesting class "gold" of +container-level QoS resource named "rdt": + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: qos-resource-example +spec: + resources: + qosResources: + network: fast + containers: + - name: cnt + image: nginx + resources: + qosResources: + rdt: gold +``` + #### NodeStatus We extend NodeStatus to list available QoS-class resources on a node, This From 10033ee731a69b87d4de7fe6f1431855aa6bc3f6 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 1 Feb 2023 16:40:00 +0200 Subject: [PATCH 58/92] KEP-3008: stripped some of the comments from the README --- .../3008-qos-class-resources/README.md | 84 ------------------- 1 file changed, 84 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index aa28c8be205..befb3eed368 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -60,22 +60,6 @@ SIG Architecture for cross-cutting KEPs). --> # [KEP-3008](#3008): QoS-class resources - - - - - [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) @@ -188,25 +172,6 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary - - Add support to Kubernetes for declaring _quality-of-service_ resources, and assigning these to Pods. A quality-of-service (QoS-class) resource is similar to other Kubernetes resource types (i.e. native resources such as `cpu` and @@ -232,15 +197,6 @@ runtime as it is responsible for the actual low-level management of them. ## Motivation - - This enhancement proposal aims at improving the quality of service of applications by making available a new type of resource control mechanism in Kubernetes. 
Certain types of resources are inherently shared by applications, @@ -294,11 +250,6 @@ kubelet to the container runtime. ### Goals - - - Make it possible to request QoS-class resources - Support RDT class assignment of containers. This is already supported by the containerd and CRI-O runtime and part of the OCI runtime-spec @@ -316,11 +267,6 @@ know that this has succeeded? ### Non-Goals - - - Interface or mechanism for configuring the QoS-class resources (responsibility of the container runtime). - Enumerating possible (QoS-class) resource types or their detailed behavior @@ -464,15 +410,6 @@ would implement restrictions based on the namespace. ``` ## Proposal - - This section currently covers [implementation phase 1](#phase-1) (see [implementation phases](#implementation-phases) for an outline of the complete implementation). @@ -522,13 +459,6 @@ kube-apiserver, kube-scheduler) to support QoS-class resources. ### User Stories (Optional) - - #### Story 1 As a user I want to minimize the interference of other applications to my @@ -552,13 +482,6 @@ the available memory bandwidth. ### Notes/Constraints/Caveats (Optional) - - The proposal only describes the details of implementation Phase 1 and some parts like permission control are not covered. The risk in this piecemeal approach is finding devil in the details, resulting in @@ -587,13 +510,6 @@ Consider including folks who also work outside the SIG or subproject. ## Design Details - - The detailed design presented here covers [implementation phase 1](#phase-1) (see [implementation phases](#implementation-phases) for an outline of all planned changes). 
From 77b0de411a4d3d388ac2397ad07ee0272ef2119f Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 1 Feb 2023 16:52:00 +0200 Subject: [PATCH 59/92] KEP-3008: update kep.yaml --- keps/sig-node/3008-qos-class-resources/kep.yaml | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index 749d2965f56..2b955f240c1 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -6,8 +6,10 @@ owning-sig: sig-node participating-sigs: [] status: provisional creation-date: 2021-10-07 -reviewers: [] -approvers: [] +reviewers: + - "@rphillips" +approvers: + - "@sig-node-leads" # The target maturity stage in the current dev cycle for this KEP. stage: alpha @@ -28,6 +30,7 @@ feature-gates: components: - kubelet - kube-apiserver + - kube-scheduler disable-supported: true # The following PRR answers are required at beta release From ea3f494de9fbdeadaeaa663596f85b2a866e6a8f Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 7 Feb 2023 12:56:53 +0200 Subject: [PATCH 60/92] KEP-3008: address feedback from SergeyKanzhelev - added LimitRanges in the future work section - added summary describing the properties/proposed implementation details of QoS resources in the design details section - made it clearer that QoS resources configuration can be dynamic - added scheduler improvements to future work section - added kubelet-initiated pod eviction to future work --- .../3008-qos-class-resources/README.md | 125 ++++++++++++++++-- 1 file changed, 112 insertions(+), 13 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index befb3eed368..8eee048aa3c 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -72,6 +72,9 @@ SIG Architecture for cross-cutting KEPs).
- [Update sandbox-level QoS-class resources](#update-sandbox-level-qos-class-resources) - [In-place pod vertical scaling](#in-place-pod-vertical-scaling) - [Access control](#access-control) + - [Scheduler improvements](#scheduler-improvements) + - [Kubelet-initiated pod eviction](#kubelet-initiated-pod-eviction) + - [Default and limits](#default-and-limits) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) @@ -405,9 +408,70 @@ would implement restrictions based on the namespace. + // +optional + QoSResources QoSResourceQuota } +``` + +#### Scheduler improvements + +The first implementation phase only adds basic filtering of nodes based on +QoS-class resources and node scoring is not altered. However, the relevant +scheduler plugins (e.g. NodeResourcesFit and NodeResourcesBalancedAllocation) +could be extended to do scoring based on the capacity, availability and usage +of QoS-class resources on the nodes. + +#### Kubelet-initiated pod eviction +QoS-class resources available on a node are dynamic in the sense that they may +change over the lifetime of the node. E.g. re-configuration of the container +runtime may make new types of QoS-class resources available, properties of +existing resources may changes (e.g. the set of available classes) or some +resources might be removed completely. It might be desirable that kubelet could +evicts running pod that request QoS-class resources that are no more available +on the node. This should be relatively straightforward to implement as kubelet +knows what QoS-class resources are available on the node and also monitors all +running pods. +#### Default and limits + +LimitRanges is a way to specify per-container resource constraints. They might +have limited applicability also in the context of QoS-class resources. In +particular, LimitRanges could be used for specifying defaults for QoS-class +resources. 
An additional possible usage would be to set per-pod limits on the +usage of container-level QoS-class resources. + +```diff + // LimitRangeItem defines a min/max usage limit for any resource that matches on kind + type LimitRangeItem struct { +@@ -5167,6 +5167,28 @@ type LimitRangeItem struct { + // MaxLimitRequestRatio represents the max burst value for the named resource + // +optional + MaxLimitRequestRatio ResourceList ++ // QoSResources specifies the limits for QoS resources. ++ QoSResources []LimitQoSResource ++} ++ ++// LimitQoSResource specifies limits of one QoS resources type. ++type LimitQoSResource struct { ++ // Name of the resource. ++ Name QoSResourceName ++ // Default specifies the default class to be assigned. ++ // +optional ++ Default string ++ // Max usage of classes ++ // +optional ++ Max []QoSResourceClassLimit ++} ++ ++// QoSResourceClassLimit specifies a limit for one class of a QoS resource. ++type QoSResourceClassLimit struct { ++ // Name of the class. ++ Name string ++ // Capacity is the limit for usage of the class. ++ Capacity int64 + } ``` + +Just using LimitRanges for specifying defaults could simplify the API. + ## Proposal This section currently covers [implementation phase 1](#phase-1) (see @@ -512,11 +576,36 @@ Consider including folks who also work outside the SIG or subproject. The detailed design presented here covers [implementation phase 1](#phase-1) (see [implementation phases](#implementation-phases) for an outline of all -planned changes). +planned changes). Configuration and management of the QoS-class resources is +fully handled by the underlying container runtime and is invisible to kubelet. 
+ +Summary of the proposed design details: + +- QoS resources are opaque (just names) to kubernetes, configuration and + management of QoS resources is handled in the container runtime + - no "system reserved" (or equivalent) exists (considered as configuration + detail outside Kubernetes) + - no overprovisioning (or auto-promotion to free classes) exists (considered + as implementation detail of specific QoS-class resource, outside + Kubernetes) +- runtime advertises available QoS resources to kubelet + - QoS resources are dynamic i.e. can change during the lifetime of the node + - CRI [RuntimeStatus](#runtimestatus) message is used for carrying the information +- kubelet + - updates NodeStatus based on the available QoS-class resources advertised by + the runtimea + - relays QoS-class resource requests from PodSpec to the runtime at pod + startup via [ContainerConfig](#containerconfig) and + [PodSandboxConfig](#podsandboxconfig) messages + - implements admission handler that validates the availability of QoS-class resources + - pod eviction as a possible [future improvement](#kubelet-initiated-pod-eviction) +- scheduler + - simple node filtering based on the availability of QoS-class resources in + the first implementation phase (with possible + [future improvements](#scheduler-improvements)) + - pod priority based eviction is in effect +- quota, limits and defaults are [future work](#future-work) items -Configuration and management of the QoS-class resources is fully handled by the -underlying container runtime and is invisible to kubelet. An error to the CRI -client is returned if the specified class is not available. ### CRI API @@ -620,10 +709,9 @@ resources. Extend the `RuntimeStatus` message with new `resources` field that is used to communicate the available QoS-class resources from the runtime to the client. - -This information can be used by the client (kubelet) to validate QoS-class -resource assignments before starting a pod. 
In future steps kubelet will patch -this information into node status. +This information is dynamic i.e. the available QoS-class resources (or their +properties like available classes or per-class capacity) may change over time, +e.g. after re-configuration of the runtime. ```diff message RuntimeStatus { @@ -816,10 +904,12 @@ types of reqources and their classes) from the runtime over the CRI API (new Resources field in [RuntimeStatus](#runtimestatus) message). The kubelet updates the new QoSResources field in [NodeStatus](#nodestatus) accordingly, making QoS-class resources on the node visible to users and the kube-scheduler. +This information is dynamic, i.e. the available QoS-class resources (or their +properties) may change over time. An admission handler is added into kubelet to validate the QoS-class resource request against the resource availability on the node. Pod is rejected if -sufficient resources do not exist. +sufficient resources do not exist. This also applies to static pods. Invalid class assignments will cause an error in the container runtime which causes the corresponding CRI RuntimeService request (e.g. RunPodSandbox or @@ -828,9 +918,11 @@ check, e.g. when the requested QoS-class resource is not available on the runtime (e.g. kubelet sends the pod or container start request before getting an update of changed QoS-class resource availability from the runtime). -A feature gate QoSResources enables kubelet to interpretthe specific pod -annotations. If the feature gate is disabled the annotations are simply ignored -by kubelet. +No kubelet-initiated pod eviction is implemented in the first implementation +phase. + +A feature gate QoSResources enables kubelet to update the QoS-class resources +in NodeStatus and handle QoS-class resource requests in the PodSpec. ### API server @@ -851,7 +943,14 @@ scheduled in the future when/if requests can be satisfied by some node. In principle, scheduling will behave similarly to native and extended resources. 
The kube-scheduler will also take in account the per-class capacity -on the nodes if that is provided. +on the nodes if that is provided. No QoS-class specific scoring is added in the +the first implementation phase, thus all nodes satisfying the requests are +equally good (in terms of QoS-class resources). More diverese scheduling is +part of the [future work](#future-work). + +Pod eviction based on pod priority works similarly to other other resources: +lower priority pod(s) are evicted if deleting them makes a higher priority pod +schedulable (in this case by freeing up QoS-class resources). ### Kubectl From 00290e67556764a55672cc3f651c1aa8b315e9f6 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 7 Feb 2023 16:20:06 +0200 Subject: [PATCH 61/92] KEP-3008: address feedback from thockin and sftim - speculate on Kubernetes/kubelet internal QoS-resources under "Open Questions" - move "UpdatePodSandboxConfig" CRI API extenstion to non-goals - attempts for clearer wording in some places - update PodSpec: instead of using simple map[string]string specify separate struct types for QoS resource requests - add "Implicit defaults" to future work - update "summary list" in design details - added an example kubectl output for "describe node" - add "official" kubernetes QoS-class resource names to the API - add possibility to specify container-level QoS resources in pod-level request (to denote defaults for all containers within a pod) - changed names of annotations (dropped 'alpha') from the "Alternatives / Pod annotations" section --- .../3008-qos-class-resources/README.md | 241 +++++++++++++----- 1 file changed, 171 insertions(+), 70 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 8eee048aa3c..3754e338894 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -69,12 +69,12 @@ SIG Architecture for cross-cutting KEPs). 
- [Implementation phases](#implementation-phases) - [Phase 1](#phase-1) - [Future work](#future-work) - - [Update sandbox-level QoS-class resources](#update-sandbox-level-qos-class-resources) - [In-place pod vertical scaling](#in-place-pod-vertical-scaling) - [Access control](#access-control) - [Scheduler improvements](#scheduler-improvements) - [Kubelet-initiated pod eviction](#kubelet-initiated-pod-eviction) - [Default and limits](#default-and-limits) + - [Implicit defaults](#implicit-defaults) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) @@ -90,10 +90,10 @@ SIG Architecture for cross-cutting KEPs). - [PodSandboxConfig](#podsandboxconfig) - [ContainerStatus](#containerstatus) - [RuntimeStatus](#runtimestatus) - - [Consts](#consts) - [Kubernetes API](#kubernetes-api) - [PodSpec](#podspec) - [NodeStatus](#nodestatus) + - [Consts](#consts) - [Kubelet](#kubelet) - [API server](#api-server) - [Scheduler](#scheduler) @@ -101,7 +101,7 @@ SIG Architecture for cross-cutting KEPs). - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - [Pod QoS class](#pod-qos-class) - - [Default class](#default-class) + - [Other Kubernetes-managed QoS-class resources](#other-kubernetes-managed-qos-class-resources) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -263,8 +263,6 @@ kubelet to the container runtime. resource types in the future. - Make QoS-class resources opqaue (as possible) to the CRI client - Discovery of the available QoS-class resources -- API changes to support updating Pod-level (sandbox-level) QoS-class resource - assignment of running pods - Resource status/capacity - Access control ([future work](#future-work)) @@ -272,7 +270,9 @@ kubelet to the container runtime. - Interface or mechanism for configuring the QoS-class resources (responsibility of the container runtime). 
-- Enumerating possible (QoS-class) resource types or their detailed behavior +- Specifying available (QoS-class) resource types or their detailed behavior +- API changes to support updating Pod-level (sandbox-level) QoS-class resource + assignment of running pods (will be subject to a separate KEP) ## Implementation phases @@ -321,27 +321,6 @@ are currently listed as "future work" in [Goals](#goals). In practice, the future work mostly consists of changes to the Kubernetes API and control plane components. -#### Update sandbox-level QoS-class resources - -This future step would be a second extension to the CRI API. - -Currently there is no endpoint in the CRI API to update the configuration of -pod sandboxes. In contrast, container-level resources can be updated with the -UpdateContainerResources API endpoint. In order to make container and pod -(sandbox) level QoS-class resources symmetric we want to make it possible to -update of pod-level resource assignments, too. - -This will likely require a new API endpoint in CRI: - -```diff -@@ -38,6 +38,8 @@ service RuntimeService { - // RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure - // the sandbox is in the ready state on success. - rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {} -+ // UpdatePodSandboxConfig updates the configuration of an existing pod sandbox. -+ rpc UpdatePodSandboxConfig(UpdatePodSandboxConfigŔequest) returns (UpdatePodSandboxConfigŔesponse) {} -``` - #### In-place pod vertical scaling Properly integrating QoS-class resources to @@ -472,6 +451,26 @@ usage of container-level QoS-class resources. Just using LimitRanges for specifying defaults could simplify the API. +#### Implicit defaults + +If nothing is requested for a QoS-class resource, a pod/container still +implicitly belongs to some class. By design there is no such thing as being in +no QoS class. 
What this "unnamed" default class means or how it is handled is +considered an implementation detail of the runtime. However, it would be +potentially desirable for the user to be able to see what was classes the +application was implicitly assigned to (even if nothing was explicitly +requested). The first implementation phase contains no mechanism for this. + +Some considerations/questions: + +- where can user see what was implicitly given? new field in PodStatus? +- how to communicate what is effective for each pod and container +- implicit default might change after runtime re-configuration +- new QoS resource types might be added e.g. because of re-configuration of the + runtime +- PodSandboxStatus and ContainerStatus messages in CRI API could be used to + communicate selected/effextive defaults to the client (kubelet) + ## Proposal This section currently covers [implementation phase 1](#phase-1) (see @@ -507,16 +506,18 @@ that can be assigned to a specific class (of one QoS-class resource) on a node can be limited. This limit is configuratble on a per-class (and per-node) basis. This can be used to e.g. limit access to a high-tier class. -Pod-level and container-level QoS-class resources are completely independent -resource types. E.g. specifying something in the pod-level request does not -mean specifying a pod-level default for all containers of the pod. +Pod-level and container-level QoS-class resources are independent resource +types. However, specifying a container-level QoS-class resource something in +the pod-level request in PodSpec will be regarded by Kubernetres as a default +for all containers of the pod. Currently we identify two types of container-level QoS-class resources (RDT and blockio) but the API changes will be generic so that it will serve other -similar resources in the future. 
Currently there are no immediately enabled -pod-level QoS-class resources but we see usage scenarios for those in the -future (communicating the pod QoS class to the runtime and enabling pod-level -cgroup controls for blockio). +similar resources in the future. Currently there are no pod-level QoS-class +resources that would be enabled immediately after the new APIs are available +but we see usage scenarios for those in the future (communicating the pod QoS +class to the runtime and enabling pod-level cgroup controls for blockio, +network QoS). We introduce a feature gate that enables Kubernetes components (kubelet, kube-apiserver, kube-scheduler) to support QoS-class resources. @@ -571,6 +572,16 @@ Consider including folks who also work outside the SIG or subproject. - User assigning container to “unauthorized” class, causing interference and access to unwanted set/amount of resources. This will be addressed in future KEP introducing permission controls. +- Cross-registering pod and container-level QoS-class resources: it would be + possible to register a resource name as a pod-level resource on one node and + as a container-level resource on one node. This would obviously be confusing + for the user but also cause surprises in scheduling in some cases as + container-level QoS-class resources may be specified in pod-level request + field to denote pod-wide defaults for all containers within. 
This will be + mitigated in the future with "official" reqource names (we know which names + are pod-level resources and which container-level) but there is still + possibility to mix-up vendor-specific QoS-class resources, although the risk + for this is relatively small (severe misconfiguration) ## Design Details @@ -588,6 +599,18 @@ Summary of the proposed design details: - no overprovisioning (or auto-promotion to free classes) exists (considered as implementation detail of specific QoS-class resource, outside Kubernetes) + - not requesting anything implies a "default" that is an implementation + detail on the container runtime side +- Pod and container level QoS-class resources are separate concepts + - Pod-level QoS-class resources are meant for sandbox-level QoS + - Pod-level QoS-class resources cannot be specified in container resources field + - container-level QoS-class resources are meant for per-container QoS + - container-level QoS-class resources can be specified in pod-level request + field (in PodSpec) in which case it is regarded as a default for all + containers within that pod + - resource names across pod and container level QoS-class resource names + must be unique - i.e. the runtime cannot register a pod-level and + container-level resource with the same name - runtime advertises available QoS resources to kubelet - QoS resources are dynamic i.e. can change during the lifetime of the node - CRI [RuntimeStatus](#runtimestatus) message is used for carrying the information @@ -713,6 +736,10 @@ This information is dynamic i.e. the available QoS-class resources (or their properties like available classes or per-class capacity) may change over time, e.g. after re-configuration of the runtime. +Names of the QoS-class resources must be unique, also across pod-level and +container-level resources. I.e. the same name must not be advertised as both a +pod-level and container-level resource. 
+ ```diff message RuntimeStatus { // List of current observed runtime conditions. @@ -732,7 +759,8 @@ e.g. after re-configuration of the runtime. +// QoSResourceInfo contains information about one type of QoS resource. +message QoSResourceInfo { -+ // Name of the QoS resources. ++ // Name of the QoS resources. Name must be unique, also across pod and ++ // container level QoS resources. + string Name = 1; + // Mutable is set to true if this resource supports in-place updates i.e. + // the class of a running container or sandbox can be changed. @@ -752,20 +780,6 @@ e.g. after re-configuration of the runtime. } ``` -#### Consts - -Also, define "known" QoS-class resource types to more easily align container -runtime implementations: - -```diff -+const ( -+ // QoSResourceRdt is the name of the RDT QoS-class resource -+ QoSResourceRdt = "rdt" -+ // QoSResourceBlockio is the name of the blockio QoS-class resource -+ QoSResourceBlockio = "blockio" -+) -``` - ### Kubernetes API The PodSpec will be extended to support assignment of pod-level and @@ -784,13 +798,21 @@ struct. This will enable the assignment of QoS resources for containers. // +featureGate=DynamicResourceAllocation // +optional Claims []ResourceClaim -+ // QoSResources specifies the QoS resources. ++ // QoSResources specifies the requested QoS resources. + // +optional -+ QoSResources map[QoSResourceName]string ++ QoSResources []QoSResourceRequest } +// QoSResourceName is the name of a QoS resource. +type QoSResourceName string + ++// QoSResourceRequest specifies a request for one QoS resource type. ++type QoSResourceRequest struct { ++ // Name of the QoS resource. ++ Name QoSResourceName ++ // Name of the class (inside the QoS resource type specified by Name field). ++ Class string ++} ``` Also, we add a Resources field to the PodSpec, to enable assignment of @@ -804,17 +826,28 @@ API. 
type PodSpec struct { @@ -3062,6 +3069,10 @@ type PodSpec struct { ResourceClaims []PodResourceClaim -+ // Pod-level resources. Claims, requests and limits are not allowed -+ // to be specified for pods. ++ // QoSResources specifies the Pod-level requests of QoS resources. ++ // Container-level QoS resources may be specified in which case they ++ // are considered as a default for all containers within the Pod. ++ // +featureGate=QoSResources + // +optional -+ Resources ResourceRequirements ++ QoSResources []PodQoSResourceRequest } + ++// PodQoSResourceRequest specifies a request for one QoS resource type for a ++// Pod. ++type PodQoSResourceRequest struct { ++ // Name of the QoS resource. ++ Name QoSResourceName ++ // Name of the class (inside the QoS resource type specified by Name field). ++ Class string ++} ``` -There is already an ongoing effort to add [Pod level resource limits][kep-2837] -that aims at adding a pod level `Resources` field in a similar fashion. Thus, -we opt for adding ResourceRequirements insted of QoSResources directly into the -PodSpec. +There is an ongoing effort to add [Pod level resource limits][kep-2837] that +aims at adding a pod level `Resources`. However, we propose to add a distinct +QoSResources field (with a distinct PodQoSResourceRequest type) in the PodSpec +in order to decouple dependencies between types and fields. As an example, a Pod requesting class "fast" of a (exemplary) pod-level QoS resource named "network", with one container requesting class "gold" of @@ -834,7 +867,8 @@ spec: image: nginx resources: qosResources: - rdt: gold + - name: rdt + class: gold ``` #### NodeStatus @@ -891,6 +925,39 @@ available within each of these resource types. +} ``` +#### Consts + +We define standard well-known QoS-class resource types in the API. These are +considered as the canonical list of "official" QoS-class resource names. 
In +addition to the name the API documents the expected high-level behavior and +semantics of these canonical QoS-class resources to provide common standards +across different implementations. + +<<[UNRESOLVED @sftim]>> +The canonical Kubernetes names for QoS-class resources are non-namespaced (i.e. +without a `/` prefix). Namespaced (or fully qualified) names like +`example.com/acme-qos` are not controlled and are meant for e.g. vendor or +application specific QoS implementations. +<<[/UNRESOLVED]>> + +```diff ++const ( ++ // QoSResourceRdt is the name of the QoS-class resource named IntelRDT ++ // in the OCI runtime spec and interfaced through the resctrlfs ++ // pseudp-filesystem in Linux. This is a container-level reosurce. ++ QoSResourceIntelRdt = "rdt" ++ // QoSResourceBlockio is the name of the blockio QoS-class resource. ++ // This is a container-level resource. ++ QoSResourceBlockio = "blockio" ++) +``` + +In later implementation phases (GA) admission control (validation) is added to +reject requests for unknown QoS-class resources in the "official" namespace. +Also, kubelet will reject the registration of unknown QoS-class resources in +the "official" namespace. Custom/vendor-specific QoS-class resources will still +be allowed outside the "official" namespace. + ### Kubelet Kubelet gets QoS-class resource assignment from the [PodSpec](#podspec) and @@ -905,7 +972,9 @@ Resources field in [RuntimeStatus](#runtimestatus) message). The kubelet updates the new QoSResources field in [NodeStatus](#nodestatus) accordingly, making QoS-class resources on the node visible to users and the kube-scheduler. This information is dynamic, i.e. the available QoS-class resources (or their -properties) may change over time. +properties) may change over time. QoS-class resource names must be unique, i.e. +kubelet will refuse to register pod-level and container-level QoS-class +resource with the same name. 
An admission handler is added into kubelet to validate the QoS-class resource request against the resource availability on the node. Pod is rejected if @@ -931,7 +1000,11 @@ labels is implemented: keys (`QoSResourceName`) and values must be non-empty, less than 64 characters long, must start and end with an alphanumeric character and may contain only alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). Also similar to labels, a namespace prefix (FQDN subdomain separated -with a slash) in the key is allowed (e.g. `vendor/qos-resource`). +with a slash) in the key is allowed (e.g. `vendor.example/qos-resource`). + +Official canonical names for well-known QoS-class resources are specified in +the API. In later implementation phases admission (validation) for their usage +is implemented (see [Consts](#consts) for more details). ### Scheduler @@ -959,6 +1032,25 @@ The kubectl describe command will be extended to: - available QoS-class resources of nodes - display QoS-class resource requests of pods +For example, regarding available QoS-class resources on a node, `kubectl +describe node` could show output something like this: + +``` +... +Pod QoS resources: + Name Classes (capacity) + ---- ------------------ + net slow(inf) normal (8) fast (2) + qoz-xyz cls-1 (inf) cls-2 (1) cls-3 (5) cls-4 (10) + +Container QoS resources: + Name Mutable Classes (capacity) + ---- ------- ------------------ + blockio No low-prio (inf) normal (10) high-prio (3) + rdt No bronze (inf) gold (1) silver (2) +... +``` + ### Container runtimes Currently, there is support (container-level QoS-class resources) for Intel RDT @@ -995,12 +1087,17 @@ the pod qos class in the future. The runtime could provide a set of OOM classes, making it possible for the user to specify a burstable pod with low oom priority (low chance of being killed). -#### Default class +#### Other Kubernetes-managed QoS-class resources -A mechanism for indicating that the (runtime) default class should be used. 
The -default class would/should be a node/runtime specific attribute. How should -this be specified in the CRI protocol/`cri-api` and Pod spec? +In addition to the Pod QoS class it would be possible to specify other +QoS-class resources that would be managed by Kubernetes/kubelet. If we specify +and manage official well-known QoS-class resource names in the API it would be +possible to specify Kubernetes-internal names that the container runtime would +know to ignore (or not try to manage itself). +One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 +memory knobs in linux (see +[KEP-2570: Support Memory QoS with cgroups v2][kep-2570]. ### Test Plan @@ -1164,7 +1261,10 @@ in back-to-back releases. - More rigorous forms of testing—e.g., downgrade tests and scalability tests - Allowing time for feedback - +- In addition to beta: + - enforce admission control on "official" QoS-class resource names + - kubelet rejects registration of unknown QoS-class resorce names in the + "official" namespace ### Upgrade / Downgrade Strategy @@ -1687,14 +1787,14 @@ Specifically, annotations for specifying RDT and blockio class would be supported. These are the two types of QoS-class resources that already have basic support in the container runtimes. 
-- `rdt.resources.alpha.kubernetes.io/default` for setting a Pod-level default RDT +- `rdt.resources.kubernetes.io/default` for setting a Pod-level default RDT class for all containers -- `rdt.resources.alpha.kubernetes.io/container.` for +- `rdt.resources.kubernetes.io/container.` for container-specific RDT class settings blockio class for all containers -- `blockio.resources.alpha.kubernetes.io/default` for setting a Pod-level default +- `blockio.resources.kubernetes.io/default` for setting a Pod-level default blockio class for all containers -- `blockio.resources.alpha.kubernetes.io/container.` for +- `blockio.resources.kubernetes.io/container.` for container-specific blockio class settings #### Kubelet @@ -1796,3 +1896,4 @@ required. [intel-rdt]: https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html [linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html [kep-2837]: https://github.com/kubernetes/enhancements/pull/1592 +[kep-2570]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos From 8805ce25bc30c35b1c038ce437594d90a51f06d6 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 8 Feb 2023 11:10:50 +0200 Subject: [PATCH 62/92] KEP-3008: address review comments - clarify the expected behavior on CRI UpdateContainerResources request - update / clearer wording (limit ranges, design details) - fix typos - slightly updated the "Implicit defaults" section - renamed code/yaml snippets and feature gate QoSResource -> QOSResource - marked Capacity field in QOSResourceClassInfo in NodeStatus as UNRESOLVED --- .../3008-qos-class-resources/README.md | 190 +++++++++--------- 1 file changed, 99 insertions(+), 91 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 3754e338894..ea2aba00223 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ 
b/keps/sig-node/3008-qos-class-resources/README.md @@ -348,26 +348,26 @@ would implement restrictions based on the namespace. // object tracked by a quota but expressed using ScopeSelectorOperator in combination // with possible values. ScopeSelector *ScopeSelector -+ // QoSResources contains the desired set of allowed class resources. -+ // +featureGate=QoSResources ++ // QOSResources contains the desired set of allowed class resources. ++ // +featureGate=QOSResources + // +optional -+ QoSResources QoSResourceQuota ++ QOSResources QOSResourceQuota } -+// QoSResourceQuota contains the allowed class resources. -+type QoSResourceQuota struct { ++// QOSResourceQuota contains the allowed class resources. ++type QOSResourceQuota struct { + // Pod contains the allowed class resources for pods. + // +optional -+ Pod []AllowedQoSResource ++ Pod []AllowedQOSResource + // Container contains the allowed class resources for pods. + // +optional -+ Container []AllowedQoSResource ++ Container []AllowedQOSResource +} -+// AllowedQoSResource specifies access to one QoS-class resources type. -+type AllowedQoSResource struct { ++// AllowedQOSResource specifies access to one QoS-class resources type. ++type AllowedQOSResource struct { + // Name of the resource. -+ Name QoSResourceName ++ Name QOSResourceName + // Allowed classes. + Classes []string + // Capacity is the hard limit for usage of the class. @@ -382,10 +382,10 @@ would implement restrictions based on the namespace. // Used is the current observed total usage of the resource in the namespace // +optional Used ResourceList -+ // QoSResources contains the enforced set of available class resources. -+ // +featureGate=QoSResources ++ // QOSResources contains the enforced set of available class resources. ++ // +featureGate=QOSResources + // +optional -+ QoSResources QoSResourceQuota ++ QOSResources QOSResourceQuota } ``` @@ -424,24 +424,24 @@ usage of container-level QoS-class resources. 
// MaxLimitRequestRatio represents the max burst value for the named resource // +optional MaxLimitRequestRatio ResourceList -+ // QoSResources specifies the limits for QoS resources. -+ QoSResources []LimitQoSResource ++ // QOSResources specifies the limits for QoS resources. ++ QOSResources []LimitQOSResource +} + -+// LimitQoSResource specifies limits of one QoS resources type. -+type LimitQoSResource struct { ++// LimitQOSResource specifies limits of one QoS resources type. ++type LimitQOSResource struct { + // Name of the resource. -+ Name QoSResourceName ++ Name QOSResourceName + // Default specifies the default class to be assigned. + // +optional + Default string + // Max usage of classes + // +optional -+ Max []QoSResourceClassLimit ++ Max []QOSResourceClassLimit +} + -+// QoSResourceClassLimit specifies a limit for one class of a QoS resource. -+type QoSResourceClassLimit struct { ++// QOSResourceClassLimit specifies a limit for one class of a QoS resource. ++type QOSResourceClassLimit struct { + // Name of the class. + Name string + // Capacity is the limit for usage of the class. @@ -449,7 +449,8 @@ usage of container-level QoS-class resources. } ``` -Just using LimitRanges for specifying defaults could simplify the API. +Not supporting Max (i.e. only supporting Default) in LimitRanges could simplify +the API. #### Implicit defaults @@ -463,13 +464,13 @@ requested). The first implementation phase contains no mechanism for this. Some considerations/questions: -- where can user see what was implicitly given? new field in PodStatus? -- how to communicate what is effective for each pod and container +- new field in PodStatus could be used for this - implicit default might change after runtime re-configuration - new QoS resource types might be added e.g. 
because of re-configuration of the runtime - PodSandboxStatus and ContainerStatus messages in CRI API could be used to - communicate selected/effextive defaults to the client (kubelet) + communicate selected/effextive defaults to the client (kubelet) and kubelet + would update Pod status accordingly ## Proposal @@ -593,7 +594,7 @@ fully handled by the underlying container runtime and is invisible to kubelet. Summary of the proposed design details: - QoS resources are opaque (just names) to kubernetes, configuration and - management of QoS resources is handled in the container runtime + management of QoS resources is handled in the node (container runtime, kubelet) - no "system reserved" (or equivalent) exists (considered as configuration detail outside Kubernetes) - no overprovisioning (or auto-promotion to free classes) exists (considered @@ -651,12 +652,12 @@ resource assignments to the runtime. WindowsContainerConfig windows = 16; + + // Configuration of QoS resources. -+ ContainerQoSResources qos_resources = 17; ++ ContainerQOSResources qos_resources = 17; } -+// ContainerQoSResources specifies the configuration of QoS resources of a ++// ContainerQOSResources specifies the configuration of QoS resources of a +// container. -+message ContainerQoSResources { ++message ContainerQOSResources { + // QoS resources the container will be assigned to. + // Key-value pairs where key is name of the QoS resource and value is the + // name of the class. @@ -668,10 +669,15 @@ resource assignments to the runtime. Similar to `CreateContainerRequest`, the `UpdateContainerResourcesRequest` message will extended to allow updating of QoS-class resource configuration of -a running container. Depending on runtime-level support of a particular -resource (and possibly the type of resource) UpdateContainerResourcesRequest -might fail. Resource discovery (see [Runtime status](#runtime-status)) has -the capability to distinguish immutable resource types. +a running container. 
Depending on runtime-level support of a particular +resource (and possibly the type of resource) in-place updates of running +containers might not be possible. Resource discovery (see +[Runtime status](#runtime-status)) has the capability to distinguish whether a +particular QoS-class resource supports in-place updates (the Mutable field in +QOSResourceInfo message). UpdateContainerResources must be atomic so the +runtime must fail early (before e.g. calling the OCI runtime to make any +changes to other resources) if an attempt to update an "immutable" QoS-class +resource is requested. Note that neither of the existing QoS-class resource types (RDT or blockio) support updates because of runtime limitations, yet. @@ -683,7 +689,7 @@ support updates because of runtime limitations, yet. // resources to update or other options to use when updating the container. map annotations = 4; + // Configuration of QoS resources. -+ ContainerQoSResources qos_resources = 5; ++ ContainerQOSResources qos_resources = 5; } ``` @@ -701,11 +707,11 @@ assignments at sandbox creation time (`RunPodSandboxRequest`). // Optional configurations specific to Windows hosts. WindowsPodSandboxConfig windows = 9; + // Configuration of QoS resources. -+ PodQoSResources qos_resources = 10; ++ PodQOSResources qos_resources = 10; } -+// PodQoSResources specifies the configuration of QoS resources of a pod. -+message PodQoSResources { ++// PodQOSResources specifies the configuration of QoS resources of a pod. ++message PodQOSResources { + // QoS resources the pod will be assigned to. + // Key-value pairs where key is name of the QoS resource and value is the + // name of the class. @@ -725,7 +731,7 @@ resources. // Resource limits configuration specific to Windows container. WindowsContainerResources windows = 2; + // Configuration of QoS resources. 
-+ ContainerQoSResources qos_resources = 3; ++ ContainerQOSResources qos_resources = 3; ``` #### RuntimeStatus @@ -752,13 +758,13 @@ pod-level and container-level resource. +// runtime. +message ResourcesInfo { + // Pod-level QoS resources available. -+ repeated QoSResourceInfo pod_qos_resources = 1; ++ repeated QOSResourceInfo pod_qos_resources = 1; + // Container-level QoS resources available. -+ repeated QoSResourceInfo container_qos_resources = 2; ++ repeated QOSResourceInfo container_qos_resources = 2; +} -+// QoSResourceInfo contains information about one type of QoS resource. -+message QoSResourceInfo { ++// QOSResourceInfo contains information about one type of QoS resource. ++message QOSResourceInfo { + // Name of the QoS resources. Name must be unique, also across pod and + // container level QoS resources. + string Name = 1; @@ -766,12 +772,12 @@ pod-level and container-level resource. + // the class of a running container or sandbox can be changed. + bool Mutable = 2; + // List of classes of this QoS resource. -+ repeated QoSResourceClassInfo classes = 3; ++ repeated QOSResourceClassInfo classes = 3; +} -+// QoSResourceClassInfo contains information about one class of certain ++// QOSResourceClassInfo contains information about one class of certain +// QoS resource. -+message QoSResourceClassInfo { ++message QOSResourceClassInfo { + // Name of the class + string name = 1; + // Capacity is the number of maximum allowed simultaneous assignments into this class @@ -788,7 +794,7 @@ information about the available QoS-class resources on a node. #### PodSpec -We introduce a new field, QoSResources into the existing ResourceRequirements +We introduce a new field, QOSResources into the existing ResourceRequirements struct. This will enable the assignment of QoS resources for containers. ```diff @@ -798,18 +804,18 @@ struct. This will enable the assignment of QoS resources for containers. 
// +featureGate=DynamicResourceAllocation // +optional Claims []ResourceClaim -+ // QoSResources specifies the requested QoS resources. ++ // QOSResources specifies the requested QoS resources. + // +optional -+ QoSResources []QoSResourceRequest ++ QOSResources []QOSResourceRequest } -+// QoSResourceName is the name of a QoS resource. -+type QoSResourceName string ++// QOSResourceName is the name of a QoS resource. ++type QOSResourceName string -+// QoSResourceRequest specifies a request for one QoS resource type. -+type QoSResourceRequest struct { ++// QOSResourceRequest specifies a request for one QoS resource type. ++type QOSResourceRequest struct { + // Name of the QoS resource. -+ Name QoSResourceName ++ Name QOSResourceName + // Name of the class (inside the QoS resource type specified by Name field). + Class string +} @@ -817,28 +823,28 @@ struct. This will enable the assignment of QoS resources for containers. Also, we add a Resources field to the PodSpec, to enable assignment of pod-level QoS resources. We will re-use the existing ResourceRequirements type -but Limits and Requests and Claims must be left empty. QoSResources may be set +but Limits and Requests and Claims must be left empty. QOSResources may be set and they represent the Pod-level assignment of QoS-class resources, -corresponding the PodQoSResources message in PodSandboxConfig in the CRI +corresponding the PodQOSResources message in PodSandboxConfig in the CRI API. ```diff type PodSpec struct { @@ -3062,6 +3069,10 @@ type PodSpec struct { ResourceClaims []PodResourceClaim -+ // QoSResources specifies the Pod-level requests of QoS resources. ++ // QOSResources specifies the Pod-level requests of QoS resources. + // Container-level QoS resources may be specified in which case they + // are considered as a default for all containers within the Pod. 
-+ // +featureGate=QoSResources ++ // +featureGate=QOSResources + // +optional -+ QoSResources []PodQoSResourceRequest ++ QOSResources []PodQOSResourceRequest } -+// PodQoSResourceRequest specifies a request for one QoS resource type for a ++// PodQOSResourceRequest specifies a request for one QoS resource type for a +// Pod. -+type PodQoSResourceRequest struct { ++type PodQOSResourceRequest struct { + // Name of the QoS resource. -+ Name QoSResourceName ++ Name QOSResourceName + // Name of the class (inside the QoS resource type specified by Name field). + Class string +} @@ -846,7 +852,7 @@ API. There is an ongoing effort to add [Pod level resource limits][kep-2837] that aims at adding a pod level `Resources`. However, we propose to add a distinct -QoSResources field (with a distinct PodQoSResourceRequest type) in the PodSpec +QOSResources field (with a distinct PodQOSResourceRequest type) in the PodSpec in order to decouple dependencies between types and fields. As an example, a Pod requesting class "fast" of a (exemplary) pod-level QoS @@ -877,45 +883,43 @@ We extend NodeStatus to list available QoS-class resources on a node, This consists of the list of the available QoS-class resource types and the classes available within each of these resource types. - - ```diff type NodeStatus struct { @@ -4444,6 +4482,11 @@ type NodeStatus struct { // Status of the config assigned to the node via the dynamic Kubelet config feature. Config *NodeConfigStatus -+ // QoSResources contains information about the QoS resources that are ++ // QOSResources contains information about the QoS resources that are + // available on the node. -+ // +featureGate=QoSResources ++ // +featureGate=QOSResources + // +optional -+ QoSResources QoSResourceStatus ++ QOSResources QOSResourceStatus } ... -+// QoSResourceStatus describes QoS resources available on the node. 
-+type QoSResourceStatus struct { -+ // PodQoSResources contains the QoS resources that are available for pods ++// QOSResourceStatus describes QoS resources available on the node. ++type QOSResourceStatus struct { ++ // PodQOSResources contains the QoS resources that are available for pods + // to be assigned to. -+ PodQoSResources []QoSResourceInfo -+ // ContainerQoSResources contains the QoS resources that are available for ++ PodQOSResources []QOSResourceInfo ++ // ContainerQOSResources contains the QoS resources that are available for + // containers to be assigned to. -+ ContainerQoSResources []QoSResourceInfo ++ ContainerQOSResources []QOSResourceInfo +} -+// QoSResourceInfo contains information about one QoS resource type. -+type QoSResourceInfo struct { ++// QOSResourceInfo contains information about one QoS resource type. ++type QOSResourceInfo struct { + // Name of the resource. -+ Name QoSResourceName ++ Name QOSResourceName + // Mutable is set to true if the resource supports in-place updates. + Mutable bool + // Classes available for assignment. -+ Classes []QoSResourceClassInfo ++ Classes []QOSResourceClassInfo +} -+// QoSResourceClassInfo contains information about single class of one QoS ++// QOSResourceClassInfo contains information about single class of one QoS +// resource. -+type QoSResourceClassInfo struct { ++type QOSResourceClassInfo struct { + // Name of the class. + Name string + // Capacity is the number of maximum allowed simultaneous assignments into this class @@ -925,6 +929,10 @@ available within each of these resource types. +} ``` +<<[UNRESOLVED @thockin ]>> +Class capacity i.e. Capacity field in QOSResourceClassInfo. +<<[/UNRESOLVED]>> + #### Consts We define standard well-known QoS-class resource types in the API. These are @@ -942,13 +950,13 @@ application specific QoS implementations. 
```diff +const ( -+ // QoSResourceRdt is the name of the QoS-class resource named IntelRDT ++ // QOSResourceRdt is the name of the QoS-class resource named IntelRDT + // in the OCI runtime spec and interfaced through the resctrlfs + // pseudp-filesystem in Linux. This is a container-level reosurce. -+ QoSResourceIntelRdt = "rdt" -+ // QoSResourceBlockio is the name of the blockio QoS-class resource. ++ QOSResourceIntelRdt = "rdt" ++ // QOSResourceBlockio is the name of the blockio QoS-class resource. + // This is a container-level resource. -+ QoSResourceBlockio = "blockio" ++ QOSResourceBlockio = "blockio" +) ``` @@ -961,15 +969,15 @@ be allowed outside the "official" namespace. ### Kubelet Kubelet gets QoS-class resource assignment from the [PodSpec](#podspec) and -translates these into corresponding `QoSResources` data in the CRI API. This is +translates these into corresponding `QOSResources` data in the CRI API. This is ContainerConfig message at container creation time (CreateContainerRequest) and PodSandboxConfig at sandbox creation time (RunPodSandboxRequest). In practice, -there is no translation, just copying key-value pairse. +there is no translation, just copying key-value pairs. Kubelet will receive the information about available QoS-class resources (the types of reqources and their classes) from the runtime over the CRI API (new Resources field in [RuntimeStatus](#runtimestatus) message). The kubelet -updates the new QoSResources field in [NodeStatus](#nodestatus) accordingly, +updates the new QOSResources field in [NodeStatus](#nodestatus) accordingly, making QoS-class resources on the node visible to users and the kube-scheduler. This information is dynamic, i.e. the available QoS-class resources (or their properties) may change over time. QoS-class resource names must be unique, i.e. @@ -990,13 +998,13 @@ an update of changed QoS-class resource availability from the runtime). 
No kubelet-initiated pod eviction is implemented in the first implementation phase. -A feature gate QoSResources enables kubelet to update the QoS-class resources +A feature gate QOSResources enables kubelet to update the QoS-class resources in NodeStatus and handle QoS-class resource requests in the PodSpec. ### API server Input validation of QoS-class resource names and class names, very similar to -labels is implemented: keys (`QoSResourceName`) and values must be non-empty, +labels is implemented: keys (`QOSResourceName`) and values must be non-empty, less than 64 characters long, must start and end with an alphanumeric character and may contain only alphanumeric characters, dashes, underscores or dots (`-`, `_` or `.`). Also similar to labels, a namespace prefix (FQDN subdomain separated @@ -1342,7 +1350,7 @@ well as the [existing list] of feature gates. --> - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: QoSResources + - Feature gate name: QOSResources - Components depending on the feature gate: - Implementation Phase 1: - kubelet @@ -1800,11 +1808,11 @@ basic support in the container runtimes. #### Kubelet Kubelet would interpret the specific [pod annotations](#pod-annotations) and -translate them into corresponding `QoSResources` data in the CRI +translate them into corresponding `QOSResources` data in the CRI ContainerConfig message at container creation time (CreateContainerRequest). Pod-level QoS-class would not supported at this point (via pod annotations). -A feature gate QoSResources would enable kubelet to interpretthe specific pod +A feature gate QOSResources would enable kubelet to interpret the specific pod annotations. If the feature gate is disabled the annotations would simply be ignored by kubelet.
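The API server validation rule quoted in the hunk above (non-empty, fewer than 64 characters, alphanumeric at both ends, only alphanumerics, `-`, `_` or `.` in between) can be illustrated with a minimal Go sketch. The helper name `validateQOSName` and the constant `maxNameLen` are hypothetical and not part of the KEP or of Kubernetes; the sketch covers only the un-prefixed name part and does not attempt the namespace prefix handling, which the patch mentions but does not spell out here.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxNameLen reflects "less than 64 characters long", read here as
// at most 63 characters. This reading is an assumption.
const maxNameLen = 63

// namePattern: starts and ends with an alphanumeric character and
// contains only alphanumerics, '-', '_' or '.' in between.
var namePattern = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$`)

// validateQOSName is a hypothetical helper that checks a QoS-class
// resource name or class name (without any namespace prefix).
func validateQOSName(s string) error {
	if s == "" {
		return fmt.Errorf("must be non-empty")
	}
	if len(s) > maxNameLen {
		return fmt.Errorf("must be at most %d characters, got %d", maxNameLen, len(s))
	}
	if !namePattern.MatchString(s) {
		return fmt.Errorf("%q must start and end with an alphanumeric character and contain only alphanumerics, '-', '_' or '.'", s)
	}
	return nil
}

func main() {
	// Print the validation result for a few sample names.
	for _, s := range []string{"rdt", "blockio", "gold", "a.b_c-d", "-bad", ""} {
		fmt.Printf("%q: %v\n", s, validateQOSName(s))
	}
}
```

Because names and class names follow the same rule, one helper would be enough for both the `Name` and `Class` fields in this sketch; a real implementation would presumably reuse the existing label validation code instead.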
From 8cac4a52ae3263a70f1041057648609518debc8f Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 8 Feb 2023 14:01:40 +0200 Subject: [PATCH 63/92] KEP-3008: small update to CRI API comments Based on feedback from haircommander. --- keps/sig-node/3008-qos-class-resources/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index ea2aba00223..8016ca82978 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -689,6 +689,9 @@ support updates because of runtime limitations, yet. // resources to update or other options to use when updating the container. map<string, string> annotations = 4; + // Configuration of QoS resources. ++ // Note that UpdateContainerResourcesRequest must be atomic so that the ++ // runtime ensures that the requested update to QoS resources can be applied ++ // before e.g. updating other resources. + ContainerQOSResources qos_resources = 5; } ``` From 0daaea9a0f62d2fb38d680526f5ac02854452912 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 10 Feb 2023 09:52:50 +0200 Subject: [PATCH 64/92] KEP-3008: report back assignments in Pod status Extend PodStatus and ContainerStatus in K8s API to show effective assignments. Extend PodSandboxStatus CRI message to report back pod-level assignments. These changes allow reporting back defaults imposed by the runtime for resources for which nothing was requested by the user. Based on feedback from thockin. Also fix a mistake in example pod spec.
--- .../3008-qos-class-resources/README.md | 120 +++++++++++++----- 1 file changed, 91 insertions(+), 29 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 8016ca82978..3d750604f78 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -74,7 +74,6 @@ SIG Architecture for cross-cutting KEPs). - [Scheduler improvements](#scheduler-improvements) - [Kubelet-initiated pod eviction](#kubelet-initiated-pod-eviction) - [Default and limits](#default-and-limits) - - [Implicit defaults](#implicit-defaults) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) @@ -85,13 +84,16 @@ SIG Architecture for cross-cutting KEPs). - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [CRI API](#cri-api) + - [Implicit defaults](#implicit-defaults) - [ContainerConfig](#containerconfig) - [UpdateContainerResourcesRequest](#updatecontainerresourcesrequest) - [PodSandboxConfig](#podsandboxconfig) + - [PodSandboxStatus](#podsandboxstatus) - [ContainerStatus](#containerstatus) - [RuntimeStatus](#runtimestatus) - [Kubernetes API](#kubernetes-api) - [PodSpec](#podspec) + - [PodStatus](#podstatus) - [NodeStatus](#nodestatus) - [Consts](#consts) - [Kubelet](#kubelet) @@ -452,26 +454,6 @@ usage of container-level QoS-class resources. Not supporting Max (i.e. only supporting Default) in LimitRanges could simplify the API. -#### Implicit defaults - -If nothing is requested for a QoS-class resource, a pod/container still -implicitly belongs to some class. By design there is no such thing as being in -no QoS class. What this "unnamed" default class means or how it is handled is -considered an implementation detail of the runtime. 
However, it would be -potentially desirable for the user to be able to see what was classes the -application was implicitly assigned to (even if nothing was explicitly -requested). The first implementation phase contains no mechanism for this. - -Some considerations/questions: - -- new field in PodStatus could be used for this -- implicit default might change after runtime re-configuration -- new QoS resource types might be added e.g. because of re-configuration of the - runtime -- PodSandboxStatus and ContainerStatus messages in CRI API could be used to - communicate selected/effextive defaults to the client (kubelet) and kubelet - would update Pod status accordingly - ## Proposal This section currently covers [implementation phase 1](#phase-1) (see @@ -621,6 +603,8 @@ Summary of the proposed design details: - relays QoS-class resource requests from PodSpec to the runtime at pod startup via [ContainerConfig](#containerconfig) and [PodSandboxConfig](#podsandboxconfig) messages + - updates Pod status to reflect the assignment of QoS-class resources, based + on pod/container status received from the runtime - implements admission handler that validates the availability of QoS-class resources - pod eviction as a possible [future improvement](#kubelet-initiated-pod-eviction) - scheduler @@ -635,6 +619,26 @@ Summary of the proposed design details: The following additions to the CRI protocol are suggested. +#### Implicit defaults + +Any defaults imposed by the runtime, i.e. for resources for which the user +didn't request anything, must be reflected in the PodSandboxStatus and +ContainerStatus messages. This information (about implicit defaults) can change +during the lifetime of the Pod e.g.
because of re-configuration of the runtime: + +- new QoS resource types might be added in which case a Pod/Container is + assigned to the default class of this new resource and this information must + be reflected in PodSandboxStatus/ContainerStatus +- the default class of an existing QoS resource type might change and the + Pod/Container migrated to this class and this information must be reflected + in PodSandboxStatus/ContainerStatus + +An empty value (empty string) denotes "system default" i.e. the runtime did not +enforce any QoS for this specific type of QoS-class resource. An example is +Linux cgroup controls, where "system default" would mean that the +runtime did not enforce any changes so they will be set to the default values +determined by the system configuration. + #### ContainerConfig The `ContainerConfig` message will be supplemented with new `class_resources` @@ -722,11 +726,30 @@ assignments at sandbox creation time (`RunPodSandboxRequest`). +} ``` +#### PodSandboxStatus + +The `PodSandboxStatus` message will be extended to report back assignment of +pod-level QoS-class resources. The runtime must report back assignment of all +supported QoS-class resources, including defaults for resources for which the +client didn't request anything (see [implicit defaults](#implicit-defaults)). + +```diff +@@ -545,6 +545,8 @@ message PodSandboxStatus { + // runtime configuration used for this PodSandbox. + string runtime_handler = 9; ++ // Configuration of QoS resources. ++ PodQOSResources qos_resources = 10; + } + +``` + #### ContainerStatus The `ContainerResources` message (part of `ContainerStatus`) will be extended to report back QoS-class resource assignments of a container, similar to other -resources. +resources.
The runtime must report back assignment of all supported QoS-class +resources, including defaults for resources for which the client didn't request +anything (see [implicit defaults](#implicit-defaults)). ```diff @@ -1251,6 +1269,8 @@ message ContainerResources { @@ -868,9 +891,9 @@ kind: Pod metadata: name: qos-resource-example spec: - resources: - qosResources: - network: fast + qosResources: + - name: network + class: fast containers: - name: cnt image: nginx @@ -880,6 +903,41 @@ spec: class: gold ``` +#### PodStatus + +PodStatus and ContainerStatus types are extended to include information about +QoS-class resource assignments. The main motivation for this change is to +communicate to the user "implicit defaults", i.e. report back what was assigned +for QoS-class resources for which nothing was requested (see +[implicit defaults](#implicit-defaults)). + +```diff +type PodStatus struct { +@@ -3534,6 +3537,10 @@ type PodStatus struct { + // Status for any ephemeral containers that have run in this pod. + // +optional + EphemeralContainerStatuses []ContainerStatus ++ ++ // QOSResources shows the assignment of pod-level QoS resources. ++ // +optional ++ QOSResources []PodQOSResourceRequest + } +``` + +```diff +type ContainerStatus struct { +@@ -2437,6 +2437,9 @@ type ContainerStatus struct { + Started *bool ++ ++ // QOSResources shows the assignment of QoS resources. ++ // +optional ++ QOSResources []QOSResourceRequest + } +``` + +Status will re-use the PodQOSResourceRequest and QOSResourceRequest types that +are specified as part of the [PodSpec](#podspec) update. + #### NodeStatus We extend NodeStatus to list available QoS-class resources on a node, This @@ -932,10 +990,6 @@ available within each of these resource types. +} ``` -<<[UNRESOLVED @thockin ]>> -Class capacity i.e. Capacity field in QOSResourceClassInfo. -<<[/UNRESOLVED]>> - #### Consts We define standard well-known QoS-class resource types in the API. 
These are @@ -1041,6 +1041,14 @@ properties) may change over time.
QoS-class resource names must be unique, i.e. kubelet will refuse to register pod-level and container-level QoS-class resource with the same name. +Kubelet receives information about actual QoS-class resource assignment of pods +and containers from the runtime over the CRI API +([PodSandboxStatus](#podsandboxstatus) and [ContainerStatus](#containerstatus) +messages). Kubelet updates [PodStatus](#podstatus) accordingly. This is +especially to communicate defaults applied by the runtime back to the user +(see [implicit defaults](#implicit-defaults)). Note that this information may +change over the lifetime of the pod. + An admission handler is added into kubelet to validate the QoS-class resource request against the resource availability on the node. Pod is rejected if sufficient resources do not exist. This also applies to static pods. From 5d8e79e1547da7cf58a8f73fc4312b86734899d0 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 10 Feb 2023 13:21:37 +0200 Subject: [PATCH 65/92] KEP-3008: add Pod QoS class to the main proposal Put it in proposal to gather feedback. Mark it as unresolved. --- .../3008-qos-class-resources/README.md | 47 ++++++++++++------- 1 file changed, 31 insertions(+), 16 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 3d750604f78..948b35e1c56 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -96,13 +96,13 @@ SIG Architecture for cross-cutting KEPs). 
- [PodStatus](#podstatus) - [NodeStatus](#nodestatus) - [Consts](#consts) + - [Pod QoS class](#pod-qos-class) - [Kubelet](#kubelet) - [API server](#api-server) - [Scheduler](#scheduler) - [Kubectl](#kubectl) - [Container runtimes](#container-runtimes) - [Open Questions](#open-questions) - - [Pod QoS class](#pod-qos-class) - [Other Kubernetes-managed QoS-class resources](#other-kubernetes-managed-qos-class-resources) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) @@ -1023,6 +1023,35 @@ Also, kubelet will reject the registration of unknown QoS-class resources in the "official" namespace. Custom/vendor-specific QoS-class resources will still be allowed outside the "official" namespace. +### Pod QoS class + +<<[UNRESOLVED]>> +The [Pod QoS class][pod-qos-class] will be communicated to the container +runtime as a special Kubernetes-specific QoS-class resource. + +Information about Pod QoS class is currently internal to kubelet and not +visible to the container runtime. However, container runtimes (CRI-O, at least) +already depend on this information and currently determine it +indirectly by evaluating other CRI parameters. + +This change makes the information about Pod QoS explicit and allows elimination +of unreliable code paths in runtimes, for example. + +Pod QoS class will not be advertised as one of the available QoS-class resources +in NodeStatus. Also, users are not allowed to request it in the PodSpec, which +will be enforced by admission checks in the api-server and kubelet. + +```diff ++ // Kubernetes-managed QoS resources. These are only informational to the runtime ++ // and no CRI requests should fail because of these. ++const ( ++ // QOSResourcePodQOS is the Kubernetes Pod Quality of Service Class. ++ // Possible values are "Guaranteed", "Burstable" and "BestEffort".
++ QOSResourcePodQOS = "pod-qos" ++) +``` +<<[/UNRESOLVED]>> + ### Kubelet Kubelet gets QoS-class resource assignment from the [PodSpec](#podspec) and @@ -1145,21 +1174,6 @@ Container runtimes will be updated to support the ### Open Questions -#### Pod QoS class - -The Pod QoS class could be communicated to the container runtime as a QoS-class -resource, too. This information is currently internal to kubelet. However, -container runtimes (CRI-O, at least) are already depending on this information -and currently determining it indirectly by evaluating other CRI parameters. It -would be better to explicitly state the Pod QoS class and QoS-class resources would -look like a logical place for that. This also makes it techically possible to -have container-specific QoS classes (as a possible future enhancement of K8s). - -Making this change, it would also be possible to separate `oom_score_adj` from -the pod qos class in the future. The runtime could provide a set of OOM -classes, making it possible for the user to specify a burstable pod with low -oom priority (low chance of being killed). - #### Other Kubernetes-managed QoS-class resources In addition to the Pod QoS class it would be possible to specify other @@ -1970,3 +1984,4 @@ required. 
[linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html [kep-2837]: https://github.com/kubernetes/enhancements/pull/1592 [kep-2570]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos +[pod-qos-class]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/ From a9c33f53432c665108ab683ccbcc188f76b3aa1d Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 10 Feb 2023 14:19:34 +0200 Subject: [PATCH 66/92] KEP-3008: fill in version skew strategy --- .../3008-qos-class-resources/README.md | 23 +++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 948b35e1c56..744d9ad933f 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -1382,6 +1382,29 @@ enhancement: CRI or CNI may require updating that component before the kubelet. --> +It is possible to run control plane and worker nodes with mismatched versions. +However, these scenarios should be handled by the API(s). + +If control plane has the feature enabled but some nodes have it disabled, those +nodes are simply seen as having no QoS-class resources available. + +If control plane has the feature disabled but some nodes have it enabled the +nodes are simply unable to advertise the QoS-class resources available on the +nodes to the control plane. One potential gap is that users have no visibility +to implicit defaults imposed by the runtime (see +[implicit defaults](#implicit-defaults)). + +If a worker node has a container runtime with QoS-class resources enabled but +the kubelet on the node has the feature disabled the node is not able to +advertise the resources to the control plane, making them effectively +unavailable to the users. Also in this case users have no visibility to +implicit defaults imposed by the runtime (see +[implicit defaults](#implicit-defaults)). 
+ +If a worker node is running a container runtime that does not support QoS-class +resources the node is simply seen in the Kubernetes API as one having no +QoS-class resources available. + ## Production Readiness Review Questionnaire +The identified risks are mainly related to usability or unfriendly users +hogging all available high-priority QoS. + - User assigning container to “unauthorized” class, causing interference and access to unwanted set/amount of resources. This will be addressed in future KEP introducing permission controls. @@ -565,6 +568,29 @@ Consider including folks who also work outside the SIG or subproject. are pod-level resources and which container-level) but there is still possibility to mix-up vendor-specific QoS-class resources, although the risk for this is relatively small (severe misconfiguration) +- A node can only serve a limited number of users of high-priority QoS. This + is mitigated by the Capacity attribute of classes, limiting the number of + simultaneous users of a class. This information is per-node and it is + determined by the system and/or runtime configuration of the node, outside + Kubernetes. It is the responsibility of the node administrator to configure + meaningful capacity limits for QoS-class resources that require it. +- User does not request any QoS but the runtime applies some defaults on the + application, limiting service level. This is mitigated by the user being able + to see the effective assignments in Pod status. +- User accidentally specifies pod-level QoS-class resources for a container. + For "official" well-known QoS-class resources (see [Consts](#consts)) this + will be mitigated by admission check, rejecting the Pod. For vendor-specific + QoS-class resources the Pod will stay in pending state, with Pod status + indicating that no nodes are available with a message describing the detailed + reason of unavailable container-level QoS-class resource type.
+- User mistakenly specifies a scontainer-level QoS-class resources as a + pod-level request. This is not regarded as an error or misconfiguration + but the request is regarded as a pod-wide default for all containers. + However, an unintended consequence may be that some containers are run + at on unintended QoS level or the pod becomes unschedulable because no nodes + have enough capacity to satisfy a high-priority QoS-class for all containers. + In the latter case the Pod stays in pending state with Pod status + describing the details of insufficient QoS-class resources. ## Design Details From a1c14ee7f11a8216475182fc8f29fa05b1649b68 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 10 Feb 2023 16:43:32 +0200 Subject: [PATCH 68/92] KEP-3008: update user stories Also adds some wild speculation on possible usages. --- .../3008-qos-class-resources/README.md | 193 +++++++++++++++--- 1 file changed, 161 insertions(+), 32 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 27194b59248..e2ed2c40c75 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -76,10 +76,13 @@ SIG Architecture for cross-cutting KEPs). 
- [Default and limits](#default-and-limits) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) - - [Story 1](#story-1) - - [Story 2](#story-2) - - [Story 3](#story-3) - - [Story 4](#story-4) + - [Mitigating noisy neighbors](#mitigating-noisy-neighbors) + - [Vendor-specific QoS](#vendor-specific-qos) + - [Defaults and limits](#defaults-and-limits) + - [Possible future scenarios](#possible-future-scenarios) + - [Kubernetes-managed QoS-class resources](#kubernetes-managed-qos-class-resources) + - [Container-level memory QoS](#container-level-memory-qos) + - [Runtime classes](#runtime-classes) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) @@ -102,8 +105,6 @@ SIG Architecture for cross-cutting KEPs). - [Scheduler](#scheduler) - [Kubectl](#kubectl) - [Container runtimes](#container-runtimes) - - [Open Questions](#open-questions) - - [Other Kubernetes-managed QoS-class resources](#other-kubernetes-managed-qos-class-resources) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -507,26 +508,168 @@ kube-apiserver, kube-scheduler) to support QoS-class resources. ### User Stories (Optional) -#### Story 1 +#### Mitigating noisy neighbors As a user I want to minimize the interference of other applications to my -workload by assigning it to a class with exclusive cache allocation. +main application by assigning it to a class with exclusive cache allocation. I +also want to boost it's local disk I/O priority over other containers within +the pod. At the same time, I want to make sure my low-priority, I/O-intensive +background task running in a side-car container does not disturb the main +application. 
-#### Story 2 +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: qos-resource-example +spec: + containers: + - name: main-app + image: my-app + resources: + qosResources: + - name: rdt + class: exclusive + - name: blockio + class: high-prio + - name: sidecar + image: my-sidecar + resources: + qosResources: + - name: rdt + class: limited + - name: blockio + class: throttled + ... +``` + +#### Vendor-specific QoS + +As a vendor I want to implement custom QoS controls as an extension of the +container runtime. I want my QoS control to be visible in the cluster and +integrated e.g. in the Kubernetes sheduler and not rely e.g. on Pod annotations +to communicate QoS requests. + +#### Defaults and limits + +As a cluster administrator I want to set control what is the default QoS of +applications. I also want to configure different defaults for certain +Kubernetes namespaces. In addition I want to limit access to high-priority QoS +on certain namespaces. + +I set per-node defaults for in the node's system-level and runtime +configuration, outside Kubernetes. 
+ +I set per-namespace defaults using LimitRanges: + +```yaml +apiVersion: v1 +kind: LimitRange +metadata: + name: qos-defaults +spec: + limits: + - type: Pod + qosResources: + - name: network + default: slow + - type: Container + qosResources: + - name: qos-foo + default: low-prio + - name: vendor.example/xyz-qos + default: normal +``` + +I set per-namespace constraints on the overall usage of QoS-class resources: + +```yaml +apiVersion: v1 +kind: ResourceQuota +metadata: + name: qos-constraints +spec: + qosResources: + pod: + - name: network + classes: + - name: slow # unlimited access + - name: normal + capacity: 2 # only two pods with this is allowed + #- name: high # class "high" is not allowed at all + container: + - name: qos-foo + classes: + - name: low-prio + - name: normal-prioa + #- name: high-prio # high-prio is not allowed at all + - name: vendor.example/xyz-qos + classes: + - name: low-prio + - name: normal + - name: high-prio + capacity: 1 # high-prio is allowed but with very limited usage +``` + +#### Possible future scenarios + +This section speculates on possible future uses of the QoS-class resources +mechanism. + +##### Kubernetes-managed QoS-class resources + +It would be possible to have QoS-class resources that would be managed by +Kubernetes/kubelet instead of the container runtime. If we specify and manage +official well-known QoS-class resource names in the API it would be possible to +specify Kubernetes-internal names that the container runtime would know to +ignore (or not try to manage itself). + +One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 +memory knobs in linux (see +[KEP-2570: Support Memory QoS with cgroups v2][kep-2570]. -As a user I want to make sure my low-priority, I/O-intensive background task -will not disturb more important workloads running on the same node. 
+##### Container-level memory QoS -#### Story 3 +Container runtimes could implement an admin-configurable support for memory +QoS, as an alternative to +[KEP-2570: Support Memory QoS with cgroups v2][kep-2570]. + +It would be easy to enable also for example `memory.swap.*` control knobs +available in Linux cgroups v2 to control swap-usage on a per-container basis. + +##### Runtime classes + +The QoS-class resources mechanism could be "abused" to replace (or +re-implement) the current runtime classes. + +One benefit of using a QoS-class resource to represent runtime classes would be +the "automatic" visibility to what runtime classes are available on each node. +E.g. adding a new runtime class on a node by re-configuration of the runtime +would be immediately reflected in the node status without any additional admin +tasks (like node labeling or manual creation of RuntimeClass objects). + +###### Splitting Pod QoS Class + +Currently the Pod QoS class is an implicit property of a Pod, tied to how the +resource requests and limits of its containers are specified. QoS-class +resources would make it possible for users to explicitly specify the Pod QoS +class in the PodSpec, making it possible e.g. to create a guaranteed pod with a +container whose memory limit is higher than the requests. -As a cluster administrator I want to throttle I/O bandwidths of certain -DaemonSets, and I want that exact throttling values depend on the SSD model in -my heterogenous cluster. +Taking this idea further, QoS-class resources would also make it possible to +split several properties implicit in the Pod QoS class (like eviction behavior +or OOM scoring) into separate properties. For example, in the case of OOM +scoring make it possible for the user to specify a burstable pod with low OOM +priority (low chance of being killed). -#### Story 4 +###### Pod priority class -As a user I want to assign a low priority task into an (RDT) class that limits -the available memory bandwidth. 
+One wild idea would be to implement pod priority class mechanism (or part of +it) as a QoS-class resource. This would be a +[Kubernetes-managed](#kubernetes-managed-qos-class-resources) pod-level +QoS-class resource. Likely benefits of using the QoS-class resources mechanism +would be to be able to set per-namespace defaults with LimitRanges and allow +permission-control to high-priority classes with ResourceQuotas. ### Notes/Constraints/Caveats (Optional) @@ -1198,20 +1341,6 @@ done via OCI. User interface is provided through pod and container annotations. Container runtimes will be updated to support the [CRI API extensions](#cri-api) -### Open Questions - -#### Other Kubernetes-managed QoS-class resources - -In addition to the Pod QoS class it would be possible to specify other -QoS-class resources that would be managed by Kubernetes/kubelet. If we specify -and manage official well-known QoS-class resource names in the API it would be -possible to specify Kubernetes-internal names that the container runtime would -know to ignore (or not try to manage itself). - -One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 -memory knobs in linux (see -[KEP-2570: Support Memory QoS with cgroups v2][kep-2570]. - ### Test Plan -Implementation phase 1: Unit test will be added to kubelet to test that -inspection of [pod annotations](#pod-annotations) is correctly disabled/enabled -with the feature gate. +Kubelet unit tests are extended to verify that no QoS-class resource +assignments are correctly passed down to the CRI API, also verifying that +assignments are not passed down if the feature gate is disabled. -Future implementation phases: unit tests for handling the changes in pod spec -are implemented. +Apiserver unit tests are extended to verify that the new fields in PodSpec are +preserved over updates of the Pod object, even when the feature gate is +disabled. ### Rollout, Upgrade and Rollback Planning @@ -1799,9 +1787,13 @@ rollout. 
Similarly, consider large clusters and how enablement/disablement will rollout across nodes. --> -Implementation Phase 1: we rely on inspection of pod annotations inside kubelet -which should make rollout/rollback failure-safe. Already running workloads are -not affected. +Implementation Phase 1: Already running workloads should not be affected as the +QoS-class resources feature operates on new fields in the PodSpec. Bugs in +kubelet might cause containers to fail to start, either by failing a pod admission +check that should pass or passing incorrect parameters to the container +runtime. Bugs in kube-scheduler might leave Pods in pending state (even though +they could be run in some node) or schedule them on an incorrect node, causing +kubelet to deny running the pod. Future implementation phases: TBD. @@ -1812,9 +1804,10 @@ What signals should users be paying attention to when the feature is young that might indicate a serious problem? --> -Implementation Phase 1: watch for non-ready pods with CreateContainerError +Implementation Phase 1: Watch for non-ready pods with CreateContainerError status. The error message will indicate the if the failure is related to -QoS-class resources. +QoS-class resources. Generally, pod events would be a good source for +determining if problems are related to QoS-class resources feature. Future implementation phases: TBD. @@ -1857,9 +1850,7 @@ checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose. --> -Implementation Phase 1: by examining pod annotations. - -Future implementation phases: by examining the new fields in pod spec. +By examining the new fields in pod spec. ###### How can someone using this feature know that it is working for their instance? @@ -1996,11 +1987,11 @@ Describe them, providing: - Supported number of objects per namespace (for namespace-scoped objects) --> -Implementation Phase 1: No. +Implementation Phase 1: No. 
QoS-class resources do extend existing API types +but presumably not introduce new types of objects. -Future implementation phases: QoS-class resources do extend existing API types -but presumably not introduce new types of objects. However, the design for -resource discovery and permission control is not ready which might change this. +Future implementation phases: the design for resource discovery and permission +control is not ready which might change this. ###### Will enabling / using this feature result in any new calls to the cloud provider? @@ -2022,16 +2013,20 @@ Describe them, providing: - Estimated amount of new objects: (e.g., new Object X for every existing Pod) --> -Implementation Phase 1: [pod annotations](#pod-annotations) are used as the -initial user interface so assign QoS-class resources to containers. Exact size -of each annotation varies (depending on the type of resource) but the -annotation key is expected to be few tens of bytes. The value part is the name -of the class expected to be a few bytes long. +New fields in NodeStatus will slightly increase the size of Node objects if +QoS-class resources are present. This will consist of a few bytes per each type +of QoS-class resource (mostly from the name field) plus a few bytes per each +class (mostly from the name field). + +New fields in PodSpec will increase the size of Pod objects by a few bytes per +QoS requested if QoS-class resources are requested, The increase in size +basically consists of the name of the resource and name of the class. +Similarly, new fields in PodStatus will increase its size by a few bytes per +each type of QoS-class resource. -Future implementations: New fields in the pod spec will increase the size of -`Pod` objects by a few bytes per class requested. New fields will be added to -NodeStatus which will increase its size. New field will be added to -ResourceQuotaSpec increasing its size. 
+Future implementation phases: extensions to ResourceQuota and/or LimitRanges +objects will increase their sizes if limits on QoS-class resources are +specified by the user. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? @@ -2167,7 +2162,7 @@ ensure that the annotations always reflect the actual assignment of QoS-class resources of a Pod. It also would serve as part of the UX to indicate the in-place updates of the resources via annotations is not supported. -#### Class capacity with extended resources +### Class capacity with extended resources Support for class capacity could be left out of the proposal to simplify the concept. It would be possible to implement "class capacity" by leveraging From f8bdc56cafa607850d22702173aa7a6756265f4a Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 16 May 2023 13:39:46 +0300 Subject: [PATCH 76/92] KEP-3008: refine the validation of names Refine the details of validation of QoS-resource names and class names. --- keps/sig-node/3008-qos-class-resources/README.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 04b08acf612..210061c9d64 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -1348,12 +1348,14 @@ in NodeStatus and handle QoS-class resource requests in the PodSpec. ### API server -Input validation of QoS-class resource names and class names, very similar to -labels is implemented: keys (`QOSResourceName`) and values must be non-empty, -less than 64 characters long, must start and end with an alphanumeric character -and may contain only alphanumeric characters, dashes, underscores or dots (`-`, -`_` or `.`). Also similar to labels, a namespace prefix (FQDN subdomain separated -with a slash) in the key is allowed (e.g. 
`vendor.example/qos-resource`). +Input validation of QoS-class resource names and class names is implemented. +They must be "qualified Kubernetes names", i.e. a name part optionally prefixed +by a DNS subdomain (and a slash). The name part must be non-empty, less than 64 +characters long, must start and end with an alphanumeric character and may +contain only alphanumeric characters, dashes, underscores or dots (`-`, +`_` or `.`). The optional DNS subdomain part must not be more than 253 +characters long, must start and end with an alphanumeric character and may only +contain lowercase alphanumeric characters, dashes (`-`) and dots (`.`). Official canonical names for well-known QoS-class resources are specified in the API. In later implementation phases admission (validation) for their usage From ebfc3149dd6ee40124a76fb1f7530ac37851c025 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 8 Jun 2023 15:49:39 +0300 Subject: [PATCH 77/92] KEP-3008: change status to implementable --- keps/sig-node/3008-qos-class-resources/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index a167c95c3ef..27edab1021e 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -4,7 +4,7 @@ authors: - "@marquiz" owning-sig: sig-node participating-sigs: [] -status: provisional +status: implementable creation-date: 2021-10-07 reviewers: - "@rphillips" From 084a0a72fa3291c2229b94ccf62fd64f6d6c331d Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 15 Jun 2023 21:42:19 +0300 Subject: [PATCH 78/92] KEP-3008: update - fix typos - small update to future work/kubelet-initiated pod eviction - move "passing down pod qos class" to future work, out of the main proposal --- .../3008-qos-class-resources/README.md | 76 +++++++++---------- 1 file changed, 37 insertions(+), 39 deletions(-) diff --git 
a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 210061c9d64..38223b11892 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -81,6 +81,7 @@ SIG Architecture for cross-cutting KEPs). - [Vendor-specific QoS](#vendor-specific-qos) - [Defaults and limits](#defaults-and-limits) - [Possible future scenarios](#possible-future-scenarios) + - [Pass on Pod QoS class to the runtime](#pass-on-pod-qos-class-to-the-runtime) - [Kubernetes-managed QoS-class resources](#kubernetes-managed-qos-class-resources) - [Container-level memory QoS](#container-level-memory-qos) - [Runtime classes](#runtime-classes) @@ -100,7 +101,6 @@ SIG Architecture for cross-cutting KEPs). - [PodStatus](#podstatus) - [NodeStatus](#nodestatus) - [Consts](#consts) - - [Pod QoS class](#pod-qos-class) - [Kubelet](#kubelet) - [API server](#api-server) - [Scheduler](#scheduler) @@ -415,12 +415,12 @@ of QoS-class resources on the nodes. QoS-class resources available on a node are dynamic in the sense that they may change over the lifetime of the node. E.g. re-configuration of the container runtime may make new types of QoS-class resources available, properties of -existing resources may changes (e.g. the set of available classes) or some -resources might be removed completely. It might be desirable that kubelet could -evicts running pod that request QoS-class resources that are no more available -on the node. This should be relatively straightforward to implement as kubelet -knows what QoS-class resources are available on the node and also monitors all -running pods. +existing resources may changes (e.g. the set of available classes or their +capacity) or some resources might be removed completely. It might be desirable +that kubelet could evict a running pod that request QoS-class resources that +are no more available on the node. 
This should be relatively straightforward to +implement as kubelet knows what QoS-class resources are available on the node +and also monitors all running pods. #### Default and limits @@ -639,6 +639,35 @@ spec: This section speculates on possible future uses of the QoS-class resources mechanism. +##### Pass on Pod QoS class to the runtime + +The [Pod QoS class][pod-qos-class] could be communicated to the container +runtime as a special Kubernetes-specific QoS-class resource. + +Information about Pod QoS class is currently internal to kubelet and not +visible to the container runtime. However, container runtimes (CRI-O, at least) +are already depending on this information and currently determining it +indirectly by evaluating other CRI parameters. + +This change would make the information about Pod QoS explicit and would allow +elimination of unreliable code paths in runtimes, for example. + +Pod QoS class would not be advertised as one of the available QoS-class +resources in NodeStatus. Also, users would not be allowed to request it in the +PodSpec which would be enforced by admission checks in the api-server and +kubelet. In other words, this would be purely informational, aimed for the +container runtime. + +```diff ++ // Kubernetes-managed QoS resources. These are only informational to the runtime ++ // and no CRI requests should fail because of these. ++const ( ++ // QOSResourcePodQOS is the Kubernetes Pod Quality of Service Class. ++ // Possible values are "Guaranteed", "Burstable" and "BestEffort". ++ QOSResourcePodQOS = "pod-qos" ++) +``` + ##### Kubernetes-managed QoS-class resources It would be possible to have QoS-class resources that would be managed by @@ -750,7 +779,7 @@ hogging all available high-priority QoS. QoS-class resources the Pod will stay in pending state, with Pod status indicating that no nodes are available with a message describing the detailed reason of unavailable container-level QoS-class resource type. 
-- User mistakenly specifies a scontainer-level QoS-class resources as a +- User mistakenly specifies a container-level QoS-class resources as a pod-level request. This is not regarded as an error or misconfiguration but the request is regarded as a pod-wide default for all containers. However, an unintended consequence may be that some containers are run @@ -1272,37 +1301,6 @@ Also, kubelet will reject the registration of unknown QoS-class resources in the "official" namespace. Custom/vendor-specific QoS-class resources will still be allowed outside the "official" namespace. -### Pod QoS class - -`<<[UNRESOLVED]>>` - -The [Pod QoS class][pod-qos-class] will be communicated to the container -runtime as a special Kubernetes-specific QoS-class resource. - -Information about Pod QoS class is currently internal to kubelet and not -visible to the container runtime. However, container runtimes (CRI-O, at least) -are already depending on this information and currently determining it -indirectly by evaluating other CRI parameters. - -This change makes the information about Pod QoS explicit and allows elimination -of unreliable code paths in runtimes, for example. - -Pod QoS class will not advertised as one of the available QoS-class resources -in NodeStatus. Also, users are not allowed to request it in the PodSpec which -will be enforced by admission checks in the api-server and kubelet. - -```diff -+ // Kubernetes-managed QoS resources. These are only informational to the runtime -+ // and no CRI requests should fail because of these. -+const ( -+ // QOSResourcePodQOS is the Kubernetes Pod Quality of Service Class. -+ // Possible values are "Guaranteed", "Burstable" and "BestEffort". 
-+ QOSResourcePodQOS = "pod-qos" -+) -``` - -`<<[/UNRESOLVED]>>` - ### Kubelet Kubelet gets QoS-class resource assignment from the [PodSpec](#podspec) and From 4c3638ca9dd6317348b9cc220cbd8feef5011483 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 16 Jun 2023 15:07:19 +0300 Subject: [PATCH 79/92] KEP-3008: update kep template and test plan --- .../3008-qos-class-resources/README.md | 35 +++++++++++++++---- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 38223b11892..f402ce469bf 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -158,10 +158,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release* - [ ] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free - [ ] (R) Graduation criteria is in place - - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance 
Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Production readiness review completed - [ ] (R) Production readiness review approved - [ ] "Implementation History" section is up-to-date for milestone @@ -1447,6 +1447,8 @@ Based on reviewers feedback describe what additional tests need to be added prio implementing this enhancement to ensure the enhancements have also solid foundations. --> +No prerequisites have been identified. + ##### Unit tests -- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `2022-06-13` - `66.8%` -- `k8s.io/kubernetes/pkg/apis/core/validation/validation.go`: `2022-06-13` - `82.1%` -- `k8s.io/kubernetes/pkg/scheduler` +- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `2023-06-16` - `66.§%` +- `k8s.io/kubernetes/pkg/apis/core/validation`: `2023-06-16` - `83.5%` +- `k8s.io/kubernetes/pkg/scheduler/framework`: `2023-06-16` - `77.9%` ##### Integration tests @@ -1709,7 +1711,7 @@ well as the [existing list] of feature gates. - Will enabling / disabling the feature require downtime of the control plane? - Will enabling / disabling the feature require downtime or reprovisioning - of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + of a node? ###### Does enabling the feature change any default behavior? @@ -2055,6 +2057,27 @@ This through this both in small and large cases, again with respect to the No, this is not expected. +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +Not on Kubernetes side. That should not happen as the feature is about API +changes and related changes in the Kubernetes components, requiring no new +resources (except for a tiny amount of cpu and memory becauxe of the added +code). + +However, a severely buggy implementation on the CRI runtime side might cause +resource exhaustion (of basically any kind). 
In this case the only plausible +mitigation on Kubernetes side is to disable the feature. + ### Troubleshooting +### Dynamic Resource Allocation (DRA) + +[DRA][dra-kep] provides an API for requesting and sharing resources between +pods and containers. + +DRA is designed for allocating resources, not expressing QoS. Or put +differently, designed for quantitative rather than qualitative resources. +Specifically, it is targeting accelerator devices with potentially complex +parameterization and lifecycle state management for both workloads and the +devices themselves. All of this brings complexity and overhead that is not an +issue with the intended usage where only a small fraction of workloads request +devices (via DRA). However, this becomes significant when scaling to say +hundreds of nodes with hundreds of pods per node, all of which are potentially +requesting (e.g. via defaults) multiple classes of multiple types of QoS. + +The QoS-class resources mechanism if following the existing conventions and +design patterns (with high code re-use in the implementation) of existing +native and extended resources. This includes support for setting defaults and +usage restrictions via existing LimitRanges and ResourceQuota mechanisms which +would be non-trivial and costly to implement with DRA on per-QoS-class level. + ### Pod annotations Instead of updating CRI and Kubernetes API in lock-step, the API changes could @@ -2271,3 +2303,4 @@ required. 
[kep-2570]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos [oci-runtime-rdt]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config-linux.md#IntelRdt [pod-qos-class]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/ +[dra-kep]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation From 76d4acdf6b0f3a616cd1325db9ba9ebc70712391 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 21 Sep 2023 15:42:11 +0300 Subject: [PATCH 82/92] KEP-3008: update user stories - add explicit Pod QoS request from the user as a possible usage for K8s-managed QoS. - reword the ulimit usage scenario --- keps/sig-node/3008-qos-class-resources/README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index d07403e2172..2a100807e58 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -695,6 +695,10 @@ One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 memory knobs in linux (see [KEP-2570: Support Memory QoS with cgroups v2][kep-2570]. +Another possible usage could be to allow the user to explicitly specify the +desired Pod QoS class of the application (instead of implicitly deriving it +from the resource requests/limits). + ##### Container-level memory QoS Container runtimes could implement an admin-configurable support for memory @@ -2175,7 +2179,7 @@ devices (via DRA). However, this becomes significant when scaling to say hundreds of nodes with hundreds of pods per node, all of which are potentially requesting (e.g. via defaults) multiple classes of multiple types of QoS. 
-The QoS-class resources mechanism if following the existing conventions and +The QoS-class resources mechanism follows the existing conventions and design patterns (with high code re-use in the implementation) of existing native and extended resources. This includes support for setting defaults and usage restrictions via existing LimitRanges and ResourceQuota mechanisms which From 58bc55126e9026c122f9d93554b8ffb29a4801d2 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 26 Sep 2023 14:49:50 +0300 Subject: [PATCH 83/92] KEP-3008: update ulimit user story --- keps/sig-node/3008-qos-class-resources/README.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 2a100807e58..dda7abb39d2 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -642,12 +642,16 @@ spec: #### Set ulimits -As a cluster administrator I want to enable per-application control of -Linux/UNIX ulimits. However, I don't want to give the users full control of the -parameters but specify pre-defined classes of limits. - -One possible way to implement the this would be using -[NRI](https://github.com/containerd/nri). +As a cluster administrator I want to enable per-application control of system +resource usage (such as maximum number of processes or open files) exposed as +ulimits/rlimits in a Linux/UNIX system. However, I don't want to give the users +full control of the exact numeric parameters but specify pre-defined classes of +limits for different workload profiles. + +Currently, ulimits can be set via an +[NRI plugin](https://github.com/containerd/nri/tree/main/plugins/ulimit-adjuster) +but that relies on Pod annotations. Using QoS-class resources would give a better +user interface and control to the cluster administrator. 
#### Possible future scenarios From d71e8c353a576f7f6606b513c90ad77d49b67ef0 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 3 Oct 2023 11:00:45 +0300 Subject: [PATCH 84/92] KEP-3008: address review comments from logicalhan - typo and text formatting fixes - Capacity vs. Ceiling in QOSResourceClassLimit marked as unresolved - move admission/enforcement of QoS resources in "official" ns from GA to Beta - PRR: added kube-scheduler to the list of unit tests to update --- .../3008-qos-class-resources/README.md | 31 ++++++++++++------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index dda7abb39d2..5a16a8e514b 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -219,7 +219,7 @@ the Kubernetes resource model with a new type of resources, i.e. QoS-class resources. This KEP identifies two technologies that can immediately be enabled with -QoS-class resources. However, these are just two examples and the proposed +QoS-class resources. However, these are just two examples and the proposed changes are generic (and not tied to these two QoS-class resource types in any way), making it easier to implement new QoS-class resource types. @@ -370,7 +370,7 @@ would implement restrictions based on the namespace. + // Pod contains the allowed QoS resources for pods. + // +optional + Pod []AllowedQOSResource -+ // Container contains the allowed QoS resources for pods. ++ // Container contains the allowed QoS resources for containers. + // +optional + Container []AllowedQOSResource +} @@ -468,6 +468,12 @@ usage of container-level QoS-class resources. } ``` +`<<[UNRESOLVED @logicalhan]>>` + +Use field name `Ceiling` instead `Capacity` in QOSResourceClassLimit. + +`<<[/UNRESOLVED]>>` + Not supporting Max (i.e. only supporting Default) in LimitRanges could simplify the API. 
@@ -1318,9 +1324,9 @@ application specific QoS implementations. +) ``` -In later implementation phases (GA) admission control (validation) is added to +In later implementation phases (Beta) admission control (validation) is added to reject requests for unknown QoS-class resources in the "official" namespace. -Also, kubelet will reject the registration of unknown QoS-class resources in +Also (in Beta), kubelet will reject the registration of unknown QoS-class resources in the "official" namespace. Custom/vendor-specific QoS-class resources will still be allowed outside the "official" namespace. @@ -1606,6 +1612,10 @@ in back-to-back releases. - Extend CRI API to support updating sandbox-level QoS-class resources - Permission control (ResourceQuota etc) - Well-defined behavior with [In-place pod vertical scaling](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources) +- Validation of QoS-class resources in the "official" namespace: + - enforce admission control on "official" QoS-class resource names + - kubelet rejects registration of unknown QoS-class resorce names in the + "official" namespace - Integration with RuntimeClasses - Additional tests are in Testgrid and linked in KEP - User documentation is available @@ -1614,10 +1624,6 @@ in back-to-back releases. - More rigorous forms of testing—e.g., downgrade tests and scalability tests - Allowing time for feedback -- In addition to beta: - - enforce admission control on "official" QoS-class resource names - - kubelet rejects registration of unknown QoS-class resorce names in the - "official" namespace ### Upgrade / Downgrade Strategy @@ -1794,6 +1800,9 @@ Apiserver unit tests are extended to verify that the new fields in PodSpec are preserved over updates of the Pod object, even when the feature gate is disabled. +Scheduler unit tests are extended to verify that QoS-class resources are +correctly taken into account in node fitting. 
+ ### Rollout, Upgrade and Rollback Planning -Implementation Phase 1: Already running workloads ahouls not be affected as the +Implementation Phase 1: Already running workloads should not be affected as the QoS-class resources feature operates on new fields in the PodSpec. Bugs in kubelet might cause containers fail to start, either by failing a pod admission check that should pass or passing incorrect parameters to the container @@ -2216,7 +2225,7 @@ translate them into corresponding `QOSResources` data in the CRI ContainerConfig message at container creation time (CreateContainerRequest). Pod-level QoS-class would not supported at this point (via pod annotations). -A feature gate `TranslateQoSPodMetadata` would enable kubelet to interpretthe +A feature gate `TranslateQoSPodMetadata` would enable kubelet to interpret the specific pod annotations. If the feature gate is disabled the annotations would simply be ignored by kubelet. @@ -2235,7 +2244,7 @@ Support for class capacity could be left out of the proposal to simplify the concept. It would be possible to implement "class capacity" by leveraging extended resources and mutating admission webhooks: -1. An extended resource with the desited capacity is created for each class +1. An extended resource with the desired capacity is created for each class which needs to be controlled. 
A possible example: ```plaintext Allocatable: From bf6c498f98b0c16945f62365349b3a89284ada52 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 4 Oct 2023 09:58:49 +0300 Subject: [PATCH 85/92] KEP-3008: response to review feedback from logicalhan - PRR: state that existing metrics for kube-scheduler and kubelet can be used - alternatives: mention validating admission policies - kep.yaml: update milestones - kep.yaml: update feature gate name to QOSResources, to be in sync with that used in README.md --- keps/sig-node/3008-qos-class-resources/README.md | 16 +++++++++++++--- keps/sig-node/3008-qos-class-resources/kep.yaml | 6 +++--- 2 files changed, 16 insertions(+), 6 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 5a16a8e514b..c7772020779 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -1373,6 +1373,13 @@ phase. A feature gate QOSResources enables kubelet to update the QoS-class resources in NodeStatus and handle QoS-class resource requests in the PodSpec. +`<<[UNRESOLVED @logicalhan]>>` + +Use feature gate name `PrioritizedResources` instead of `QOSResources`. If +changed, this needs to be synced with all of the naming in the KEP. + +`<<[/UNRESOLVED]>>` + ### API server Input validation of QoS-class resource names and class names is implemented. @@ -1841,7 +1848,9 @@ that might indicate a serious problem? Implementation Phase 1: Watch for non-ready pods with CreateContainerError status. The error message will indicate the if the failure is related to QoS-class resources. Generally, pod events would be a good source for -determining if problems are related to QoS-class resources feature. +determining if problems are related to the QoS-class resources feature. For +kube-scheduler and kubelet the existing metrics (e.g. +`kubelet_started_containers_errors_total`) can be used. Future implementation phases: TBD.
@@ -2270,7 +2279,9 @@ solution. Downsides include: - requires implementation of "side channel" control mechanisms, e.g. admission webhook and some solution for capacity management (extended resources) -- deployment of admission webhooks is cumbersome +- deployment of admission webhooks is cumbersome (validating admission policies + can mitigate this somewhat if/when support for mutating admission is + implemented) - management of capacity is limited and cumbersome - management of extended resources needs a separate mechanism - ugly to handle a scenario where *class-A* on *node-1* would be unlimited @@ -2282,7 +2293,6 @@ solution. Downsides include: - possible confusion for users regarding the double accounting (QoS-class resources and extended resources) - ### RDT-only The scope of the KEP could be narrowed down by concentrating on RDT only, diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index 27edab1021e..2ab2454f104 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -17,16 +17,16 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.28" +latest-milestone: "v1.29" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.28" + alpha: "v1.29" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled feature-gates: - - name: ClassResources + - name: QOSResources components: - kubelet - kube-apiserver From fb0d6e02116b40f94440af60413b80cee1727547 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Mon, 30 Oct 2023 15:06:43 +0200 Subject: [PATCH 86/92] KEP-3008: address review feedback from swatisehgal - fix typos - clarify capacity vs. availability vs. 
usage in future scheduler work - elaborate on Kubernetes-managed QoS resources --- .../3008-qos-class-resources/README.md | 24 +++++++++++++------ 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index c7772020779..ce1570c1de9 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -317,8 +317,8 @@ resources and start experimenting with them in Kubernetes: - extend the CRI protocol to allow runtime to communicate available QoS-class resources (the types of resources and the classes within) to kubelet - introduce a feature gate for enabling QoS-class resource support in kubelet -- extend PodSpec to support assignment of QoS-clsss resources -- extend NodeStatus to show availability/capacity of QoS-clsss resources on a +- extend PodSpec to support assignment of QoS-class resources +- extend NodeStatus to show availability/capacity of QoS-class resources on a node ### Future work @@ -413,8 +413,10 @@ would implement restrictions based on the namespace. The first implementation phase only adds basic filtering of nodes based on QoS-class resources and node scoring is not altered. However, the relevant scheduler plugins (e.g. NodeResourcesFit and NodeResourcesBalancedAllocation) -could be extended to do scoring based on the capacity, availability and usage -of QoS-class resources on the nodes. +could be extended to do scoring based on the capacity (maximum number of +assignments to a class), availability (how much of that still is available, +i.e. capacity minus the current number of assignments) and usage (the current +number of assignments to a class) of QoS-class resources on the nodes. #### Kubelet-initiated pod eviction @@ -522,12 +524,12 @@ QoS-class resources from the runtime to the client. 
This information includes: QoS-class resources may be bounded in the way that the number of applications that can be assigned to a specific class (of one QoS-class resource) on a node -can be limited. This limit is configuratble on a per-class (and per-node) +can be limited. This limit is configurable on a per-class (and per-node) basis. This can be used to e.g. limit access to a high-tier class. Pod-level and container-level QoS-class resources are independent resource types. However, specifying a container-level QoS-class resource something in -the pod-level request in PodSpec will be regarded by Kubernetres as a default +the pod-level request in PodSpec will be regarded by Kubernetes as a default for all containers of the pod. Currently we identify two types of container-level QoS-class resources (RDT and @@ -699,7 +701,9 @@ It would be possible to have QoS-class resources that would be managed by Kubernetes/kubelet instead of the container runtime. If we specify and manage official well-known QoS-class resource names in the API it would be possible to specify Kubernetes-internal names that the container runtime would know to -ignore (or not try to manage itself). +ignore (or not try to manage itself). E.g. any QoS-class resources with +`k8s.io/` prefix could be treated as Kubernetes-managed and ignored by the +container runtime. One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 memory knobs in linux (see @@ -1310,6 +1314,12 @@ without a `/` prefix). Namespaced (or fully qualified) names like `example.com/acme-qos` are not controlled and are meant for e.g. vendor or application specific QoS implementations. +The `k8s.io/` prefix is reserved for possible future +[Kubernetes-managed QoS-class resources](#kubernetes-managed-qos-class-resources). +Runtimes are not allowed to register QoS-class resources with `k8s.io/` prefix. 
+Runtimes should treat any QoS-class resource with `k8s.io/` as ones managed by +Kubernetes and consider assignments as informational-only. + `<<[/UNRESOLVED]>>` ```diff From 2cbaa023e4169625fcee1ba677b076bd5327f61e Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 2 Nov 2023 15:55:22 +0200 Subject: [PATCH 87/92] KEP-3008: change naming of "official" QoS resources --- .../3008-qos-class-resources/README.md | 40 +++++++++++-------- 1 file changed, 23 insertions(+), 17 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index ce1570c1de9..1fcfee53e63 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -701,9 +701,10 @@ It would be possible to have QoS-class resources that would be managed by Kubernetes/kubelet instead of the container runtime. If we specify and manage official well-known QoS-class resource names in the API it would be possible to specify Kubernetes-internal names that the container runtime would know to -ignore (or not try to manage itself). E.g. any QoS-class resources with -`k8s.io/` prefix could be treated as Kubernetes-managed and ignored by the -container runtime. +ignore (or not try to manage itself). E.g. any non-namespaced QoS-class +resources (one without `/` prefix in the name) would be treated as +Kubernetes-managed and ignored by the container runtime. See [Consts](#consts) +section below for details about QoS-class resource naming. One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 memory knobs in linux (see @@ -1309,17 +1310,21 @@ across different implementations. `<<[UNRESOLVED @sftim]>>` -The canonical Kubernetes names for QoS-class resources are non-namespaced (i.e. -without a `/` prefix). 
Namespaced (or fully qualified) names like +The canonical Kubernetes names for QoS-class resources come in two variants: + +- The `k8s.io/` prefix is reserved for "official" well-known runtime-managed + QoS resources. +- Non-namespaced (i.e. without a `/` prefix) names are reserved for + possible future + [Kubernetes-managed QoS-class resources](#kubernetes-managed-qos-class-resources). + Runtimes are not allowed to register QoS-class resources with `k8s.io/` + prefix. Runtimes should treat any non-namespaced QoS-class resource as + one managed by Kubernetes and consider assignments as informational-only. + +Namespaced (or fully qualified) names outside `k8s.io/` like `example.com/acme-qos` are not controlled and are meant for e.g. vendor or application specific QoS implementations. -The `k8s.io/` prefix is reserved for possible future -[Kubernetes-managed QoS-class resources](#kubernetes-managed-qos-class-resources). -Runtimes are not allowed to register QoS-class resources with `k8s.io/` prefix. -Runtimes should treat any QoS-class resource with `k8s.io/` as ones managed by -Kubernetes and consider assignments as informational-only. - `<<[/UNRESOLVED]>>` ```diff @@ -1327,18 +1332,19 @@ Kubernetes and consider assignments as informational-only. + // QOSResourceRdt is the name of the QoS-class resource named IntelRDT + // in the OCI runtime spec and interfaced through the resctrlfs + // pseudp-filesystem in Linux. This is a container-level reosurce. -+ QOSResourceIntelRdt = "rdt" ++ QOSResourceIntelRdt = "k8s.io/rdt" + // QOSResourceBlockio is the name of the blockio QoS-class resource. + // This is a container-level resource. -+ QOSResourceBlockio = "blockio" ++ QOSResourceBlockio = "k8s.io/blockio" +) ``` In later implementation phases (Beta) admission control (validation) is added to -reject requests for unknown QoS-class resources in the "official" namespace.
-Also (in Beta), kubelet will reject the registration of unknown QoS-class resources in -the "official" namespace. Custom/vendor-specific QoS-class resources will still -be allowed outside the "official" namespace. +reject requests for unknown QoS-class resources in the "official" namespaces +(unprefixed or `k8s.io/`). Also (in Beta), kubelet will reject the registration +of unknown QoS-class resources in the "official" namespaces (unprefixed or +`k8s.io/`). Custom/vendor-specific QoS-class resources will still be allowed +outside the "official" namespaces. ### Kubelet From f2001a46a4b9a6efc699232c0bd9afcf5e4989fc Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 4 Jan 2024 16:17:56 +0200 Subject: [PATCH 88/92] KEP-3008: update - update k8s version in kep.yaml - update the container runtimes section, mention NRI API - add Cluster autoscaler in the future work - Motivation: mention NRI API and a small rewording - clarify goals - small updates on user stories --- .../3008-qos-class-resources/README.md | 70 ++++++++++--------- .../3008-qos-class-resources/kep.yaml | 4 +- 2 files changed, 40 insertions(+), 34 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 1fcfee53e63..65ca0d2df01 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -74,6 +74,7 @@ SIG Architecture for cross-cutting KEPs). - [Scheduler improvements](#scheduler-improvements) - [Kubelet-initiated pod eviction](#kubelet-initiated-pod-eviction) - [Default and limits](#default-and-limits) + - [Cluster autoscaler](#cluster-autoscaler) - [API objects for resources and classes](#api-objects-for-resources-and-classes) - [Proposal](#proposal) - [User Stories (Optional)](#user-stories-optional) @@ -221,7 +222,9 @@ resources. This KEP identifies two technologies that can immediately be enabled with QoS-class resources. 
However, these are just two examples and the proposed changes are generic (and not tied to these two QoS-class resource types in any -way), making it easier to implement new QoS-class resource types. +way), making it easier to implement new QoS-class resource types. For example, +the [NRI API][nri-api] would be a good mechanism to implement new QoS-class +resources. [Intel RDT][intel-rdt] implements a class-based mechanism for controlling the cache and memory bandwidth QoS of applications. All processes in the same @@ -251,28 +254,24 @@ annotations on a Kubernetes Pod. The goal of this KEP is to get these types of resources first class citizens and properly supported in Kubernetes, providing visibility, a well-defined user interface, and permission controls. - - -We can identify two types, container-level and pod-level QoS-class resources. -Container-level resources enable QoS on per-container granularity, for example -container-level cgroups in Linux or cache and memory bandwidth control -technologies. Examples for pod-level QoS include e.g. pod-level cgroups or -network QoS that cannot support per-container granularity. +Two types of QoS-class resources are identified: container-level and pod-level +QoS-class resources. Container-level resources enable QoS on per-container +granularity, for example container-level cgroups in Linux or cache and memory +bandwidth control technologies. Examples of pod-level QoS include +pod-level cgroups or network QoS that cannot support per-container granularity. ### Goals -- Make it possible to request QoS-class resources - - Support RDT class assignment of containers. This is already supported by - the containerd and CRI-O runtime and part of the OCI runtime-spec - - Support blockio class assignment of containers.
- - Support Pod-level (sandbox-level) QoS-class resources -- Make the API to support updating QoS-class resource assignment of running containers -- Make the extensions flexible, enabling simple addition of other QoS-class - resource types in the future. -- Make QoS-class resources opaque (as possible) to the CRI client -- Discovery of the available QoS-class resources -- Resource status/capacity +- Make it possible to request QoS-class resources from the PodSpec + - Container-level QoS-class resources + - Pod-level (sandbox-level) QoS-class resources +- Make it simple to implement new types of QoS-class resources +- Make QoS-class resources opaque (as far as possible) to Kubernetes +- Support automatic discovery of the available QoS-class resources +- Support per-node status/capacity of QoS-class resources - Access control ([future work](#future-work)) +- Support updating QoS-class resource assignment of running containers + ([future work](#in-place-pod-vertical-scaling)) ### Non-Goals @@ -479,6 +478,13 @@ Use field name `Ceiling` instead `Capacity` in QOSResourceClassLimit. Not supporting Max (i.e. only supporting Default) in LimitRanges could simplify the API. +#### Cluster autoscaler + +The cluster autoscaler support will be extended to support QoS-class resources. +The behavior will be comparable to extended resources. The expectation would be +that all nodes in a node group would have an identical set of QoS-class +resources. + #### API objects for resources and classes `<<[UNRESOLVED]>>` @@ -585,7 +591,8 @@ spec: As a vendor I want to implement custom QoS controls as an extension of the container runtime. I want my QoS control to be visible in the cluster and integrated e.g. in the Kubernetes sheduler and not rely e.g. on Pod annotations -to communicate QoS requests. +to communicate QoS requests. I will implement my QoS-class resources as an +[NRI API][nri-api] plugin.
#### Defaults and limits @@ -1458,19 +1465,17 @@ Container QoS resources: ### Container runtimes -Currently, there is support (container-level QoS-class resources) for Intel RDT -and blockio in CRI-O and containerd runtimes: - -- cri-o: - - [~~Add support for Intel RDT~~](https://github.com/cri-o/cri-o/pull/4830) - - [~~Support for cgroups blockio~~](https://github.com/cri-o/cri-o/pull/4873) -- containerd: - - [~~Support Intel RDT~~](https://github.com/containerd/containerd/pull/5439) - - [~~Support for cgroups blockio~~](https://github.com/containerd/containerd/pull/5490) +There is support (container-level QoS-class resources) for Intel RDT +and blockio in CRI-O ([~~#4830~~](https://github.com/cri-o/cri-o/pull/4830), +[~~#4873~~](https://github.com/cri-o/cri-o/pull/4873)) and containerd +([~~#5439~~](https://github.com/containerd/containerd/pull/5439), +[~~#5490~~](https://github.com/containerd/containerd/pull/5490)) runtimes. +The current user interface is provided through pod and container annotations. +The plan is to start using QoS-class resources instead of annotations. -The design paradigm here is that the container runtime configures the QoS-class -resources according to a given configuration file. Enforcement on containers is -done via OCI. User interface is provided through pod and container annotations. +The plan is also to extend the [NRI API][nri-api] +(Node Resource Interface) to support QoS-class resources, allowing for example +the implementation of new types of QoS-class resources as NRI plugins. Container runtimes will be updated to support the [CRI API extensions](#cri-api) @@ -2347,3 +2352,4 @@ required. 
[oci-runtime-rdt]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config-linux.md#IntelRdt [pod-qos-class]: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/ [dra-kep]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation +[nri-api]: https://github.com/containerd/nri diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index 2ab2454f104..313131ba42d 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -17,11 +17,11 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.29" +latest-milestone: "v1.30" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.29" + alpha: "v1.30" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled From 7214ab4cbdeb4837be7c3602ca10d237f149d2e7 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Tue, 9 Jan 2024 19:39:31 +0200 Subject: [PATCH 89/92] KEP-3008: PRR: add details on rollback failures --- keps/sig-node/3008-qos-class-resources/README.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 65ca0d2df01..3cbeb0e6e6e 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -1795,11 +1795,24 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`. Yes it can. Running workloads continue to work without any changes. 
Restarting or -re-deploying a workload causes it to fail as the requested QoS-class resources +re-deploying a workload may cause it to fail as the requested QoS-class resources are not available in Kubernetes anymore. The resources are still supported by the underlying runtime but disabling the feature in Kubernetes makes them unavailable and the related PodSpec fields are not accepted in validation. +In particular: + +- kubelet will ignore any QoS-class resources specified in the PodSpec if the + feature is disabled. New pods and containers (re-)created on the node + effectively have no QoS set as the information about QoS-class resource + assignments is not propagated to the CRI runtime. +- kube-scheduler will ignore any QoS-class resource requests if the feature is + disabled. This may cause pod admission failures on the node if the feature + is enabled in kubelet. +- kube-apiserver will reject creation of new pods with QoS-class resource + requests if the feature is disabled. This may cause e.g. failures in + re-deployment of an application that previously succeeded. + ###### What happens if we reenable the feature if it was previously rolled back? 
Workloads might have failed because of unsupported fields in the pod spec and From 1c6b5298965ac75f26086774f907dc73a761e8ff Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Fri, 26 Jan 2024 18:49:38 +0200 Subject: [PATCH 90/92] KEP-3008: small update to summary --- keps/sig-node/3008-qos-class-resources/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 3cbeb0e6e6e..2d89d0d2cb2 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -197,6 +197,8 @@ aimed at enabling) are: - multiple containers can be assigned to the same class of a certain type of QoS-class resource +- classes are mutually exclusive - a given entity (pod or container) can only + have one class assigned for each type of QoS-class resource resource - QoS-class resources are represented by an enumerable set of class identifiers - each type of QoS-class resource has an independent set of class identifiers From 251b0ca3e1b92b350e18cfe2b4111ae1b7ecb1de Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Thu, 1 Feb 2024 21:03:09 +0200 Subject: [PATCH 91/92] KEP-3008: update cluster autoscaler Addressing review feedback from thocking. --- keps/sig-node/3008-qos-class-resources/README.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 2d89d0d2cb2..930e3c2eaaa 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -483,9 +483,19 @@ the API. #### Cluster autoscaler The cluster autoscaler support will be extended to support QoS-class resources. -The behavior will be comparable to extended resources. The expectation would be -that all nodes in a node group would have an identical set of QoS-class -resources. 
+The expectation would be that all nodes in a node group would have an identical +set of QoS-class resources (same resource types, same set of classes with the +same capacity for each). + +The functionality will be comparable to extended resources. The cluster +autoscaler uses the Kubernetes scheduler framework for running simulations. +Scaling up node groups with one or more nodes works practically without any +changes as the Kubernetes scheduler handles QoS-class resources and the cluster +autoscaler can correctly run the simulation based on the QoS-class resources +available on the existing node(s) of the node group. However, to support +scaling of empty node groups needs to be worked on, implementing specific +mechanisms (e.g. annotations) for each infrastructure provider to inform the +autoscaler about the QoS-class resources of node groups. #### API objects for resources and classes From fc082b2691fb77558983ee6ef1a9445ecf07ca85 Mon Sep 17 00:00:00 2001 From: Markus Lehtonen Date: Wed, 10 Apr 2024 12:21:20 +0300 Subject: [PATCH 92/92] KEP-3008: update - bump versions in kep.yaml - fix typos - added a use case for per-container OOM kill behavior --- keps/sig-node/3008-qos-class-resources/README.md | 6 +++++- keps/sig-node/3008-qos-class-resources/kep.yaml | 4 ++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3008-qos-class-resources/README.md b/keps/sig-node/3008-qos-class-resources/README.md index 930e3c2eaaa..2bdf0bc80b3 100644 --- a/keps/sig-node/3008-qos-class-resources/README.md +++ b/keps/sig-node/3008-qos-class-resources/README.md @@ -495,7 +495,7 @@ autoscaler can correctly run the simulation based on the QoS-class resources available on the existing node(s) of the node group. However, to support scaling of empty node groups needs to be worked on, implementing specific mechanisms (e.g. annotations) for each infrastructure provider to inform the -autoscaler about the QoS-class resources of node groups. 
+autoscaler about the QoS-class resources available on node groups. #### API objects for resources and classes @@ -729,6 +729,10 @@ One possible usage-scenario would be pod-level cgroup controls, e.g. cgroup v2 memory knobs in linux (see [KEP-2570: Support Memory QoS with cgroups v2][kep-2570]. +QoS-class resources could be used to specify OOM kill behavior of individual +containers +(ref [PR for disabling group oom kill](https://github.com/kubernetes/kubernetes/pull/122813)). + Another possible usage could be to allow the user to explicitly specify the desired Pod QoS class of the application (instead of implicitly deriving it from the resource requests/limits). diff --git a/keps/sig-node/3008-qos-class-resources/kep.yaml b/keps/sig-node/3008-qos-class-resources/kep.yaml index 313131ba42d..08b88b7b1e2 100644 --- a/keps/sig-node/3008-qos-class-resources/kep.yaml +++ b/keps/sig-node/3008-qos-class-resources/kep.yaml @@ -17,11 +17,11 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.30" +latest-milestone: "v1.31" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.30" + alpha: "v1.31" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled