diff --git a/keps/sig-node/3008-cri-class-based-resources/README.md b/keps/sig-node/3008-cri-class-based-resources/README.md
new file mode 100644
index 000000000000..338e383bdcb0
--- /dev/null
+++ b/keps/sig-node/3008-cri-class-based-resources/README.md
@@ -0,0 +1,997 @@
+
+# KEP-3008: Class-based resources in CRI
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+    - [Story 3](#story-3)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [CRI protocol](#cri-protocol)
+  - [Pod annotations](#pod-annotations)
+  - [Container runtimes](#container-runtimes)
+  - [Open Questions](#open-questions)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+  - [Pod Spec](#pod-spec)
+  - [RDT-only](#rdt-only)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation, e.g. additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+We would like to add support for Linux class-based resources to the CRI
+protocol. These are non-accountable resources, each represented by a set of
+classes that specify the detailed configuration parameters of the resource.
+"Non-accountable" means that multiple containers can be assigned to a single
+class.
+
+A prime example of a class-based resource is Intel RDT (Resource Director
+Technology). RDT is a technology for controlling the cache lines and memory
+bandwidth available to applications. RDT provides a class-based approach for
+QoS control of these shared resources: a set of classes (fairly limited in
+number by the hardware) can be configured with individual limits for cache
+allocation and/or memory bandwidth, and processes are then assigned to these
+classes.
+
+We also believe that the Linux Block IO controller (cgroup) should be handled
+as a class-based resource on the level of container orchestration. This
+enables configuring I/O scheduler priority and throttling I/O bandwidth per
+workload. Having support for class-based resources in place also provides a
+framework for future additions, for instance class-based network or
+memory-type prioritization.
+
+In addition to the CRI protocol changes, we would like to introduce specific
+Pod annotations for controlling the RDT and blockio class of containers.
+
+## Motivation
+
+RDT implements a class-based means of controlling the cache and memory
+bandwidth QoS of applications, providing a tool for mitigating noisy
+neighbors and fulfilling SLAs. Behind the scenes the control happens via
+resctrl, a pseudo-filesystem provided by the Linux kernel, which makes the
+control interface virtually agnostic of the hardware architecture. The OCI
+runtime-spec has supported Intel RDT for a while already.
+
+The Linux Block IO controller parameters depend very heavily on the
+underlying hardware and system configuration (device naming/numbering, I/O
+scheduler configuration etc.), which makes them very impractical to control
+from the Pod spec level. In order to hide this complexity, the concept of
+blockio classes is being added to the container runtimes (CRI-O and
+containerd). A system administrator is able to configure blockio controller
+parameters on a per-class basis and the classes are then made available to
+CRI clients.
+
+Adding support for Pod annotations would provide an initial user interface
+(behind a feature gate) for the feature and enable easier
+testing/verification. These would bridge the gap between enabling class-based
+resources in the CRI protocol and making them available in the Pod spec.
+
+### Goals
+
+- Introduce RDT support, comparable to that in the OCI runtime-spec, into the
+  CRI protocol.
+- Make it possible to specify the RDT class of containers at the CRI level.
+- Make it possible to specify the blockio class of containers at the CRI
+  level.
+- Make the extensions flexible, enabling easy addition of other class-based
+  resource types in the future.
+
+### Non-Goals
+
+- Interface for configuring the class-based resources.
+
+## Proposal
+
+We extend the CRI protocol to contain information about the class-based
+resource assignment of containers. We define two class-based resources that
+can be controlled, namely RDT and blockio.
+
+We introduce a feature gate that enables kubelet to interpret Pod annotations
+for controlling the RDT and blockio class of containers.
+
+### User Stories (Optional)
+
+#### Story 1
+
+As a user I want to minimize interference from other applications with my
+workload by assigning it to a class with exclusive cache allocation.
+
+#### Story 2
+
+As a cluster administrator I want to control whether RDT support of the
+underlying system is made available to users.
+
+#### Story 3
+
+As a user I want to make sure my low-priority, I/O-intensive background task
+will not disturb more important workloads running on the same node.
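The annotation-based control in the proposal can be illustrated with a small Go sketch that resolves the effective class of a container, letting a container-specific annotation override the Pod-level default. The annotation keys follow the scheme proposed in the Pod annotations section of this KEP; the helper name and the class names used in the example are hypothetical.

```go
package main

import "fmt"

// classForContainer resolves the class of a container for a given resource
// type ("rdt" or "blockio") from Pod annotations: a container-specific
// annotation takes precedence over the Pod-level default.
func classForContainer(annotations map[string]string, resource, container string) (string, bool) {
	prefix := resource + ".resources.beta.kubernetes.io/"
	if c, ok := annotations[prefix+"container."+container]; ok {
		return c, true
	}
	c, ok := annotations[prefix+"pod"]
	return c, ok
}

func main() {
	// Hypothetical class names; the available classes are defined by the
	// container runtime configuration.
	ann := map[string]string{
		"rdt.resources.beta.kubernetes.io/pod":              "l3-default",
		"rdt.resources.beta.kubernetes.io/container.engine": "l3-exclusive",
	}
	c, _ := classForContainer(ann, "rdt", "engine")
	fmt.Println(c) // container-specific annotation wins: "l3-exclusive"
	c, _ = classForContainer(ann, "rdt", "sidecar")
	fmt.Println(c) // falls back to the Pod-level default: "l3-default"
}
```

In kubelet, the result of such a lookup would be what gets copied into the container's class-resource field in the CRI request, while validation of the class name itself is left to the runtime.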
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+- A user assigning a container to an "unauthorized" class, causing
+  interference and giving access to an unwanted set or amount of resources.
+  Mitigation: tackle this on the Pod spec level and partly in the container
+  runtime configuration.
+- Confusion: a user tries to assign a container to an RDT class, but RDT has
+  not been enabled on the system(s).
+- Keeping the client (kubelet) and the runtime in sync with respect to the
+  available classes.
+
+## Design Details
+
+Configuration and management of the resource classes is fully handled by the
+underlying container runtime and is invisible to kubelet. An error is
+returned to the CRI client if the specified class is not available.
+
+### CRI protocol
+
+The following additions to the CRI protocol are suggested:
+
+```diff
+ // LinuxContainerConfig contains platform-specific configuration for
+ // Linux-based containers.
+ message LinuxContainerConfig {
+     // Resources specification for the container.
+     LinuxContainerResources resources = 1;
+     // LinuxContainerSecurityContext configuration for the container.
+     LinuxContainerSecurityContext security_context = 2;
++    // LinuxContainerClassResources configuration of the container.
++    LinuxContainerClassResources class_resources = 3;
+ }
+
++// ClassResource enumerates the supported types of class-based resources.
++enum ClassResource {
++    RDT = 0;
++    BLOCKIO = 1;
++}
+
++// LinuxContainerClassResources specifies Linux-specific configuration of
++// class-based resources.
++message LinuxContainerClassResources {
++    // Resource classes the container will be assigned to, keyed by the
++    // type of the class-based resource.
++    map<string, string> class = 1;
++}
+```
+
+### Pod annotations
+
+A feature gate, `ResourceClassPodAnnotations`, enables kubelet to look for
+Pod annotations and set the RDT and blockio class of containers via the CRI
+protocol accordingly:
+
+- `rdt.resources.beta.kubernetes.io/pod` for setting a Pod-level default RDT
+  class for all containers
+- `rdt.resources.beta.kubernetes.io/container.<container-name>` for
+  container-specific RDT class settings
+- `blockio.resources.beta.kubernetes.io/pod` for setting a Pod-level default
+  blockio class for all containers
+- `blockio.resources.beta.kubernetes.io/container.<container-name>` for
+  container-specific blockio class settings
+
+### Container runtimes
+
+We have open PRs to implement class-based RDT and blockio support in CRI-O
+and containerd:
+
+- cri-o:
+  - [~~Add support for Intel RDT~~](https://github.com/cri-o/cri-o/pull/4830)
+  - [~~Support for cgroups blockio~~](https://github.com/cri-o/cri-o/pull/4873)
+- containerd:
+  - [Support Intel RDT](https://github.com/containerd/containerd/pull/5439)
+  - [Support for cgroups blockio](https://github.com/containerd/containerd/pull/5490)
+
+The design paradigm here is that the container runtime configures the
+resource classes according to a given configuration file. Enforcement on
+containers is done via the OCI runtime.
+
+### Open Questions
+
+#### LinuxContainerClassResources or ContainerConfig?
+
+Should `LinuxContainerClassResources` be moved "up" into `ContainerConfig`
+(and renamed `ContainerClassResources`)? This would make it generic, not just
+Linux-specific.
+
+```diff
+ message ContainerConfig {
+
+     ...
+     // Configuration specific to Linux containers.
+     LinuxContainerConfig linux = 15;
+     // Configuration specific to Windows containers.
+     WindowsContainerConfig windows = 16;
++
++    // Configuration of class resources.
++    ContainerClassResources class_resources = 17;
+ }
+
++// ContainerClassResources specifies the configuration of class-based
++// resources of a container.
++message ContainerClassResources {
++    // Resource classes the container will be assigned to, keyed by the
++    // type of the class-based resource.
++    map<string, string> class = 1;
++}
+```
+
+#### Pod QoS class
+
+Maybe we should communicate the Pod QoS class to the container runtime via
+class resources, too. Container runtimes (CRI-O, at least) already depend on
+this information and currently determine it indirectly by evaluating other
+CRI parameters. It would be better to state the Pod QoS class explicitly, and
+class resources would be a logical place for that. This also makes it
+technically possible to have container-specific QoS classes (as a possible
+future enhancement of Kubernetes).
+
+Communicating the Pod QoS class via class resources would advocate moving
+class resources up to `ContainerConfig`.
+
+It would also be possible to separate `oom_score_adj` from the Pod QoS class.
+The runtime could provide a set of OOM classes, making it possible for the
+user to specify a burstable Pod with a low OOM priority (a low chance of
+being killed).
+
+#### Class discovery and syncing
+
+Resource/class discovery and syncing: should the runtime be able to tell the
+client which resource types are available (and list the available classes of
+each)? As a reference, the API currently allows listing of some
+objects/resources (Pods, Containers, Images etc.) but not of some others.
+
+If listing is supported, the `ClassResource` enum would be dropped.
+
+#### Resource information on pod sandbox level
+
+Pod sandbox level configuration would make it possible for runtimes to make
+informed decisions on resource allocation before any containers are created.
+This could be a separate KEP, covering other resources, too.
+
+```diff
+ // LinuxPodSandboxConfig holds platform-specific configurations for Linux
+ // host platforms and Linux-based containers.
+ message LinuxPodSandboxConfig {
+     // Parent cgroup of the PodSandbox.
+     // The cgroupfs style syntax will be used, but the container runtime can
+     // convert it to systemd semantics if needed.
+     string cgroup_parent = 1;
+     // LinuxSandboxSecurityContext holds sandbox security attributes.
+     LinuxSandboxSecurityContext security_context = 2;
+     // Sysctls holds linux sysctls config for the sandbox.
+     map<string, string> sysctls = 3;
++    // Kubernetes resource spec of the containers of the Pod.
++    repeated LinuxContainerResourceConfig container_resources = 4;
+ }
+
++message LinuxContainerResourceConfig {
++    // Name of the container. Same as the container name in the PodSpec.
++    string name = 1;
++    // LinuxContainerClassResources configuration of the container.
++    LinuxContainerClassResources class_resources = 2;
++}
+```
+
+### Test Plan
+
+### Graduation Criteria
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [ ] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name:
+  - Components depending on the feature gate:
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+###### Does enabling the feature change any default behavior?
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+###### What specific metrics should inform a rollback?
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+###### Will enabling / using this feature result in introducing new API types?
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+### Pod Spec
+
+Instead of introducing Pod annotations as an intermediate solution for
+controlling the resource classes (RDT, blockio), the Pod spec could be
+updated in lock-step with the CRI.
+
+One straightforward way to do this could be to add a new field (e.g.
+`Classes`) into the `ResourceRequirements` of a Container.
+
+```diff
+// ResourceRequirements describes the compute resource requirements.
+type ResourceRequirements struct {
+	// Limits describes the maximum amount of compute resources allowed.
+	Limits ResourceList `json:"limits,omitempty"`
+	// Requests describes the minimum amount of compute resources required.
+	Requests ResourceList `json:"requests,omitempty"`
++	// Classes specifies the resource classes that the container should be
++	// assigned to.
++	Classes map[ClassResourceName]string `json:"classes,omitempty"`
+}
+
++// ClassResourceName is the name of a class-based resource.
++type ClassResourceName string
+
++const (
++	ClassResourceRdt     ClassResourceName = "rdt"
++	ClassResourceBlockio ClassResourceName = "blockio"
++)
+```
+
+Access control to the classes could be achieved by extending
+`ResourceQuotaSpec`.
+
+```diff
+// ResourceQuotaSpec defines the desired hard limits to enforce for Quota.
+type ResourceQuotaSpec struct {
+	// hard is the set of desired hard limits for each named resource.
+	Hard ResourceList
+	// A collection of filters that must match each object tracked by a quota.
+	// If not specified, the quota matches all objects.
+	Scopes []ResourceQuotaScope
+	// scopeSelector is also a collection of filters like scopes that must match each
+	// object tracked by a quota but expressed using ScopeSelectorOperator in combination
+	// with possible values.
+	ScopeSelector *ScopeSelector
++	// AllowedClasses specifies the list of allowed classes for each
++	// class-based resource.
++	AllowedClasses map[ClassResourceName]ResourceClassList
+}
+
++// ResourceClassList is a list of classes of a specific type of class-based
++// resource.
++type ResourceClassList []string
+```
+
+### RDT-only
+
+The scope of the KEP could be narrowed by concentrating on RDT only and
+dropping support for blockio. This would focus the effort on RDT, which is
+well understood and already specified in the OCI runtime specification.
+
+## Infrastructure Needed (Optional)
+
+For proper end-to-end testing of RDT, a cluster with nodes that have RDT
+enabled would be required. Similarly, for end-to-end testing of blockio,
+nodes with the blockio cgroup controller and a suitable I/O scheduler enabled
+would be required.