KEP-3008: QoS-class resources #3004
Conversation
Skipping CI for Draft Pull Request.
/cc @kad @haircommander
Hi @marquiz All KEP PRs must have an open issue in k/enhancements (this repo). Please open an issue, fill it out completely, and rename this PR to include the KEP issue number, both in the title of this PR and in your README.md. Thanks!
Thanks for the guidance @kikisdeliveryservice. Done
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/retitle KEP-3008: Class-based resources
I'm trying to capture here a summary of the PR evolution and status. The PR has been reviewed by numerous people and has so many comments/conversations that it is impossible for me to summarize everything in a nice way. Bits that are marked as unresolved as of today:
Some previous concerns/shortcomings that have later been addressed:
Ran out of time for this session, more later.
to other Kubernetes resource types (i.e. native resources such as `cpu` and
`memory` or extended resources) because you can assign that resource to a
particular container. However, QoS-class resources are also different from
those other resources because they are used to assign a _class identifier_,
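To make the contrast in the quoted text concrete, here is a minimal hedged sketch in Go of what a per-container QoS-class assignment could look like; the type and field names are illustrative assumptions, not the API the KEP defines.

```go
// Sketch only: a container names one class per QoS-class resource instead of
// requesting a quantity as it would for cpu or memory. The type and field
// names here are assumptions for illustration.
type ContainerQoSResources struct {
	// Classes maps a QoS-class resource name (e.g. "rdt") to the requested
	// class identifier (e.g. "gold") -- a label-like assignment, not an amount.
	Classes map[string]string
}
```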
Do we have concrete examples of things where we want to apply QoS, but which REALLY don't feel like "resources"?
While "QoS" is somewhat jargon-y, it's pretty well understood in my experience. Alternately, if this is just about names, we could look for similar words?
"banded resources" ? It sort of implies linearness of bands
"tiered resources" ? Also implies linearity, which I don't think is necessarily true
"categorized" ?
"qualitative" ?
+ Name QoSResourceName
+ // Allowed classes.
+ Classes []string
+ // Capacity is the hard limit for usage of the class.
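For context, a hedged sketch of how the quoted fields might sit in a complete Go type; everything except the three quoted fields and their comments is an assumption, not the KEP's final API.

```go
// QoSResourceInfo is a sketch of a node-status entry for one QoS-class
// resource. Only Name, Classes and Capacity come from the quoted hunk; the
// surrounding type name and the int64 capacity are illustrative assumptions.
type QoSResourceInfo struct {
	Name QoSResourceName
	// Allowed classes.
	Classes []string
	// Capacity is the hard limit for usage of the class.
	Capacity int64
}

// QoSResourceName is the name of a QoS-class resource (assumed string alias).
type QoSResourceName string
```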
I don't love that this is punted with no real idea how we might solve it
#### Cluster autoscaler

The cluster autoscaler support will be extended to support QoS-class resources.
I find this paragraph vague and unconvincing. :)
Autoscaling has really become a requirement for many users, and this text doesn't explain to me how the autoscalers (we now have more than one!) are able to reason about "if I make a new node of shape X, it will satisfy this pod". We left this for "later" in other efforts and it really bit us in the butt.
If you think you have an answer, can you please flesh this out? If you don't think you have an answer, we need to think about it.
Fair point @thockin (especially in the light of the recent DRA activities). I admit that this is vague and lacking details.
I will get back to this. My plan is now to have a PoC of a cluster autoscaler to have confidence that it can be done and write out the details here.
@thockin no offence taken 😅
It took me a while to get back to this as I was not really familiar with the cluster autoscaler details. I wanted to try it out in practice, so I did a PoC implementation based on a kind cluster with CAPD deployed, running cluster-autoscaler against that. Scaling up node groups with one (or more) existing nodes worked pretty much out-of-the-box (after building cluster-autoscaler against my PoC Kubernetes scheduler implementation). Because all of the scheduling logic (of QoS resources) is in the k/k scheduler itself, the autoscaler correctly runs simulations and can determine how many nodes in which node groups need to be scaled up (based on the QoS resources available on existing nodes), also providing log messages explaining why a certain node group isn't suitable. What is missing is the provider-specific mechanism to inform the cluster autoscaler about the QoS resources of empty node groups, but I wouldn't expect that to be an insurmountable problem (following along the lines of what is done for extended resources). I updated the KEP accordingly. WDYT?
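To illustrate why the simulations work, here is a rough hedged sketch of the kind of fit check the scheduler performs; the function shape and data layout are assumptions, not the actual PoC code. Because a check like this lives in the scheduler's filtering logic, the cluster autoscaler can reuse it unchanged when simulating whether a scaled-up node would fit a pending pod.

```go
// qosFits reports whether a node offers every QoS class requested by a pod.
// requested maps QoS-class resource name -> requested class; nodeClasses maps
// QoS-class resource name -> classes the node exposes.
func qosFits(requested map[string]string, nodeClasses map[string][]string) bool {
	for res, class := range requested {
		classes, ok := nodeClasses[res]
		if !ok {
			return false // node does not expose this QoS-class resource at all
		}
		found := false
		for _, c := range classes {
			if c == class {
				found = true
				break
			}
		}
		if !found {
			return false // resource present but the requested class is missing
		}
	}
	return true
}
```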
Why should this KEP _not_ be implemented?
-->

## Alternatives
Can this not be done with existing device plugins framework and CDI https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md#cdi-json-specification?
@aojea I cannot think how CDI could satisfy all the requirements here 🤔 First there's the Kubernetes API (for which CDI is of no help). Then, complicating the device plugin API for the QoS resources hasn't even been considered, tbh (instead, we try to make kubelet's life easier).
Addressing review feedback from thockin.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: marquiz The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Updated details on cluster autoscaler
- bump versions in kep.yaml
- fix typos
- added a use case for per-container OOM kill behavior
Small update: added per-container OOM kill behavior as a use case. There is a growing list of use cases for Kubernetes-managed QoS resources, with links to ongoing efforts (e.g. swap and OOM kill).
@marquiz As the person who's been most actively pushing for the OOM config option and who just found out about this KEP, I strongly agree that this is a great fit! This design seems awesome and is a really nice generalization of the specific problem I was hoping to have solved!
container whose memory limit is higher than the requests, but that is treated
by the node as `Guaranteed`.

Taking this idea further, QoS-class resources could also make it possible to
Maybe kubernetes/kubernetes#78848 is a good tangible use case to mention.
Thanks @kannon92 for the reference, I'll add it to the proposal. This is already speculated on in the Splitting Pod QoS Class section, but I wasn't aware of this open issue.
QoS-class resource. Likely benefits of using the QoS-class resources mechanism
would be to be able to set per-namespace defaults with LimitRanges and allow
permission-control to high-priority classes with ResourceQuotas.
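As a hedged illustration of the ResourceQuota idea in the quoted text, the sketch below builds a quota object with the existing core/v1 Go types; the `qos.example.com/rdt-gold` resource name is a made-up placeholder, and how QoS classes would actually surface as quota-able names is an assumption.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A ResourceQuota that would cap how many pods in a namespace may use a
	// hypothetical high-priority QoS class, assuming QoS-class resources are
	// surfaced as quota-able resource names.
	quota := corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "rdt-gold-quota", Namespace: "team-a"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceName("qos.example.com/rdt-gold"): resource.MustParse("2"),
			},
		},
	}
	fmt.Printf("%+v\n", quota.Spec.Hard)
}
```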
Another idea is smarter OOM setups (systemd-oomd, for example).
systemd-oomd allows one to configure OOM settings (PSI-based pressure and swap usage) on a cgroup slice. I could envision a future where one can create a QoS class that gives knobs to control systemd-oomd for more aggressive OOM handling.
https://fedoraproject.org/wiki/Changes/EnableSystemdOomd
I was playing around with this and explored the idea of setting different knobs based on the existing kubepods slices.
Interesting. So you're thinking about pre-defined OOM kill classes (in contrast to e.g. exposing every possible systemd-oomd knob to the user). At first thought, this sounds like a good fit. I'll add this to the proposal too, as a possible future use case.
that kubelet could evict a running pod that requests QoS-class resources that
are no longer available on the node. This should be relatively straightforward to
implement as kubelet knows what QoS-class resources are available on the node
and also monitors all running pods.
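A rough hedged sketch of the eviction check described in the quoted text; the function shape and inputs are assumptions for illustration, not kubelet code.

```go
// podsToEvict lists pods that request a QoS class the node no longer offers.
// available maps QoS-class resource -> classes currently exposed by the node;
// podRequests maps pod name -> (QoS-class resource -> requested class).
func podsToEvict(available map[string][]string, podRequests map[string]map[string]string) []string {
	var evict []string
	for pod, reqs := range podRequests {
		for res, class := range reqs {
			if !hasClass(available[res], class) {
				evict = append(evict, pod)
				break
			}
		}
	}
	return evict
}

func hasClass(classes []string, class string) bool {
	for _, c := range classes {
		if c == class {
			return true
		}
	}
	return false
}
```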
Some areas where the eviction manager is starting to show its age:
- If one wants to add more disks to kubelet, we lose the ability for the eviction manager to monitor them.
- PSI based eviction
- Swap based eviction
- Moving things out of the root partition (logs come to mind)
I've been thinking that eventually we are going to need some kind of pluggable eviction manager based on certain resources. It should be general, but I haven't really made much progress on it. I wanted to throw this out there as it may be worth considering.
@kannon92 thanks for the idea. Do you have any PoC implementation of this? If you do, it would be nice to wire it up to my code and see how it works.
<!-- References -->
[intel-rdt]: https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
[linux-resctrl]: https://www.kernel.org/doc/html/latest/x86/resctrl.html
This URL now 404s
Hey @marquiz, it's great to see that this is making progress! Do you have any updates on the timeline for when this version is planned to be released? The KEP indicates it's planned for 1.31, but I was wondering if it might end up being pushed to 1.32 or later instead.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
New KEP for adding class-based resources to CRI protocol