Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dependency on downwardAPI volumes #4937

Closed
hWorblehat opened this issue Jun 3, 2022 · 10 comments
Closed

Remove dependency on downwardAPI volumes #4937

hWorblehat opened this issue Jun 3, 2022 · 10 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@hWorblehat
Copy link
Contributor

I'm trying to use tekton on my AKS cluster and would like to run task pods on ACI virtual kubelets. Unfortunately, the ACI virtual kubelet provider doesn't currently support downwardAPI volumes which are used by tekton task pods. There is an issue in the ACI provider project (virtual-kubelet/azure-aci#91) requesting support, but I doubt there will be any progress on it any time soon, given that there's no support for a volume type in the underlying Azure service that might be adapted for this purpose (see the left-hand menu options here for supported volume types).

Removing (or providing an optional alternative to) the single use of a downwardAPI volume in tekton would allow it to be used with ACI virtual kubelets.

The following comment gives me hope that this is something that's already being planned:

// TODO(#1605): Signal sidecar readiness by injecting entrypoint,
// remove dependency on Downward API.
downwardVolume = corev1.Volume{

However, the referenced issue #1605 appears to be much broader, doesn't specifically talk about the removal of downwardAPI and looks on the verge of being declared 'done'.

As I understand it, this volume is used solely for signalling that all sidecars are running before allowing the first step to proceed. I might be happy to offer a contribution here, but I don't currently understand what exactly is meant by the comment:

Signal sidecar readiness by injecting entrypoint

Perhaps @imjasonh or someone else could provide some illumination?

An alternative quick and dirty fix for the cases where sidecars aren't used would be to simply remove the wait arguments to the first entrypoint (and the volume itself) based on similar logic to the shouldAddReadyAnnotationOnPodCreate function:

argsForEntrypoint = []string{
// First step waits for the Downward volume file.
"-wait_file", filepath.Join(downwardMountPoint, downwardMountReadyFile),
"-wait_file_content", // Wait for file contents, not just an empty file.

pipeline/pkg/pod/pod.go

Lines 374 to 378 in 3a305a0

// shouldAddReadyAnnotationOnPodCreate returns a bool indicating whether the
// controller should add the `Ready` annotation when creating the Pod. We cannot
// add the annotation if Tekton is running in a cluster with injected sidecars
// or if the Task specifies any sidecars.
func shouldAddReadyAnnotationOnPodCreate(ctx context.Context, sidecars []v1beta1.Sidecar) bool {

@imjasonh
Copy link
Member

imjasonh commented Jun 3, 2022

Some background on why we use the Downward API today:

When there are sidecars, we need some way for the controller to send a signal to the first step's injected entrypoint binary that it's okay to start. Today, the controller does this by observing the Ready state of any sidecars, and updating an annotation which is projected via downward API where the first step can observe the file contents change and know to start.

Today, sidecar containers don't have the entrypoint binary injected, and just start, become ready, run, and are stopped, just like regular containers in a Pod. Because they're just regular containers in a Pod.

It works today, mostly, but it's a bit brittle, and it doesn't work in scenarios like yours where the downward API is not available. 😕

The general idea behind the "Signal sidecar readiness by injecting entrypoint" comment was that, instead of having sidecars be "normal" containers, to inject the entrypoint binary into them too, like we do to order steps. This gives us more control over their lifecycle, and more visibility. We've talked about using this to more gracefully stop the sidecars, but it could also give us a way to detect if the sidecar container is ready, and inject logic to write a file in a shared volume to start the first step. If there are multiple sidecars, each would write an "I'm ready" file, and the first step would be configured to wait for all of them.

This would remove the need for the downward API, with the small added complexity of injecting the entrypoint binary on sidecars (we do it for all steps today, and it mostly works, so that's not that bad), and having the first step wait for multiple signals (I believe the code supports it today, but it isn't used in practice today AFAIK), and of needing to have some way for the injected entrypoint binary to detect the sidecar container's readiness (this is the big one IMO). The entrypoint can tell if it's started, but that's not the same as being Ready, and we'd have to think a bit about how to detect that reliably, without adding any more dependencies on brittle things like downward API.

If anybody has any ideas I'm all for them, the downward API has always been kind of a hack, and our usage of it especially. I'd be happy to meet and walk through code or brainstorm ideas if you ping me on Slack.

@hWorblehat
Copy link
Contributor Author

Thanks for the very quick response!

instead of having sidecars be "normal" containers, to inject the entrypoint binary into them too, like we do to order steps

Won't this present a problem for injected sidecars (which you currently also check the 'readiness' of)?

I'd be happy to meet and walk through code or brainstorm ideas if you ping me on Slack.

Thank you for the kind offer! I may take you up on it, but it won't be for a while (I'm afraid I'm a bit busy right now). Now that you've explained it, I can see that this is a rather more involved problem than I'd hoped. 😞

Would a temporary solution of having a feature flag disable-await-sidecar-readiness be palatable? After all, the normal pod contract is that containers must do their own checking if they need to wait for their sidecars to be in a particular state.

@imjasonh
Copy link
Member

imjasonh commented Jun 3, 2022

Won't this present a problem for injected sidecars (which you currently also check the 'readiness' of)?

Yes! Excellent point. Let's not forget injected sidecars (the bane of my existence).

This reminds me there's a feature flag to disable behavior to account for injected sidecars, running-in-environment-with-injected-sidecars: false, in case that's useful at all.

Would a temporary solution of having a feature flag disable-await-sidecar-readiness be palatable? After all, the normal pod contract is that containers must do their own checking if they need to wait for their sidecars to be in a particular state.

This sounds fine to me personally, you'd probably want to get feedback from others (@vdemeester @jerop @afrittoli @pritidesai) in case they can think of some reason not to do it.

@vdemeester
Copy link
Member

This sounds fine to me personally, you'd probably want to get feedback from others (@vdemeester @jerop @afrittoli @pritidesai) in case they can think of some reason not to do it.

That does sounds fine to me as well.

@hWorblehat
Copy link
Contributor Author

I've raised #4953 to hopefully provide a temporary solution for this. Let me know what you think.

@hWorblehat
Copy link
Contributor Author

#4953 is now merged, but it's only a partial solution. It's good enough for me, for the moment, but the Downward API volume is still needed if one requires waiting on sidecars.

I'll leave it up to the Tekton team to decide whether to close this issue or leave it open.

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2022
@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 9, 2022
@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Repository owner moved this from Todo to Done in Tekton Community Roadmap Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
Status: Done
Development

No branches or pull requests

4 participants