Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore!: remove legacy patch pods fallback #13100

Merged

Conversation

Garett-MacGowan
Copy link
Contributor

@Garett-MacGowan Garett-MacGowan commented May 27, 2024

PR Takeover #12976. Description is mostly verbatim

Motivation

Modifications

  • remove the Executor code that patches pods
  • remove the operator code that reads the patched annotations

Verification

Tests pass

@Garett-MacGowan Garett-MacGowan marked this pull request as draft May 27, 2024 13:43
@Garett-MacGowan Garett-MacGowan changed the title Chore exec remove patch pods fallback chore!: remove legacy patch pods fallback May 27, 2024
@Garett-MacGowan
Copy link
Contributor Author

@agilgur5 my WIP takeover. I'll finish this at some point this week.

@Garett-MacGowan
Copy link
Contributor Author

@agilgur5 Some of these tests are failing because some of the test workflows use the argo service account and the argo service account doesn't have create and patch permissions for workflowtaskresults. Would there be an issue with adding those permissions to the argo-role?

@agilgur5
Copy link
Member

agilgur5 commented Jun 8, 2024

some of the test workflows use the argo service account and the argo service account doesn't have create and patch permissions for workflowtaskresults.

Oh. That's surprising, that means these tests have been using the wrong RBAC for some time and relying on the fallback 😬
That could potentially explain some uncaught bugs too

Would there be an issue with adding those permissions to the argo-role?

Yes, especially in the official install manifests as you have them now. The Controller does not need those privileges and so shouldn't have them.
The Executor should really have its own SA & Role. We can add those to the test overlays. I don't remember which install the tests use, but if they already use the quick start, I believe there is an Executor Role in those manifests already (although I might be mistaken, it may just have excess privileges in general since it's a non-prod quick start).

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Jun 8, 2024

@agilgur5 yes, the tests use the various quick start manifests depending on the profile you choose. There is an executor role (same as in the docs) that is bound to the default service account. Are you suggesting that I bind the executor role to the argo service account in the e2e manifest mixins?

Below is one of the test cases where the argo service account needs the executor permissions. If I'm not mistaken, the template service account (not just the pod service account) needs the executor permissions. Since it is creating a pod resource, it also needs the create pod permissions, which are bound to the argo service account.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: k8s-resource-tmpl-with-pod-
spec:
  serviceAccount: argo
  entrypoint: main
  templates:
    - name: main
      serviceAccountName: argo
      resource:
        action: create
        setOwnerReference: true
        successCondition: status.phase == Succeeded
        failureCondition: status.phase == Failed
        manifest: |
          apiVersion: v1
          kind: Pod
          metadata:
            generateName: k8s-pod-resource-
          spec:
            serviceAccountName: argo
            containers:
            - name: argosay-container
              image: argoproj/argosay:v2
              command: ["/argosay"]
            restartPolicy: Never

See https://github.com/argoproj/argo-workflows/actions/runs/9424092511/job/25963802079

I could also create a dedicated service account for these tests with the required permissions, I suppose.

@agilgur5
Copy link
Member

agilgur5 commented Jun 8, 2024

that is bound to the default service account.

Ah right of course since quick start.

Are you suggesting that I bind the executor role to the argo service account in the e2e manifest mixins?

I could also create a dedicated service account for these tests with the required permissions, I suppose.

The latter would be more correct and least privilege (i.e. less possible future bugs with regard to the tests)

If I'm not mistaken, the template service account (not just the pod service account) needs the executor permissions

The Pod within the resource doesn't need an SA at all since it just runs argosay. Only the template needs the SA, which runs the create action and ofc needs Executor permissions.

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Jun 9, 2024

I just realized the pod create permissions are bound to the default service account with the pod-manager role. Switching the tests that used the argo service account to use the default service account instead.

@agilgur5 agilgur5 self-assigned this Jun 9, 2024
@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Jun 12, 2024

@agilgur5 I think this test is/was wrong? The container doesn't inherit permissions from the executor, right?

I changed

  executor:
    serviceAccountName: argo

to

  executor:
    serviceAccountName: default

The full test

func (s *WorkflowSuite) TestContainerTemplateAutomountServiceAccountTokenDisabled() {
	s.Given().Workflow(`
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: get-resources-via-container-template-
  namespace: argo
spec:
  serviceAccountName: argo
  automountServiceAccountToken: false
  executor:
    serviceAccountName: default
  entrypoint: main
  templates:
    - name: main
      container:
        name: main
        image: bitnami/kubectl
        command:
          - sh
        args:
          - -c
          - |
           kubectl get cm 
`).
		When().
		SubmitWorkflow().
		WaitForWorkflow(fixtures.ToBeSucceeded, time.Minute*11).
		Then().
		ExpectWorkflow(func(t *testing.T, metadata *metav1.ObjectMeta, status *wfv1.WorkflowStatus) {
			assert.Equal(t, wfv1.WorkflowSucceeded, status.Phase)
		})
}

The failure

    workflow_test.go:53: condition never and cannot be met because the workflow is done
Checking expectation get-resources-via-container-template-slp6q
get-resources-via-container-template-slp6q : Failed Error (exit code 1)
    workflow_test.go:56: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/workflow_test.go:56
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:69
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/workflow_test.go:55
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Failed"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.WorkflowPhase) (len=9) "Succeeded"
        	            	+(v1alpha1.WorkflowPhase) (len=6) "Failed"
        	            	 
        	Test:       	TestWorkflowSuite/TestContainerTemplateAutomountServiceAccountTokenDisabled

Shouldn't we be expecting the failure?

Edit:
Nevermind, I see what is going on. The executor service account is used for all the containers in a pod if AutomountServiceAccountToken is false.

@Garett-MacGowan Garett-MacGowan marked this pull request as ready for review June 14, 2024 06:12
@Garett-MacGowan
Copy link
Contributor Author

@agilgur5 Ready for review

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I totally forgot to add a release note in my initial variant of the PR -- we should add a small one to the 3.6 upgrading.md and just link to the Workflow RBAC page (where I removed the old RBAC in #12975)

@Garett-MacGowan Garett-MacGowan force-pushed the chore-exec-remove-patch-pods-fallback branch from a7e2653 to e46a675 Compare June 27, 2024 21:03
@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Jun 27, 2024

Oh I totally forgot to add a release note in my initial variant of the PR -- we should add a small one to the 3.6 upgrading.md and just link to the Workflow RBAC page (where I removed the old RBAC in #12975)

@agilgur5 added.

@Joibel
Copy link
Member

Joibel commented Jul 4, 2024

This looks good to me, thanks for the work on it @Garett-MacGowan.

I was wondering if we could modify the error message you see in the GUI to include a link to the documentation for fixing it as it's really common to see people with incorrect RBAC still. Currently you see

Error (exit code 1): workflowtaskresults.argoproj.io is forbidden: User "system:serviceaccount:argo:default" cannot create resource "workflowtaskresults" in API group "argoproj.io" in the namespace "argo"

which is probably obvious to you and I, but for most people who hope it'd just work, they'll be confused.

It doesn't have to be a part of this PR though.

I'll leave it to @agilgur5 to have a look over it.

@agilgur5 agilgur5 added this to the v3.6.0 milestone Jul 12, 2024
@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Jul 12, 2024

@Joibel Thanks. I could add a link to the GUI. Will put it in another PR.

@Joibel
Copy link
Member

Joibel commented Aug 29, 2024

@Garett-MacGowan, any chance you could bring this up to date so it can be merged?

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Aug 29, 2024

Sure, I'll work on it today

@jswxstw
Copy link
Member

jswxstw commented Sep 5, 2024

https://argo-workflows.readthedocs.io/en/stable/progress/#self-reporting-progress

Initially the progress of a workflows' pod is always 0/1. If you want to influence this, make sure to set an initial progress annotation on the pod:

This functionality will be affected, we will never be able to set the initial progress of the workflows' pod.

So, these lines should be kept, common.AnnotationKeyProgress can't be deleted directly. Related PR: #13260

// Set initial progress from pod metadata if exists.
if x, ok := pod.ObjectMeta.Annotations[common.AnnotationKeyProgress]; ok {
if p, ok := wfv1.ParseProgress(x); ok {
node, err := woc.wf.Status.Nodes.Get(nodeID)
if err != nil {
woc.log.Panicf("was unable to obtain node for %s", nodeID)
}
node.Progress = p
woc.wf.Status.Nodes.Set(nodeID, *node)
}
}

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Sep 6, 2024

Thanks @jswxstw.

@Joibel sorry, life stuff happened. I'll get to this today, hopefully.

- the fallback is old and insecure, and the error confuses users as it's not mentioned in the docs (as it's legacy and a fallback)
- it's also tech debt that we have to write code around specifically right now
- it's no longer needed and hasn't been the main RBAC in a few versions, so remove it in the next minor

- remove the Executor code that patches pods
- remove the operator code that reads the patched annotations

Signed-off-by: Anton Gilgur <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Sep 9, 2024

I was wondering if we could modify the error message you see in the GUI to include a link to the documentation for fixing it as it's really common to see people with incorrect RBAC still. Currently you see

Error (exit code 1): workflowtaskresults.argoproj.io is forbidden: User "system:serviceaccount:argo:default" cannot create resource "workflowtaskresults" in API group "argoproj.io" in the namespace "argo"

Could you point me to where this is displayed in the UI? It has been a while. Thanks.

@Garett-MacGowan Garett-MacGowan force-pushed the chore-exec-remove-patch-pods-fallback branch from e46a675 to ac3d68f Compare September 9, 2024 04:52
@Garett-MacGowan
Copy link
Contributor Author

@Joibel @agilgur5 Ready for review.

@jswxstw Do you mind taking a look to verify that what I've done is what you had in mind?

Thanks.

@jswxstw
Copy link
Member

jswxstw commented Sep 10, 2024

@jswxstw Do you mind taking a look to verify that what I've done is what you had in mind?

It looks good to me now.

By the way, does this document need to be updated since its content is outdated?

The executor will read this file every 3s and if there was an update, patch the pod annotations with workflows.argoproj.io/progress: N/M.

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Sep 11, 2024

What in particular is out of date? The pod patches still happen, they just need the RBAC now. See monitorProgress of executor.go

@jswxstw
Copy link
Member

jswxstw commented Sep 12, 2024

Aren’t pod patches removed by the current PR and progress can only be reported to WorkflowTaskResult now?

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Sep 13, 2024

Aren’t pod patches removed by the current PR and progress can only be reported to WorkflowTaskResult now?

Right, sorry. The updates to progress are done via WorkflowTaskResult and are reflected by

newNode.Progress = result.Progress
...
woc.wf.Status.Nodes.Set(nodeID, *newNode)

in taskResultReconciliation. The pod annotations are not updated anymore.

How about the following change?

Instead of:

The executor will read this file every 3s and if there was an update, patch the pod annotations with workflows.argoproj.io/progress: N/M. The controller picks this up and writes the progress to the appropriate Status properties every 1m.

We can say:

The executor will read this file every 3s, and if there was an update, create a WorkflowTaskResult containing the progress update. The controller picks this up and writes the progress to the appropriate Status properties.

@Garett-MacGowan
Copy link
Contributor Author

Garett-MacGowan commented Sep 13, 2024

Wait... I just checked the file on my branch. It has already been updated...

The executor will read this file every 3s and if there was an update, patch the matching WorkflowTaskResults resource with the value N/M. The controller picks this up and writes the progress to the appropriate Status properties
every 1m.

@Garett-MacGowan
Copy link
Contributor Author

@agilgur5 I think this is ready for review.

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion, but LGTM

workflow/controller/controller_test.go Outdated Show resolved Hide resolved
Co-authored-by: Alan Clucas <[email protected]>
Signed-off-by: Garett MacGowan <[email protected]>
@Garett-MacGowan
Copy link
Contributor Author

One suggestion, but LGTM

Thanks @Joibel

@Joibel Joibel merged commit 31493d3 into argoproj:main Sep 14, 2024
27 checks passed
@agilgur5
Copy link
Member

agilgur5 commented Sep 20, 2024

I was wondering if we could modify the error message you see in the GUI to include a link to the documentation for fixing it as it's really common to see people with incorrect RBAC still. Currently you see

Error (exit code 1): workflowtaskresults.argoproj.io is forbidden: User "system:serviceaccount:argo:default" cannot create resource "workflowtaskresults" in API group "argoproj.io" in the namespace "argo"

Could you point me to where this is displayed in the UI? It has been a while. Thanks.

Sorry I forgot to respond to this at the time Alan wrote it and then had not been feeling well for much of the past week.

This does not appear in the UI per se, it appears in logs, which can appear in the UI. I did specifically address this with an FAQ entry in #13041 (as listed in the PR description).
That should be changed for 3.6 then. EDIT: ah I did mention that in "Future Work" there

Alan specifically meant to catch that error and instead return a link to the docs. This line logs a link to the docs, but does not change the end error

roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: workflow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did this role not exist in the quick start?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was supposed to be removed / renamed in #7999 (see also this removed file) but got replaced by other files instead if I'm reading correctly

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait... I just checked the file on my branch. It has already been updated...

Re: progress, #12089 modified this and was recently merged (the PR is a year old otherwise). I actually mentioned @jswxstw's #13260 there too; I'm honestly not entirely sure how self-reported progress is supposed to work now and which versions it changed in -- I'm guessing in 3.4, same as WorkflowTaskResults in general. We should probably backport some of the documentation changes based on that, for instance.

I'm also not sure if the test helper withProgress and its usage should have been removed here then; it sounds like it perhaps should be modified similarly to use WorkflowTaskResults and that tested?

}
}
func withProgress(v string) with { return withAnnotation(common.AnnotationKeyProgress, v) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in this line

assert.Equal(t, wfv1.Progress("50/100"), pod.Progress)

// mock workflow uses legacy/insecure pod patch
makePodsPhase(ctx, woc, apiv1.PodSucceeded, withAnnotation(common.AnnotationKeyReportOutputsCompleted, "true"), withProgress("100/100"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and these assertions above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants