
Generate SCC-related resources when on OCP #179

Draft
wants to merge 3 commits into base: main

Conversation

LCaparelli
Member

@LCaparelli LCaparelli commented Oct 18, 2020

Fix #51

TODO:

  • Perform initial implementation
  • Write tests
  • Write documentation
  • Update release notes

Signed-off-by: Lucas Caparelli [email protected]

@LCaparelli LCaparelli added enhancement 👑 New feature or request WIP 👷‍♀️ Work In Progress PR Hacktoberfest labels Oct 18, 2020
@LCaparelli LCaparelli added this to the v0.4.0 milestone Oct 18, 2020
@LCaparelli
Member Author

LCaparelli commented Oct 18, 2020

@ricardozanini I wrote a "blind" initial implementation. Can you test it out? Just spinning up an instance with the community image should do the trick.

If I got this right, the Security Manager will realize it's on OCP and create:

  • the SCC
  • a Cluster Role which allows using the SCC
  • a (namespaced) Role Binding that binds the Service Account from the Nexus CR to that Cluster Role
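For reference, the three resources might look roughly like this (the names, UID and SCC settings are illustrative, not the operator's actual output):

```yaml
# Hypothetical names for illustration; the operator derives the real ones from the Nexus CR.
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: nexus3
allowPrivilegedContainer: false
runAsUser:
  type: MustRunAs
  uid: 200            # the UID the community nexus3 image runs as
fsGroup:
  type: MustRunAs
seLinuxContext:
  type: MustRunAs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nexus3-scc
rules:
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  resourceNames: ["nexus3"]
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nexus3-scc
  namespace: nexus-namespace   # same namespace as the Nexus CR
subjects:
- kind: ServiceAccount
  name: nexus3
  namespace: nexus-namespace
roleRef:
  kind: ClusterRole
  name: nexus3-scc
  apiGroup: rbac.authorization.k8s.io
```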

Some considerations:

  • design based on this doc
  • it would be awesome if you could also test this with a custom Service Account, we need to make sure this works with both the default one we create and with a user-supplied one
  • it would be awesome if you could also check that the Role Binding is not created when using a non-community image, and that the SCC and the Cluster Role are created regardless
  • we don't use a Cluster Role Binding because the Service Account would then be able to use the SCC in any namespace, which is not the intended usage. Keeping it as a Role Binding ensures the SA can only use that SCC in its own namespace. See this doc: after the default roles table there is an explanation of this behavior

@ricardozanini
Member

Can you please generate a nexus-operator.yaml file and an image with this PR?

@LCaparelli

This comment has been minimized.

@ricardozanini
Member

Thanks!

@ricardozanini
Member

ricardozanini commented Oct 23, 2020

@LCaparelli next time please use pastebin or gist to share the YAML 😛

I got:

Status:
  Deployment Status:
  Nexus Status:  Failure
  Reason:        Failed to deploy Nexus: could not fetch Security Context Constraint (nexus-issue-179/nexus3): no kind is registered for the type v1.SecurityContextConstraints in scheme "pkg/runtime/scheme.go:101"

You might need to register this API Schema with our client :)

Logs:

tch Security Context Constraint (nexus-issue-179/nexus3): no kind is registered for the type v1.SecurityContextConstraints in scheme \"pkg/runtime/scheme.go:101\""}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:246
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
2020-10-23T21:48:53.390Z	INFO	controllers.Nexus	Reconciling Nexus
2020-10-23T21:48:53.419Z	DEBUG	Fetching the latest micro from minor	{"MinorVersion": 28}
2020-10-23T21:48:53.420Z	DEBUG	Replacing 'spec.image'	{"OldImage": "docker.io/sonatype/nexus3:3.28.1", "NewImage": "docker.io/sonatype/nexus3:3.28.1"}
2020-10-23T21:48:53.504Z	INFO	Generating required resources
2020-10-23T21:48:53.504Z	DEBUG	Generating required resource	{"kind": "Deployment"}
2020-10-23T21:48:53.504Z	DEBUG	Generating required resource	{"kind": "Service"}
2020-10-23T21:48:53.504Z	DEBUG	Generating required resource	{"kind": "Service Account"}
2020-10-23T21:48:53.504Z	DEBUG	Generating required resource	{"kind": "Secret"}
2020-10-23T21:48:53.504Z	DEBUG	Generating required resource	{"kind": "Security Context Constraint"}
2020-10-23T21:48:53.504Z	DEBUG	Generating required resource	{"kind": "Cluster Role"}
2020-10-23T21:48:53.504Z	INFO	Fetching deployed resources
2020-10-23T21:48:53.504Z	INFO	Attempting to fetch deployed resource	{"kind": "Deployment", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-23T21:48:53.504Z	DEBUG	Unable to find resource	{"kind": "Deployment", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-23T21:48:53.504Z	INFO	Attempting to fetch deployed resource	{"kind": "Service", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-23T21:48:53.505Z	DEBUG	Unable to find resource	{"kind": "Service", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-23T21:48:53.505Z	INFO	Attempting to fetch deployed resource	{"kind": "Persistent Volume Claim", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-23T21:48:53.505Z	DEBUG	Unable to find resource	{"kind": "Persistent Volume Claim", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-23T21:48:53.505Z	INFO	Attempting to fetch deployed resource	{"kind": "Role Binding", "namespacedName": "nexus-issue-179/nexus3"}
E1023 21:48:53.507358       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.RoleBinding: failed to list *v1.RoleBinding: rolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "rolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E1023 21:48:55.046761       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.RoleBinding: failed to list *v1.RoleBinding: rolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "rolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E1023 21:48:56.799006       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.RoleBinding: failed to list *v1.RoleBinding: rolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "rolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E1023 21:49:01.578523       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.RoleBinding: failed to list *v1.RoleBinding: rolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "rolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E1023 21:49:13.917169       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.RoleBinding: failed to list *v1.RoleBinding: rolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "rolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope
E1023 21:49:38.947388       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.RoleBinding: failed to list *v1.RoleBinding: rolebindings.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "rolebindings" in API group "rbac.authorization.k8s.io" at the cluster scope

We might be missing this role definition?
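The forbidden errors above suggest the operator's ClusterRole is missing those verbs; a rule along these lines (a sketch, not the project's actual manifest) would grant them:

```yaml
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["rolebindings"]
  verbs: ["create", "get", "list", "update", "watch"]
```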

@LCaparelli
Member Author

@ricardozanini Added the list verb for the SCC, ClusterRole and RoleBinding rules. Also registered the SCC API group in main's init().

Pushed a new image as well, just run:

kubectl apply -f https://gist.githubusercontent.com/LCaparelli/2af410ec9bb6e1248775eb970925da17/raw/2a32d8e095c56bf452ff231e200099e57e38fccc/nexus-operator.yaml
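For context, registering the OpenShift security/v1 types with the controller-runtime scheme usually looks something like this (a sketch assuming the github.com/openshift/api module; the actual code in the PR may differ):

```go
package main

import (
	securityv1 "github.com/openshift/api/security/v1"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

var scheme = runtime.NewScheme()

func init() {
	// Without this, client reads/writes of SecurityContextConstraints fail with
	// "no kind is registered for the type v1.SecurityContextConstraints".
	utilruntime.Must(securityv1.AddToScheme(scheme))
}
```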

@ricardozanini
Member

ricardozanini commented Oct 24, 2020

Still failing. Did you add the SecurityContextConstraints object to the role?

oc describe clusterrole nexus-operator-manager-role
Name:         nexus-operator-manager-role
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"creationTimestamp":null,"name":"nexus-oper...
PolicyRule:
  Resources                              Non-Resource URLs  Resource Names  Verbs
  ---------                              -----------------  --------------  -----
  events                                 []                 []              [create delete get list patch update watch]
  persistentvolumeclaims                 []                 []              [create delete get list patch update watch]
  secrets                                []                 []              [create delete get list patch update watch]
  serviceaccounts                        []                 []              [create delete get list patch update watch]
  services                               []                 []              [create delete get list patch update watch]
  nexus.apps.m88i.io                     []                 []              [create delete get list patch update watch]
  deployments.apps                       []                 []              [create delete get list patch update watch]
  ingresses.networking.k8s.io            []                 []              [create delete get list patch update watch]
  routes.route.openshift.io              []                 []              [create delete get list patch update watch]
  clusterrole.rbac.authorization.k8s.io  []                 []              [create get list update watch]
  rolebinding.rbac.authorization.k8s.io  []                 []              [create get list update watch]
  scc.security.openshift.io              []                 []              [create get list update watch]
  configmaps                             []                 []              [create get]
  servicemonitors.monitoring.coreos.com  []                 []              [create get]
  nexus.apps.m88i.io/status              []                 []              [get patch update]
  pods                                   []                 []              [get]
  replicasets.apps                       []                 []              [get]
  deployments.apps/finalizers            []                 []              [update]

Error:

E1024 12:33:15.462604       1 reflector.go:127] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:156: Failed to watch *v1.SecurityContextConstraints: failed to list *v1.SecurityContextConstraints: securitycontextconstraints.security.openshift.io is forbidden: User "system:serviceaccount:nexus-operator-system:default" cannot list resource "securitycontextconstraints" in API group "security.openshift.io" at the cluster scope

Not sure if scc.security.openshift.io and SecurityContextConstraints are the same thing:

oc get crds | grep security
securitycontextconstraints.security.openshift.io            2020-07-12T05:09:09Z

:)
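They are the same resource: scc is only a short name understood by kubectl/oc, while RBAC rules must use the full lowercase plural resource name. Note the describe output above also shows singular clusterrole and rolebinding, which RBAC will not match either. Corrected rules would look like this (a sketch, verbs taken from the output above):

```yaml
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  verbs: ["create", "get", "list", "update", "watch"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["clusterroles", "rolebindings"]
  verbs: ["create", "get", "list", "update", "watch"]
```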

@ricardozanini
Member

ricardozanini commented Oct 24, 2020

After applying the suggestions locally, the operator pod just dies here:

2020-10-24T12:47:45.970Z	INFO	controllers.Nexus	Reconciling Nexus
2020-10-24T12:47:45.988Z	DEBUG	Automatic Updates are enabled, but no minor was informed. Fetching the most recent...
2020/10/24 12:47:45 registry.ping url=https://registry.hub.docker.com/v2/
2020-10-24T12:47:48.339Z	INFO	Registry: registry.tags url=[https://registry.hub.docker.com/v2/sonatype/nexus3/tags/list sonatype/nexus3] repository=%!s(MISSING)
2020-10-24T12:47:50.127Z	DEBUG	Fetching the latest micro from minor	{"MinorVersion": 28}
2020-10-24T12:47:50.127Z	DEBUG	Replacing 'spec.image'	{"OldImage": "docker.io/sonatype/nexus3", "NewImage": "docker.io/sonatype/nexus3:3.28.1"}
2020-10-24T12:47:50.134Z	INFO	Generating required resources
2020-10-24T12:47:50.134Z	DEBUG	Generating required resource	{"kind": "Deployment"}
2020-10-24T12:47:50.134Z	DEBUG	Generating required resource	{"kind": "Service"}
2020-10-24T12:47:50.134Z	DEBUG	Generating required resource	{"kind": "Service Account"}
2020-10-24T12:47:50.134Z	DEBUG	Generating required resource	{"kind": "Secret"}
2020-10-24T12:47:50.134Z	DEBUG	Generating required resource	{"kind": "Security Context Constraint"}
2020-10-24T12:47:50.134Z	DEBUG	Generating required resource	{"kind": "Cluster Role"}
2020-10-24T12:47:50.134Z	INFO	Fetching deployed resources
2020-10-24T12:47:50.134Z	INFO	Attempting to fetch deployed resource	{"kind": "Deployment", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-24T12:47:50.134Z	DEBUG	Unable to find resource	{"kind": "Deployment", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-24T12:47:50.134Z	INFO	Attempting to fetch deployed resource	{"kind": "Service", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-24T12:47:50.134Z	DEBUG	Unable to find resource	{"kind": "Service", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-24T12:47:50.134Z	INFO	Attempting to fetch deployed resource	{"kind": "Persistent Volume Claim", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-24T12:47:50.134Z	DEBUG	Unable to find resource	{"kind": "Persistent Volume Claim", "namespacedName": "nexus-issue-179/nexus3"}
2020-10-24T12:47:50.134Z	INFO	Attempting to fetch deployed resource	{"kind": "Secret", "namespacedName": "nexus-issue-179/nexus3"}

Maybe a problem during SCC creation/config.

@ricardozanini
Member

Since this object MUST be created in every OpenShift environment, we should instead ship a nexus-operator-ocp.yaml file with the required objects and let the installer create the SCC.

You can't treat an SCC like a usual k8s object since it's cluster-wide. Our controller can't own it: if you have more than one Nexus instance, it will fail to manage it...

@LCaparelli
Member Author

Yeah, that seems to be the way to go. Alternatively, we could try to create it outside of the reconcile loop, perhaps during startup, but... honestly, I think including it with the installation manifests is better.

The problem here was assuming that, since the operator is now cluster-scoped, we should be able to manage cluster-scoped resources from the reconcile loop. But any object created during the reconcile is owned by the Nexus instance being reconciled, and since Nexus is a namespace-scoped resource, it fails with:

2020-10-26T18:16:16.811-0300 ERROR controller Reconciler error {"reconcilerGroup": "apps.m88i.io", "reconcilerKind": "Nexus", "controller": "nexus", "name": "nexus3", "namespace": "myproject", "error": "cluster-scoped resource must not have a namespace-scoped owner, owner's namespace myproject"}

Too bad, it would have been really nice to fully manage the resources from the reconcile loop.

I'll make the necessary changes, need to do some reading on how kustomize works and see what I can come up with.
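For what it's worth, kustomize makes this kind of split fairly cheap: a base holds the common manifests and an OCP-only overlay adds the SCC bits. A hypothetical layout (paths and file names are illustrative, not the repo's actual structure):

```yaml
# config/overlays/ocp/kustomization.yaml (hypothetical path)
resources:
- ../../default          # the existing cross-platform manifests
- scc.yaml               # OCP-only: the SecurityContextConstraints
- scc_cluster_role.yaml  # OCP-only: ClusterRole granting "use" on the SCC
```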

@ricardozanini
Member

I think this error that I see on OCP is related to #187 :)

Nevertheless, we should add a scc-openshift.yaml file to our release page.

@ricardozanini ricardozanini modified the milestones: v0.4.0, v0.5.0 Nov 9, 2020
@LCaparelli
Member Author

LCaparelli commented Nov 17, 2020

After a lot of research, I believe we should go ahead with creating the SCC and ClusterRole during installation (as opposed to during reconciliation) with some caveats.

I've done some testing, and we could create these resources in the reconcile loop, but then they won't have an ownerRef. This is a big deal: they won't be garbage collected, since garbage collection relies on that field to work.

This would be very similar to what we already have with the many necessary resources present in operator.yaml: we set no explicit ownership on them, and administrators must stay on their toes not to leave orphan resources behind. So why not create them as owner-less resources during the reconcile loop, if that's already similar to what we're doing?

It's not that simple: although we don't set an owner ourselves, when installed via OLM (which should really be the canonical way to install) the CSV becomes the owner of the resources in operator.yaml. When uninstalling the operator via OLM, these owner-less resources would be left behind.

If we were to simply create the SCC in the reconcile loop, how could we tell for sure whether a valid owner for it exists? Coupling the reconcile logic with this seems awkward; these resources are not related to business logic at all, they're just infrastructure scaffolding.

Handling this at installation time is not super pretty either, but I believe it's the lesser of two evils. Putting the resources directly in operator.yaml runs into two issues:

  1. SCCs only exist on OCP, so this would lead to an error on K8s installations (a harmless error, surely, but still)
  2. The ClusterRole is only necessary on OCP (it links service accounts to SCCs); it makes no sense to have it on K8s

Both of these are pretty harmless, but we could fix them by serving two different CSVs (or operator.yaml without OLM -- say operator-ocp.yaml), one for OCP and one for K8s. Seems like a lot of work too, but hopefully some smart hacking of existing scripts can serve this purpose.

@ricardozanini wdyt?

@ricardozanini
Member

@LCaparelli we can easily clean up the orphan resources with finalizers. I'd say we register a finalizer on our Nexus CR to check whether it's the last one in the cluster; if it is, we delete the SCC. That way we can do everything programmatically and won't need to keep separate YAML files.

See: https://www.openshift.com/blog/kubernetes-operators-best-practices

If resources are not owned by the CR controlled by your operator but action needs to be taken when that CR is deleted, you must use a finalizer.
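The string bookkeeping behind that pattern is simple. A minimal, self-contained sketch of the helpers an operator typically uses to add and remove its finalizer (the finalizer name here is hypothetical, not one the project defines):

```go
package main

import "fmt"

// containsString reports whether s is present in slice.
func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

// removeString returns a copy of slice with every occurrence of s removed.
func removeString(slice []string, s string) []string {
	result := make([]string, 0, len(slice))
	for _, item := range slice {
		if item != s {
			result = append(result, item)
		}
	}
	return result
}

// Hypothetical finalizer name for illustration.
const nexusFinalizer = "nexus.apps.m88i.io/finalizer"

func main() {
	finalizers := []string{}
	// On create/update: ensure our finalizer is present so deletion is blocked
	// until cleanup (e.g. removing the shared SCC) has run.
	if !containsString(finalizers, nexusFinalizer) {
		finalizers = append(finalizers, nexusFinalizer)
	}
	fmt.Println(finalizers) // [nexus.apps.m88i.io/finalizer]
	// On delete: run cleanup, then drop the finalizer so the CR can go away.
	finalizers = removeString(finalizers, nexusFinalizer)
	fmt.Println(finalizers) // []
}
```

In the reconcile loop these helpers would be driven by the CR's DeletionTimestamp: absent means ensure the finalizer, present means clean up and remove it.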

@ricardozanini ricardozanini modified the milestones: v0.5.0, v0.6.0 Dec 9, 2020
@LCaparelli LCaparelli modified the milestones: v0.6.0, v0.7.0 Jun 5, 2021
Successfully merging this pull request may close these issues.

Add SCC to the Service Account in OCP