This is not a real PR, it's a collection of TODOs #46

Draft: wants to merge 1 commit into base: main
52 changes: 36 additions & 16 deletions internal/controller/shim_controller.go
@@ -38,10 +38,12 @@
"sigs.k8s.io/controller-runtime/pkg/predicate"
"sigs.k8s.io/controller-runtime/pkg/reconcile"

//TODO(flavio) we should not rename this import to kwasmv1

Check failure on line 41 in internal/controller/shim_controller.go (GitHub Actions / Golangci-lint): commentFormatting: put a space between `//` and comment text (gocritic)
kwasmv1 "github.com/spinkube/runtime-class-manager/api/v1alpha1"
)

const (
//TODO(flavio): probably should be renamed

Check failure on line 46 in internal/controller/shim_controller.go (GitHub Actions / Golangci-lint): commentFormatting: put a space between `//` and comment text (gocritic)
KwasmOperatorFinalizer = "kwasm.sh/finalizer"
INSTALL = "install"
UNINSTALL = "uninstall"
@@ -136,6 +138,11 @@

// 4. Deploy job to each node in list
if len(nodes.Items) > 0 {
// TODO(flavio): the handleInstallShim returns an error as soon as the
// creation of one of the CronJobs fails. We create one CronJob per node,
// hence failing on node 5/10 will cause this reconciliation loop to
// exit immediately. However, the shim finalizer has not been set yet;
// that happens at the end of this `if` block.
Review comment from Member (Author):
I would propose changing handleInstallShim to return an array of errors, so that all the nodes get a chance to have the shim installed on them.

Then OFC we need to ensure the finalizer is always set, even if there was an error

Review comment from @tillknuesting (Mar 16, 2024):
I agree that calling deployJobOnNode should happen concurrently, and potential errors should be aggregated and returned, to help implement strategies such as retries later on if needed.

Also, it would indeed make sense to call ensureFinalizerForShim even if an error happens:


    // Ensure the finalizer is called even if a return happens before
    defer func() {
        err := sr.ensureFinalizerForShim(ctx, &shimResource, KwasmOperatorFinalizer)
        if err != nil {
            log.Error().Msgf("Failed to ensure finalizer: %s", err)
        }
    }()
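
A minimal sketch of the aggregated-errors variant (illustrative only; it assumes Go 1.20+ `errors.Join`, hence the "errors" import, and reuses the names already present in this file):

    var errs []error
    for i := range nodes.Items {
        node := nodes.Items[i]
        if err := sr.deployJobOnNode(ctx, shim, node, INSTALL); err != nil {
            // Record the failure but keep going, so the remaining nodes
            // still get a chance to have the shim installed.
            errs = append(errs, fmt.Errorf("node %s: %w", node.Name, err))
        }
    }
    // nil when every node succeeded, otherwise a single joined error.
    return ctrl.Result{}, errors.Join(errs...)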

_, err = sr.handleInstallShim(ctx, &shimResource, nodes)
if err != nil {
return ctrl.Result{}, err
@@ -208,6 +215,7 @@
func (sr *ShimReconciler) handleInstallShim(ctx context.Context, shim *kwasmv1.Shim, nodes *corev1.NodeList) (ctrl.Result, error) {
log := log.Ctx(ctx)

// TODO(flavio): it would be nice to have constants describing the rollout strategy types
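// A hypothetical sketch of such constants (not part of this commit; only
// "rolling" is visible in this diff):
//   const RolloutStrategyTypeRolling = "rolling"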
switch shim.Spec.RolloutStrategy.Type {
case "rolling":
{
@@ -219,14 +227,20 @@
for i := range nodes.Items {
node := nodes.Items[i]

// TODO(flavio): it would be nice to have constants describing the shim status
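// A hypothetical sketch of such constants (not part of this commit; the
// values are the ones already used in this file):
//   const (
//       ShimStatusProvisioned = "provisioned"
//       ShimStatusPending     = "pending"
//       ShimStatusFailed      = "failed"
//   )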
shimProvisioned := node.Labels[shim.Name] == "provisioned"
shimPending := node.Labels[shim.Name] == "pending"

// TODO(flavio): use a switch - see below
if !shimProvisioned && !shimPending {
err := sr.deployJobOnNode(ctx, shim, node, INSTALL)
if err != nil {
return ctrl.Result{}, err
}
} else {
// TODO(flavio): we get here when the shim is either provisioned or pending,
// but the info message claims the shim is already provisioned on the node,
// which is not necessarily true -> we should use a switch
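// A hypothetical sketch of the suggested switch (not part of this commit):
//   switch {
//   case shimProvisioned:
//       log.Info().Msgf("Shim %s already provisioned on Node %s", shim.Name, node.Name)
//   case shimPending:
//       log.Info().Msgf("Shim %s provisioning still pending on Node %s", shim.Name, node.Name)
//   default:
//       // neither provisioned nor pending: deploy the install job
//   }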
log.Info().Msgf("Shim %s already provisioned on Node %s", shim.Name, node.Name)
}
}
@@ -237,6 +251,8 @@
}
}

// TODO(flavio): it looks like the rolling strategy is not implemented, am I right?

return ctrl.Result{}, nil
}

@@ -255,6 +271,7 @@

switch jobType {
case INSTALL:
// TODO(flavio): use const to track shim status
err := sr.updateNodeLabels(ctx, &node, shim, "pending")
if err != nil {
log.Error().Msgf("Unable to update node label %s: %s", shim.Name, err)
@@ -288,6 +305,7 @@
// We rely on controller-runtime to rate limit us.
if err := sr.Client.Patch(ctx, job, patchMethod, patchOptions); err != nil {
log.Error().Msgf("Unable to reconcile Job %s", err)
// TODO(flavio): maybe we should use a constant here
if err := sr.updateNodeLabels(ctx, &node, shim, "failed"); err != nil {
log.Error().Msgf("Unable to update node label %s: %s", shim.Name, err)
}
@@ -314,10 +332,6 @@
nameMax := int(math.Min(float64(len(name)), 63))

job := &batchv1.Job{
TypeMeta: metav1.TypeMeta{
Kind: "Job",
APIVersion: "batch/v1",
},
phyrog marked this conversation as resolved.
ObjectMeta: metav1.ObjectMeta{
Name: name[:nameMax],
Namespace: os.Getenv("CONTROLLER_NAMESPACE"),
@@ -347,6 +361,11 @@
},
}},
Containers: []corev1.Container{{
// TODO(flavio): use official container image - we probably have to make
// that somehow configurable - think about airgap environments
// TODO(flavio): Why is the operation part of the image name? IMHO the image should
// be versioned, and the action to perform could be chosen by using a
// different entrypoint
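// A hypothetical sketch of a configurable image (not part of this commit;
// the env var name is made up, the fallback is the current behaviour):
//   image := os.Getenv("NODE_INSTALLER_IMAGE")
//   if image == "" {
//       image = "voigt/kwasm-node-installer:" + operation
//   }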
Image: "voigt/kwasm-node-installer:" + operation,
Name: "provisioner",
SecurityContext: &corev1.SecurityContext{
Expand All @@ -358,6 +377,8 @@
Value: "/mnt/node-root",
},
{
// TODO(flavio): create issue to track definition of other fetch strategies
// and their implementation
Name: "SHIM_LOCATION",
Value: shim.Spec.FetchStrategy.AnonHTTP.Location,
},
@@ -370,6 +391,8 @@
Value: shim.Spec.RuntimeClass.Handler,
},
{
// TODO(flavio): this seems like an odd value for the fetch
// strategy
Name: "SHIM_FETCH_STRATEGY",
Value: "/mnt/node-root",
},
@@ -387,6 +410,9 @@
},
}

// TODO(flavio): why are we setting the owner reference only for INSTALL jobs?
// Shouldn't we also own the uninstall ones? They should not be owned by the shim,
// but could be owned by some other resource that is part of the controller
if operation == INSTALL {
if err := ctrl.SetControllerReference(shim, job, sr.Scheme); err != nil {
return nil, fmt.Errorf("failed to set controller reference: %w", err)
@@ -413,7 +439,6 @@
FieldManager: "shim-operator",
}

// Note that we reconcile even if the deployment is in a good state. We rely on controller-runtime to rate limit us.
if err := sr.Client.Patch(ctx, runtimeClass, patchMethod, patchOptions); err != nil {
log.Error().Msgf("Unable to reconcile RuntimeClass %s", err)
return ctrl.Result{}, fmt.Errorf("failed to reconcile RuntimeClass: %w", err)
@@ -433,14 +458,9 @@
}

runtimeClass := &nodev1.RuntimeClass{
TypeMeta: metav1.TypeMeta{
Kind: "RuntimeClass",
APIVersion: "node.k8s.io/v1",
},
ObjectMeta: metav1.ObjectMeta{
Name: name[:nameMax],
Namespace: os.Getenv("CONTROLLER_NAMESPACE"),
Labels: map[string]string{name[:nameMax]: "true"},
Name: name[:nameMax],
Labels: map[string]string{name[:nameMax]: "true"},
phyrog marked this conversation as resolved.
},
Handler: shim.Spec.RuntimeClass.Handler,
Scheduling: &nodev1.Scheduling{
@@ -458,8 +478,7 @@
// handleDeleteShim deletes all possible child resources of a Shim. It will ignore NotFound errors.
func (sr *ShimReconciler) handleDeleteShim(ctx context.Context, shim *kwasmv1.Shim, nodes *corev1.NodeList) error {
// deploy uninstall job on every node in node list
for i := range nodes.Items {
node := nodes.Items[i]
for _, node := range nodes.Items {

Check failure on line 481 in internal/controller/shim_controller.go (GitHub Actions / Golangci-lint): unnecessary leading newline (whitespace)

if _, exists := node.Labels[shim.Name]; exists {
err := sr.deployJobOnNode(ctx, shim, node, UNINSTALL)
@@ -474,14 +493,15 @@
}

func (sr *ShimReconciler) getNodeListFromShimsNodeSelctor(ctx context.Context, shim *kwasmv1.Shim) (*corev1.NodeList, error) {
//TODO (flavio): probably eager optimization - do some pagination?

Check failure on line 496 in internal/controller/shim_controller.go (GitHub Actions / Golangci-lint): commentFormatting: put a space between `//` and comment text (gocritic)
Review comment from Member (Author):
I know of clusters with 100s of nodes

Reply:
Even with 5k nodes, it's likely less than 10MB; hence, it's probably not worth it.
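
For reference, if pagination were ever needed, controller-runtime exposes it through the client.Limit and client.Continue list options; a rough, hypothetical sketch (500 is an arbitrary page size, and the nil-NodeSelector branch is left out for brevity):

    nodes := &corev1.NodeList{}
    continueToken := ""
    for {
        page := &corev1.NodeList{}
        opts := []client.ListOption{
            client.MatchingLabels(shim.Spec.NodeSelector),
            client.Limit(500),
            client.Continue(continueToken),
        }
        if err := sr.List(ctx, page, opts...); err != nil {
            return nil, fmt.Errorf("failed to get node list: %w", err)
        }
        // Accumulate this page, then follow the continue token until the
        // API server reports there are no more pages.
        nodes.Items = append(nodes.Items, page.Items...)
        if page.Continue == "" {
            break
        }
        continueToken = page.Continue
    }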

nodes := &corev1.NodeList{}
if shim.Spec.NodeSelector != nil {
err := sr.List(ctx, nodes, client.InNamespace(shim.Namespace), client.MatchingLabels(shim.Spec.NodeSelector))
err := sr.List(ctx, nodes, client.MatchingLabels(shim.Spec.NodeSelector))
phyrog marked this conversation as resolved.
if err != nil {
return &corev1.NodeList{}, fmt.Errorf("failed to get node list: %w", err)
}
} else {
err := sr.List(ctx, nodes, client.InNamespace(shim.Namespace))
err := sr.List(ctx, nodes)
phyrog marked this conversation as resolved.
if err != nil {
return &corev1.NodeList{}, fmt.Errorf("failed to get node list: %w", err)
}