Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Deletion Modes] Introduce a managed field for each module in the Kyma CR Modules List #1566

Closed
4 tasks done
nesmabadr opened this issue May 21, 2024 · 7 comments · Fixed by #1652 or #1658
Closed
4 tasks done
Assignees
Labels
API Denotes that an issue is tied to the potential change of the API kind/feature Categorizes issue or PR as related to a new feature.

Comments

@nesmabadr
Copy link
Contributor

nesmabadr commented May 21, 2024

Description

In order to support deletion modes for modules, there should be a flag while enabling each module in the Kyma CR to indicate whether the module is managed or not.

Reasons

To support unmanaged modules.

Acceptance Criteria

  • Add optional required Managed field to each module in the Kyma CR modules list
  • The field should be boolean; default true
  • Write documentation in API docs
  • Update issue comment to ensure the newly introduced field is mandatory in new API version: Modularization API Upgrade #776 (comment)
    • see comments below

Feature Testing

No response

Testing approach

No response

Attachments

No response

@nesmabadr nesmabadr added the kind/feature Categorizes issue or PR as related to a new feature. label May 21, 2024
@nesmabadr nesmabadr changed the title [Module Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules list [Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules list May 21, 2024
@Tomasz-Smelcerz-SAP
Copy link
Member

Tomasz-Smelcerz-SAP commented May 23, 2024

I have two question marks here.

  1. How will we introduce this change? I'd opt for a solution that allows for smooth migration, instead of just deploying a new version and changing the API contract in an instant. It means we should have three values for the new flag: Ignore, Blocking and Legacy. Legacy means the deletion works as it is now. Of course we don't have to use third flag value, it is just an example - what we need is the "Legacy" mode, how we configure it, is a secondary issue.
    Benefits:

    • We can merge the PR anytime we want if the Legacy mode is default. We can even merge it without extending the docs or announcing anything - we're not breaking any contract.
    • We are more elastic in terms of migration scenarios. Some users / module teams may not be yet "ready" to switch to the new deletion model (for whatever reason), but others may adopt it faster. We may migrate "easy cases" fast and then focus on better supporting the few remaining "hard cases"
    • It is the best industry practice to introduce changes in a non-disruptive manner.

    Drawbacks:

    • Tiny implementation effort related to keeping a third deletion strategy (Legacy)
  2. If the "deletionMode" is a flag that exists on the module list in the Kyma CR, like this:

[...]
modules:
  - name: "foo"
    channel: "fast"
    deletionMode: "Ignore"

Then: is this field mutable?
If so, then what happens in the following case:

  • the User creates the Kyma CR with the module foo on the list having the deletionMode: Ignore, as in the example above.
  • the Kyma CR gets Ready
  • the User changes the deletionMode for the foo module to Blocking
  • after 100ms the API Server persists the changes in the ETCD, so the change is "done" from user's perspective, but the Lifecycle-Manager has not yet processed the object update event
  • the user immediately deletes the module from the list, essentially creating a second object update event. We have now two object update events for the same Kyma CR, the Lifecycle Manager hasn't processed any of these yet.

How should Lifecycle-Manager process this deletion? Using the Ignore strategy or the Blocking strategy?

@nesmabadr nesmabadr changed the title [Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules list [Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules and ModuleStatus May 28, 2024
@nesmabadr nesmabadr changed the title [Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules and ModuleStatus [Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules List and ModuleStatus May 28, 2024
@nesmabadr nesmabadr changed the title [Deletion Modes] Introduce a DeletionMode flag for each module in the Kyma CR Modules List and ModuleStatus [Deletion Modes] Introduce a managed field for each module in the Kyma CR Modules List Jun 5, 2024
@jeremyharisch jeremyharisch added the API Denotes that an issue is tied to the potential change of the API label Jun 27, 2024
@c-pius c-pius self-assigned this Jun 27, 2024
@c-pius
Copy link
Contributor

c-pius commented Jun 28, 2024

Regarding the field being optional with default true, it turns out that this is not entirely straight forward to implement. I tried various alternatives and we need to choose one.

Non-option

// +kubebuilder:default:=true
Managed bool `json:"managed,omitempty"`

This does NOT work. When setting managed: false, the omitempty will strip it from the object and when reading it, it is defaulting it back to true.

Option 1

// +kubebuilder:default:=true
Managed bool `json:"managed"`

This will make managed a mandatory field. It is still a compatible change since we provide the default true.

  • existing KymaCRs with enabled modules will be reconciled and have an explicit managed:true flag
  • when enabling modules without the flag being set, it will be defaulted to managed:true

Option 2

If we want to have a behavior where the field can be omitted, and a missing field defaulting to true, we would need to implement it as follows:

// note no default
Managed *bool `json:"managed,omitempty"`

func (m *Module) IsManaged() bool {
	if m.Managed == nil {
		return true
	}

	return *m.Managed
}

The empty value is a nil pointer. We use a helper method that defaults the nil pointer to true.

  • existing KymaCRs with enabled modules will remain to have managed unset (nil pointer), the IsManaged() function default this to true in the code (direct field access will give a nil pointer)
  • when enabling modules without the flag being set, the IsManaged() function default this to true in the code (direct field access will give a nil pointer)
  • when setting the field explicitly, the IsManaged() function will return the respective value

Personally, I would favor Option 1 as it is more explicit compared to internally handling an unset value / nil pointer as true. Also, we do want to make the attribute mandatory anyway.

@c-pius
Copy link
Contributor

c-pius commented Jun 28, 2024

Re. API compatibility of Option 2:

I think it should be good. I tested the behavior multiple times now. When shipping the new fields (updating lifecycle manager and Kyma CRD at the sime time), the KymaCR in KCP gets an entry with managed:true. The lifecycle-manager picks this up and updates CRD and CR in the SKR accordingly. Consequently, all KymaCRs will have a managed:true entry after we ship this update. And from behavior / semantic point of view we should be good as managed:true expresses the behavior that we supported so far without this field existing.

I also tried the same thing with keeping the current lifecycle-manager, and only updating the CRD. Not knowing the field, lifecycle-manager seems to ignore it (no crashes), but it syncs the CRD correctly which also leads to CRs having managed:true defaulted. Here we can further observe that when changing to managed:false, it will flap back to managed:true. This should be because of the behavior described for #4 below.

Looking at the literature... it is a bit tricky to be exactly sure... Here it mentions that fields can be added given that reasonable defaults are defined:

Adding a field

In a hypothetical API (assume we're at version v6), the Frobber struct looks
something like this:

// API v6.
type Frobber struct {
  Height int    `json:"height"`
  Param  string `json:"param"`
}

You want to add a new Width field. It is generally allowed to add new fields
without changing the API version
, so you can simply change it to:

// Still API v6.
type Frobber struct {
  Height int    `json:"height"`
  Width  int    `json:"width"`
  Param  string `json:"param"`
}

The onus is on you to define a sane default value for Width such that rules
#1 and #2 above are true - API calls and stored objects that used to work must
continue to work.


Here it however lists adding a required field as an incompatible change. However, I think we are good on points #1 and #2 (all API calls will remain to work as expected).

One point that may be problematic in theory: "4. It must be possible to round-trip your change (convert to different API versions and back) with no loss of information.". If we had a scenario where a component not aware of the new managed field, the following could happen: Module with managed: false is processed by old component, old component updates it in the cluster and doesn't send the managed field. Field is consequently updated to managed:true by the API server. Still, since we only have lifecycle-manager working on the resource and we update it the same time we ship the new field, I think this should not be a real concern. Also, I checked the resources in stage and prod and all of them are on v1beta2 already.

On compatibility

Before talking about how to make API changes, it is worthwhile to clarify what
we mean by API compatibility. Kubernetes considers forwards and backwards
compatibility of its APIs a top priority. Compatibility is hard, especially
handling issues around rollback-safety. This is something every API change
must consider.

An API change is considered compatible if it:

  • adds new functionality that is not required for correct behavior (e.g.,
    does not add a new required field
    )
  • does not change existing semantics, including:
    • the semantic meaning of default values and behavior
    • interpretation of existing API types, fields, and values
    • which fields are required and which are not
    • mutable fields do not become immutable
    • valid values do not become invalid
    • explicitly invalid values do not become valid

Put another way:

1. Any API call (e.g. a structure POSTed to a REST endpoint) that succeeded
before your change must succeed after your change.

2. Any API call that does not use your change must behave the same as it did
before your change.

3. Any API call that uses your change must not cause problems (e.g. crash or
degrade behavior) when issued against an API servers that do not include your
change.
4. It must be possible to round-trip your change (convert to different API
versions and back) with no loss of information.

5. Existing clients need not be aware of your change in order for them to
continue to function as they did previously, even when your change is in use.
6. It must be possible to rollback to a previous version of API server that
does not include your change and have no impact on API objects which do not use
your change. API objects that use your change will be impacted in case of a
rollback.

If your change does not meet these criteria, it is not considered compatible,
and may break older clients, or result in newer clients causing undefined
behavior. Such changes are generally disallowed, though exceptions have been
made in extreme cases (e.g. security or obvious bugs).

Let's consider some examples.

@ruanxin
Copy link
Contributor

ruanxin commented Jun 28, 2024

summary:
it's capability to introduce this Managed filed as required to v1beta2:

  1. we configured default boolean value true to it, so all the existing CR have this value assigned.
  2. the default true not break the contract with the exiting CR, since all enabled module is managed.

@c-pius
Copy link
Contributor

c-pius commented Jul 3, 2024

@c-pius
Copy link
Contributor

c-pius commented Jul 5, 2024

Re-opening this as we have an edge case to clarify... A key assumption is, that we can never know when a reconciliation of the KymaCR will happen (e.g., due to delays of KLM) and that a KymaCR may be changed multiple times by the customer before the next reconciliation happens.

Problematic scenario: Given a customers sets a module to managed:false, and shortly after deletes the whole module entry from the KymaCR, both happening before the next reconciliation via KLM happens, we loose the information that the customer had set managed:false. Consequently, we will delete the module as a managed module and remove all resources from the SKR.

// T1 last reconciled state
spec:
  modules:
  - name: template-operator
    managed: true
  // ...
---
//T2 no reconciliation happened on this state
spec:
  modules:
  - name: template-operator
    managed: false
  // ...
---
//T3 next reconciliation after the last one at T1
spec:
  modules:
  // ...

When KLM fetches the current state at T3, we only see that the module is gone, but we don't know that it had been previously set to managed: false.

TODOs

  • clarify whether this is an edge case that we can accept
  • come up with a mitigation

@c-pius c-pius reopened this Jul 5, 2024
@c-pius
Copy link
Contributor

c-pius commented Jul 5, 2024

  • come up with a mitigation

discussed with the team to keep this field and go forward with the approach defined in #1427

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Denotes that an issue is tied to the potential change of the API kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
7 participants