Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancements/machine-config: add updates to PinnedImageSet #1599

Closed

Conversation

hexfusion
Copy link
Contributor

@hexfusion hexfusion commented Mar 15, 2024

This PR is a follow-up to #1481 adding changes to the planned v1alpha1 API and TechPreview 4.16 implementation .

@openshift-ci openshift-ci bot requested review from kbsingh and runcom March 15, 2024 15:19
@hexfusion
Copy link
Contributor Author

/cc @yuqi-zhang @sinnykumari @cdoern

Copy link
Contributor

@cdoern cdoern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, the implementation makes sense to me and will not result in any major outward facing MCO changes! thanks for updating this

Copy link
Contributor

openshift-ci bot commented Mar 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cdoern
Once this PR has been reviewed and has the lgtm label, please assign stbenjam for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Mar 16, 2024

@hexfusion: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

`certificate_writer`. This new feature will operate independently of the
`MachineConfig` for setting configurations. It is advantageous to not use
`MachineConfig` because the configuration and image prefetching can happen
across the nodes in the pool in parallel. The new controller will be watching
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we are introducing PinnedImageSetController , which new controller we are referring to here?

Copy link
Contributor Author

@hexfusion hexfusion Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pinned_image_manager will clarify if its not MCD itself

Copy link
Contributor

@sinnykumari sinnykumari Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, it was a typo. I meant to ask, Now that we are no longer adding PinnedImageSetController , which new controller we are referring to here?

`pinned_image_manager` will:

1. Begin by marking the node with an annotation to indicate the manager is
`Working` utilizing the `nodeWriter`, similar to the `MachineConfigDaemon` `update`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: For better troubleshooting, In MCO implementation, wherever possible we would want to add some logging to make sure MCP status update is happening via pinned_image_manager.

selectors in the `PinnedImageSet` custom resources.
Once all the relevant images are successfully pinned and downloaded to the
matching nodes, the `pinned_image_manager` will signal the completion of the
process by invoking nodeWriter.SetDone(). This action notifies the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am overthinking. Calling nodeWriter.SetDone() would mean that we are creating two sources where anode is marked as update completed, so we need a way to ensure that we don't do that while MCD is also performing an update due to rendered-config change. Probably, this isn't an issue for targeted usecase but making this a general feature would require better thought process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will reword there is just a single sync of MCD so Done will only be applied by MCD. I was trying to convey that no additional mechanisms were required. But how that will work was not described will flesh that out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a similar question here, I guess we're sort of introducing an inter-dependency between the "machineconfig update" and the "pinnedimageset update" which, reading the controller code, shouldn't conflict unless there is a degrade. But this does sort of raise some clarifications such as:

  1. do we want one's failure to block the other
  2. if you roll out one pinnedimageset and then try to roll out a second one (i.e. append another set to the list in the MCP) should we be waiting for the first to finish
  3. if the daemon is working on a node, should it not reboot the node unless pinnedimageset is able to finish
  4. does the done state otherwise affect node updates
  5. how does the pinnedimageset controller actually know if the done state is reacting to the current set of pinnedimageset changes vs, say, the daemon is dead and hasn't reacted to the latest request yet?

@@ -17,7 +17,7 @@ api-approvers:
- "@deads2k"
- "@JoelSpeed"
creation-date: 2023-09-21
last-updated: 2023-09-21
last-updated: 2023-03-15
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wrong year ;)

...
```
The PinnedImageSet is closely linked with the `MachineConfigPool`, and each
Custom Resource (CR) can be associated with a pool at the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you can reference the same pinnedimageset across different pools?

those is created, updated or deleted the controller will start a daemon set that
will do the following in each node of the cluster:
The _machine-config-daemon_ will grow a new `pinned_image_manager` utilizing
the same general flow as the existing `MAchineConfigDaemon` `certificate_writer`. This approach is not dependent on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: MAchineConfigDaemon -> MachineConfigDaemon

selectors in the `PinnedImageSet` custom resources.
Once all the relevant images are successfully pinned and downloaded to the
matching nodes, the `pinned_image_manager` will signal the completion of the
process by invoking nodeWriter.SetDone(). This action notifies the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a similar question here, I guess we're sort of introducing an inter-dependency between the "machineconfig update" and the "pinnedimageset update" which, reading the controller code, shouldn't conflict unless there is a degrade. But this does sort of raise some clarifications such as:

  1. do we want one's failure to block the other
  2. if you roll out one pinnedimageset and then try to roll out a second one (i.e. append another set to the list in the MCP) should we be waiting for the first to finish
  3. if the daemon is working on a node, should it not reboot the node unless pinnedimageset is able to finish
  4. does the done state otherwise affect node updates
  5. how does the pinnedimageset controller actually know if the done state is reacting to the current set of pinnedimageset changes vs, say, the daemon is dead and hasn't reacted to the latest request yet?

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2024
@hexfusion
Copy link
Contributor Author

/close will open new PR with updates

@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 17, 2024
@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this May 24, 2024
Copy link
Contributor

openshift-ci bot commented May 24, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants