Delay starting zone-network-service until in.ndpd is online #6982

Open
wants to merge 1 commit into
base: main

citrus-it
Contributor

@citrus-it citrus-it commented Nov 1, 2024

There is a race between configuring an addrconf address on an
interface and the NDP daemon (in.ndpd) starting up. If the two
start in parallel, interface configuration can fail, leaving the
zone broken. There is a fix in stlouis that handles the conflict
better, but we should also simply wait for the NDP service to come
online before attempting to configure interfaces.

Fixes #6947 (along with stlouis#650)
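
For illustration only, the "wait for the NDP service before configuring interfaces" idea can be sketched as a bounded polling loop. This is not the change made in this PR, which may well express the ordering as a service dependency rather than in code. The function names `wait_until_online` and `smf_state_is_online` are hypothetical, and the FMRI `svc:/network/routing/ndp:default` is an assumption about how in.ndpd is registered on the target system.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `is_online` until it returns true or
/// `timeout` elapses. Returns true if the check succeeded in time.
fn wait_until_online<F: FnMut() -> bool>(
    mut is_online: F,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if is_online() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

/// Hypothetical helper: asks `svcs` for the state of an SMF service and
/// reports whether it is "online". Any failure to run the command is
/// treated as "not online".
#[allow(dead_code)]
fn smf_state_is_online(fmri: &str) -> bool {
    Command::new("svcs")
        .args(["-H", "-o", "state", fmri])
        .output()
        .map(|out| {
            out.status.success()
                && String::from_utf8_lossy(&out.stdout).trim() == "online"
        })
        .unwrap_or(false)
}
```

A caller might gate interface configuration on something like `wait_until_online(|| smf_state_is_online("svc:/network/routing/ndp:default"), Duration::from_secs(30), Duration::from_millis(500))` and fail loudly if it returns false, rather than proceeding and hitting the race.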

@jclulow
Collaborator

jclulow commented Nov 3, 2024

Are we sure the OS bug is fixed, or will this just paper over it and we won't know for sure?

Contributor

@karencfv karencfv left a comment

Thanks for the fix!

@jclulow I think the OS fix is this one https://code.oxide.computer/c/illumos-gate/+/477

@citrus-it
Contributor Author

citrus-it commented Nov 4, 2024

> Are we sure the OS bug is fixed, or will this just paper over it and we won't know for sure?

My perspective is that this is the true fix, with the OS change resolving a race condition that shouldn't be invoked any more. I've spun up the control plane several times with just the OS change in place and not seen any zones fail to come up, but let's defer integrating this PR until after the next dogfood update just to add confidence in that.

@morlandi7 morlandi7 added this to the 12 milestone Nov 4, 2024

Successfully merging this pull request may close these issues.

crucible zone in rack2 found offline after zone setup failure
5 participants