crucible zone in rack2 found offline after zone setup failure #6947
After a lot of debugging with @rcgoodfellow and @ahl, we believe this is caused by https://www.illumos.org/issues/15782. The crux of the issue is that:
We've put a core from
That causes this code in
Most of the state here is cleared, but the
Again, we're not sure how the system got into this state, but we are pretty confident that there is some race that prevents
A simple repro for this issue is:

#!/bin/bash
set -e
i=0
while true; do
svcadm restart ndp
ipadm create-addr -t -T addrconf vnic0/ll
ipadm delete-if vnic0
i=$((i+1))
printf "\r$i"
done

This reproduces as of helios-2.0.22997.
There is a race condition between configuring an addrconf address on an interface and the NDP daemon starting up. If the two happen in parallel, there is a chance that configuring network interfaces will fail, resulting in a broken zone. There is a fix in stlouis that handles the conflict better, but we should also wait for the NDP service to come up before attempting to configure interfaces. Fixes #6947 (along with stlouis#650)
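To illustrate the "wait for NDP before configuring interfaces" idea, here is a minimal shell sketch. It is not the actual change made for this issue: the helper name `wait_for_online`, the 30-second default timeout, and the use of the short service name `ndp` (taken from the repro script) are assumptions for illustration only. It polls the SMF state with `svcs -H -o state` and only returns success once the service reports `online`.

```shell
#!/bin/bash
# Hypothetical helper (not from the real fix): poll SMF until the given
# service reports "online", or give up after a timeout in seconds.
wait_for_online() {
    local fmri="$1"
    local timeout="${2:-30}"   # assumed default; tune as needed
    local waited=0

    # `svcs -H -o state` prints just the state column with no header.
    while [ "$(svcs -H -o state "$fmri" 2>/dev/null)" != "online" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $fmri to come online" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use (commented out so that sourcing this file has no side effects):
#   wait_for_online ndp 30 && ipadm create-addr -t -T addrconf vnic0/ll
```

Gating the `ipadm create-addr` call on the service state closes the window in which the address is configured while the NDP daemon is still starting. It does not replace the more robust conflict handling mentioned for stlouis.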
I had a provision hang: dogfood instance c49820df-25ea-4c0f-aeb2-5e16980b6e84, saga 311f4acf-798d-435e-9f1c-d8ef75977aac, which is apparently stuck in regions_ensure (based on it being node 173 that has started but not finished and looking in the saga_dag). From the Nexus log, this went bad around here:
That zone did not come up properly:
That log file is rotated out and there are subsequent empty ones but here's the last non-empty one:
It seems like there are a few issues here:
ipadm error -- I'm not sure what this was
I'm not sure if these are issues: