Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crucible zone in rack2 found offline after zone setup failure #6947

Open
davepacheco opened this issue Oct 29, 2024 · 2 comments · May be fixed by #6982
Open

crucible zone in rack2 found offline after zone setup failure #6947

davepacheco opened this issue Oct 29, 2024 · 2 comments · May be fixed by #6982

Comments

@davepacheco
Copy link
Collaborator

I had a provision hang: dogfood instance c49820df-25ea-4c0f-aeb2-5e16980b6e84, saga 311f4acf-798d-435e-9f1c-d8ef75977aac, which is apparently stuck in regions_ensure (based on it being node 173 that has started but not finished and looking in the saga_dag). From the Nexus log, this went bad around here:

16:28:57.487Z DEBG 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): recording saga event
    event_type = Started
    node_id = 173
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
16:28:57.509Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Region requested, not yet created. Retrying in 295.586665ms
    dataset = 180d466d-eb36-4546-8922-e52c4c076823
    file = nexus/src/app/crucible.rs:230
    region = 33001801-f722-4323-bf54-280f22ba27a8
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:28:57.561Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Region requested, not yet created. Retrying in 331.554523ms
    dataset = e238116d-e5cc-43d4-9c8a-6f138ae8a15d
    file = nexus/src/app/crucible.rs:230
    region = c0ca39f8-7fe0-492e-a382-936c1f33d5be
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:28:57.813Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): Region requested, not yet created. Retrying in 361.352408ms
    dataset = 180d466d-eb36-4546-8922-e52c4c076823
    file = nexus/src/app/crucible.rs:230
    region = 33001801-f722-4323-bf54-280f22ba27a8
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:29:12.498Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:10a::b]:32345/crucible/0/regions), retrying...
    file = common/src/progenitor_operation_retry.rs:121
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:29:12.498Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): failed external call (ProgenitorError(Communication Error: error sending request for url (http://[fd00:1122:3344:10a::b]:32345/crucible/0/regions))), will retry in 137.416942ms
    file = common/src/progenitor_operation_retry.rs:174
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create
16:29:27.644Z WARN 65a11c18-7f59-41ac-b9e7-680627f996e7 (ServerContext): saw transient communication error error sending request for url (http://[fd00:1122:3344:10a::b]:32345/crucible/0/regions), retrying...
    file = common/src/progenitor_operation_retry.rs:121
    saga_id = 311f4acf-798d-435e-9f1c-d8ef75977aac
    saga_name = instance-create

That zone did not come up properly:

BRM42220016 # svcs -Zxv
svc:/oxide/zone-network-setup:default (Oxide Zone Network Setup)
  Zone: oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c
 State: maintenance since Wed Oct 23 20:07:19 2024
Reason: Start method failed repeatedly, last exited with status 1.
   See: http://illumos.org/msg/SMF-8000-KS
   See: /pool/ext/86c58ea3-1413-4af3-9aff-9c0a3d758459/crypt/zone/oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c/root/var/svc/log/oxide-zone-network-setup:default.log
Impact: 1 dependent service is not running:
        svc:/oxide/crucible/agent:default

That log file is rotated out and there are subsequent empty ones but here's the last non-empty one:

BRM42220016 # cat /pool/ext/d4c6bdc6-5e99-4f6c-b57a-9bfcb9a76be4/crypt/debug/oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c/oxide-zone-network-setup:default.log.1729714039
[ Oct 23 20:07:09 Enabled. ]
[ Oct 23 20:07:09 Rereading configuration. ]
[ Oct 23 20:07:14 Rereading configuration. ]
[ Oct 23 20:07:14 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService10 -s fd00:1122:3344:10a::b -g fd00:1122:3344:10a::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.142871834Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907,"datalink":"oxControlService10"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.240552047Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907,"datalink":"oxControlService10"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.489292596Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.612879477Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:16.694530786Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":10907,"static address":"fd00:1122:3344:10a::b","data link":"\"oxControlService10\""}
zone-setup: failed to ensure static address fd00:1122:3344:10a::b on datalink oxControlService10

Caused by:
    Command [/usr/sbin/ipadm create-addr -t -T addrconf oxControlService10/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
[ Oct 23 20:07:18 Method "start" exited with status 1. ]
[ Oct 23 20:07:18 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService10 -s fd00:1122:3344:10a::b -g fd00:1122:3344:10a::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.731988794Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444,"datalink":"oxControlService10"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.803297256Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444,"datalink":"oxControlService10"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.908654527Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:18.952228803Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.012102346Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11444,"static address":"fd00:1122:3344:10a::b","data link":"\"oxControlService10\""}
zone-setup: failed to ensure static address fd00:1122:3344:10a::b on datalink oxControlService10

Caused by:
    Command [/usr/sbin/ipadm create-addr -t -T addrconf oxControlService10/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
[ Oct 23 20:07:19 Method "start" exited with status 1. ]
[ Oct 23 20:07:19 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService10 -s fd00:1122:3344:10a::b -g fd00:1122:3344:10a::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.478810164Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568,"datalink":"oxControlService10"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.520545249Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568,"datalink":"oxControlService10"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.588973381Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.617077438Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-10-23T20:07:19.659951889Z","hostname":"oxz_crucible_26fb3830-898e-4086-afaf-8f9654716b8c","pid":11568,"static address":"fd00:1122:3344:10a::b","data link":"\"oxControlService10\""}
zone-setup: failed to ensure static address fd00:1122:3344:10a::b on datalink oxControlService10

Caused by:
    Command [/usr/sbin/ipadm create-addr -t -T addrconf oxControlService10/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
[ Oct 23 20:07:19 Method "start" exited with status 1. ]

It seems like there are a few issues here:

  • the saga action doesn't log what the transient error was
  • whatever caused the ipadm error -- I'm not sure what this was

I'm not sure if these are issues:

  • Maybe we shouldn't retry forever when we know the user is waiting for something? I know this is tricky though because the saga needs a known result and in this case we don't know if something partially succeeded so we don't know if we need to clean it up.
  • Maybe we shouldn't try to use something that we could tell isn't working? But on the other hand there's always a time-of-check-to-time-of-use and it does seem better to just try regardless and handle it gracefully if it's failed. But this pattern interacts badly with the "the saga can't tell if it started anything".
  • Nothing in the system noticed this was a problem? But this is maybe just a special case of "we have no way to notify an operator when things aren't working".
@bnaecker
Copy link
Collaborator

After a lot of debugging with @rcgoodfellow and @ahl, we believe this is caused by https://www.illumos.org/issues/15782.

The crux of the issue is that:

  • zone-setup attempts to create an addrconf address over its data link: ipadm create-addr -t -T addrconf oxControlService0/ll
  • that makes a call into libipadm to create an interface and address object
  • the last step is "duplicate address detection", in which ipadm asks in.ndpd, then NDP daemon, to check that this address is available on the local network (no other directly-connected links have the same address)
  • through an as-yet-unexplained synchronization issue, in.ndpd believes it is already doing duplicate address detection on this address, which causes the ipadm create-addr call to fail in the way we see.

We've put a core from in.ndpd which proves the invalid state at /staff/core/omicron-6947/core.7593. From there, we can see that the phyint.pi_autoconf flag is set to true on an interface with the right name:

> phyints::print [0] ! grep -E '_name|_autoconf'
    [0].pi_name = [ "oxControlService10" ]
    [0].pi_autoconf = 0x1 (B_TRUE)

That causes this code in ndpd_create_addrs() to find that there already exists an interface and that autoconfiguration is already in progress for it. That's not correct. At this point, we had deleted the IP interface object, but in.ndpd still believes it is there:

root@oxz_crucible_26fb3830:~# ipadm show-if
IFNAME     CLASS     STATE    CURRENT      PERSISTENT
lo0        VIRTUAL   ok       -m-v------46 ---
oxControlService10 IP down    bm--------46 ---
root@oxz_crucible_26fb3830:~# ipadm delete-if oxControlService10
root@oxz_crucible_26fb3830:~# gcore 7593
gcore: core.7593 dumped
root@oxz_crucible_26fb3830:~# mdb core.7593
Loading modules: [ libc.so.1 ld.so.1 ]
> num_of_phyint::print
mdb: failed to dereference symbol: unknown symbol name
> num_of_phyints::print
0x1
> phyints::print [0]
[0] = {
    [0].pi_next = 0
    [0].pi_prev = 0
    [0].pi_prefix_list = 0
    [0].pi_router_list = 0
    [0].pi_adv_prefix_list = 0
    [0].pi_index = 0x2
    [0].pi_name = [ "oxControlService10" ]
    [0].pi_sock = 0xffffffff
    [0].pi_ifaddr = ::
    [0].pi_flags = 0x21002000840
    [0].pi_mtu = 0
    [0].pi_token = ::
    [0].pi_token_length = 0
    [0].pi_stateless = 0x1 (B_TRUE)
    [0].pi_stateful = 0x1 (B_TRUE)
    [0].pi_tmp_token = ::
    [0].pi_dst_token = ::
    [0].pi_state = 0
    [0].pi_kernel_state = 0
    [0].pi_num_k_routers = 0
    [0].pi_reach_time_since_random = 0
    [0].pi_adv_time_left = 0
    [0].pi_adv_time_since_sent = 0
    [0].pi_adv_state = 0 (NO_ADV)
    [0].pi_adv_count = 0
    [0].pi_sol_time_left = 0
    [0].pi_sol_state = 0 (0)
    [0].pi_sol_count = 0
    [0].pi_config = [
        {
            cf_value = 0x1
            cf_notdefault = 0 (0)
        },
        {
            cf_value = 0
            cf_notdefault = 0 (0)
        },
...

Most of the state here is cleared, but the pi_autoconf flag is not. That seems to be reused in error when we try to recreate the address object.

Again, we're not sure how the system got into this state, but we are pretty confident that there is some race that prevents in.ndpd from completely resetting its state, in particular the pi_autoconf field.

@rcgoodfellow
Copy link
Contributor

rcgoodfellow commented Oct 29, 2024

A simple repro for this issue is

#!/bin/bash

set -e

i=0

while true; do
    svcadm restart ndp
    ipadm create-addr -t -T addrconf vnic0/ll
    ipadm delete-if vnic0
    i=$((i+1))
    printf "\r$i"
done

This reproduces as of helios-2.0.22997

citrus-it added a commit that referenced this issue Nov 1, 2024
There is a race condition between being able to configure an
addrconf address on an interface, and the NDP daemon starting up.
If they start in parallel, there is a chance that configuring
network interfaces will fail resulting in a broken zone. There
is a fix to implement better handling of the conflict in stlouis
but we should also just wait for the NDP service to come up before
attempting to configure interfaces.

Fixes #6947 (along with stlouis#650)
citrus-it added a commit that referenced this issue Nov 1, 2024
There is a race condition between being able to configure an
addrconf address on an interface, and the NDP daemon starting up.
If they start in parallel, there is a chance that configuring
network interfaces will fail resulting in a broken zone. There
is a fix to implement better handling of the conflict in stlouis
but we should also just wait for the NDP service to come up before
attempting to configure interfaces.

Fixes #6947 (along with stlouis#650)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants