Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PUT /omicron-physical-disks timeout on rack3 #6904

Open
jgallagher opened this issue Oct 18, 2024 · 4 comments
Open

PUT /omicron-physical-disks timeout on rack3 #6904

jgallagher opened this issue Oct 18, 2024 · 4 comments

Comments

@jgallagher
Copy link
Contributor

Today while investigating an issue with a different sled, we briefly enabled the blueprint_executor task. PUT /omicron-physical-disks requests to sled 13 (BRM42220064) were consistently timing out after 60 seconds. The sled-agent log is at /staff/rack3/BRM42220064/2024-10-18/oxide-sled-agent-default.log. We see the dropshot "client disconnected" logs from client timeouts that are consistent with the 60sec timeout error we were seeing from Nexus, like:

john@jeeves /staff/rack3/BRM42220064/2024-10-18 $ grep '\[fd00:1122:3344:10e::3\]:35457' oxide-sled-agent-default.log  | looker
21:26:54.440Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105
    local_addr = [fd00:1122:3344:104::1]:12345
    remote_addr = [fd00:1122:3344:10e::3]:35457
21:27:54.483Z WARN SledAgent (dropshot (SledAgent)): request handling cancelled (client disconnected)
    file = /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:887
    latency_us = 60040091
    local_addr = [fd00:1122:3344:104::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:10e::3]:35457
    req_id = efaaabcd-36eb-49e2-881c-4de2f7b08340
    uri = /omicron-physical-disks
@smklein
Copy link
Collaborator

smklein commented Oct 18, 2024

Here's what I can see from the log:

21:27:03.488Z INFO SledAgent: physical disks ensure: Propagating new generation of disks                                                                                                      
    file = sled-agent/src/sled_agent.rs:894                                                                                                                                                   
    generation = Generation(1)                                                                                                                                                                
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: physical disks ensure: Updated storage monitor                                                                                                                  
    file = sled-agent/src/sled_agent.rs:899                                                                                                                                                   
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: physical disks ensure: Updated storage                                                                                                                          
    file = sled-agent/src/sled_agent.rs:868                                                                                                                                                   
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: physical disks ensure: Updated zone bundler                                                                                                                     
    file = sled-agent/src/sled_agent.rs:904                                                                                                                                                   
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent (StorageManager): Received GetLatestResources(...)                                                                                                               
    file = sled-storage/src/manager.rs:451                                                                                                                                                    
21:27:03.488Z INFO SledAgent: physical disks ensure: Propagating new generation of disks                                                                                                      
    file = sled-agent/src/sled_agent.rs:894                                                                                                                                                   
    generation = Generation(1)                                                                                                                                                                
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: physical disks ensure: Updated storage monitor                                                                                                                  
    file = sled-agent/src/sled_agent.rs:899                                                                                                                                                   
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: physical disks ensure: Updated zone bundler                                                                                                                     
    file = sled-agent/src/sled_agent.rs:904                                                                                                                                                   
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: use_only_these_disks: Ignoring request                                                                                                                          
    file = sled-agent/src/probe_manager.rs:115                                                                                                                                                
    last_gen = Generation(1)                                                                                                                                                                  
    requested_gen = Generation(1)                                                                                                                                                             
    sled_id = 4d6f10da-8fa0-465f-8fee-86dc174e2162                                                                                                                                            
21:27:03.488Z INFO SledAgent: physical disks ensure: Updated probes                                                                                                                           
    file = sled-agent/src/sled_agent.rs:910          

Which roughly corresponds with this sequence of code:

info!(self.log, "physical disks ensure: Propagating new generation of disks"; "generation" => ?our_gen);
// Ensure that the StorageMonitor, and the dump devices, have committed
// to start using new disks and stop using old ones.
self.inner.storage_monitor.await_generation(*our_gen).await?;
info!(self.log, "physical disks ensure: Updated storage monitor");
// Ensure that the ZoneBundler, if it was creating a bundle referencing
// the old U.2s, has stopped using them.
self.inner.zone_bundler.await_completion_of_prior_bundles().await;
info!(self.log, "physical disks ensure: Updated zone bundler");
// Ensure that all probes, at least after our call to
// "omicron_physical_disks_ensure", stop using any disks that
// may have been in-service from before that request.
self.inner.probes.use_only_these_disks(&latest_disks).await;
info!(self.log, "physical disks ensure: Updated probes");
// Do the same for instances - mark them failed if they were using
// expunged disks.
self.inner.instances.use_only_these_disks(latest_disks).await?;
info!(self.log, "physical disks ensure: Updated instances");

Note, however, that the last physical disks ensure: ... line we see is Updated probes. For whatever reason, we are not propagating this information through the instance manager, which is here:

// Do the same for instances - mark them failed if they were using
// expunged disks.
self.inner.instances.use_only_these_disks(latest_disks).await?;
info!(self.log, "physical disks ensure: Updated instances");

@smklein
Copy link
Collaborator

smklein commented Oct 18, 2024

The instance manager use_only_these_disks tries to inform instances which may be using U.2s that they should terminate immediately.

This sends a messages to the InstanceManager task:

pub async fn use_only_these_disks(
&self,
disks: AllDisks,
) -> Result<(), Error> {
let (tx, rx) = oneshot::channel();
self.inner
.tx
.send(InstanceManagerRequest::OnlyUseDisks { disks, tx })
.await
.map_err(|_| Error::FailedSendInstanceManagerClosed)?;
rx.await?
}

Which should be caught here:

Some(OnlyUseDisks { disks, tx } ) => {
self.use_only_these_disks(disks).await;
tx.send(Ok(())).map_err(|_| Error::FailedSendClientClosed)
},

and processed here:

async fn use_only_these_disks(&mut self, disks: AllDisks) {
// Consider the generation number on the incoming request to avoid
// applying old requests.
let requested_generation = *disks.generation();
if let Some(last_gen) = self.storage_generation {
if last_gen >= requested_generation {
// This request looks old, ignore it.
info!(self.log, "use_only_these_disks: Ignoring request";
"last_gen" => ?last_gen, "requested_gen" => ?requested_generation);
return;
}
}
self.storage_generation = Some(requested_generation);
info!(self.log, "use_only_these_disks: Processing new request";
"gen" => ?requested_generation);

However, I see no indication that we actually processed this message in the log. We sent similar requests to the zone_bundler, probe_manager, etc, and I can see all of those in the log.

Weirdly, I actually don't see any messages from the InstanceManager in this log.

@smklein
Copy link
Collaborator

smklein commented Oct 18, 2024

If we have more logs from this sled agent (were they rotated?) that would be extremely useful. I'm particularly interested in any messages from instance_manager.rs

@bnaecker
Copy link
Collaborator

Is it possible we're creating a zone bundle for an instance here, and the instance-manager is stuck on that for whatever reason? It seems like the ZoneBundler::await_completion_of_prior_bundles() should prevent that, but we've certainly seen that process take a long time in the past.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants