[nexus] Support Bundle background task #7063

Open
wants to merge 41 commits into base: support-bundle-simulated-implementation
Changes shown from 30 of 41 commits

Commits
410ff5b
Skeleton of background task
smklein Nov 13, 2024
2234301
less bp
smklein Nov 14, 2024
aeec097
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 14, 2024
11a88d6
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 25, 2024
01949c9
Merge UUID changes
smklein Nov 25, 2024
5d02715
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 25, 2024
8052bbe
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 25, 2024
744e351
Implement storage, use sled agent APIs
smklein Nov 26, 2024
f6f28a0
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 27, 2024
4feff4b
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 27, 2024
8b8fab7
update omdb output
smklein Nov 27, 2024
4f6d2d2
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Nov 28, 2024
0f8f750
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Dec 2, 2024
417cab9
update description
smklein Dec 2, 2024
d10e034
big one, should be split up
smklein Dec 5, 2024
3c5047d
Fixing clippy
smklein Dec 5, 2024
f9a2efd
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Dec 6, 2024
59612b0
Fix (and test) ordering, output paths
smklein Dec 6, 2024
366204f
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Dec 9, 2024
569a28c
Wire up bundles to blueprints, more tests
smklein Dec 10, 2024
83bdc2e
Support bundle re-assignment is no-op without any bundles to fail
smklein Dec 10, 2024
6661475
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Dec 10, 2024
dcf4bed
comment cleaning
smklein Dec 10, 2024
5cc668e
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Dec 16, 2024
51be611
Handle bundle cleanup from deleted zpool
smklein Dec 16, 2024
fa25cd2
Merge branch 'support-bundles-crdb' into support-bundle-bg-task
smklein Dec 17, 2024
80edd57
Make zpool_get_sled more paranoid
smklein Dec 17, 2024
900c670
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Dec 17, 2024
5f3f507
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Dec 19, 2024
a39f6a8
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Jan 6, 2025
8557ad0
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Jan 6, 2025
14d39fd
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Jan 6, 2025
917759d
config.test.toml
smklein Jan 6, 2025
7d166f1
Error logging
smklein Jan 6, 2025
bcf6f8c
Propagate unexpected errors during bundle activation
smklein Jan 7, 2025
76be221
remove printlns
smklein Jan 7, 2025
d23826e
Use BufReader for reading entries
smklein Jan 7, 2025
1baad89
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Jan 8, 2025
39394ff
Update to deal with structured support bundle APIs from sled agent
smklein Jan 8, 2025
53dcbe6
Use /var/tmp
smklein Jan 9, 2025
f40f1a4
Merge branch 'support-bundle-simulated-implementation' into support-b…
smklein Jan 9, 2025
3 changes: 3 additions & 0 deletions Cargo.lock

(Generated file; diff not rendered by default.)

12 changes: 12 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -162,6 +162,10 @@ task: "service_zone_nat_tracker"
ensures service zone nat records are recorded in NAT RPW table


task: "support_bundle_collector"
Manage support bundle collection and cleanup


task: "switch_port_config_manager"
manages switch port settings for rack switches

@@ -333,6 +337,10 @@ task: "service_zone_nat_tracker"
ensures service zone nat records are recorded in NAT RPW table


task: "support_bundle_collector"
Manage support bundle collection and cleanup


task: "switch_port_config_manager"
manages switch port settings for rack switches

@@ -491,6 +499,10 @@ task: "service_zone_nat_tracker"
ensures service zone nat records are recorded in NAT RPW table


task: "support_bundle_collector"
Manage support bundle collection and cleanup


task: "switch_port_config_manager"
manages switch port settings for rack switches

18 changes: 18 additions & 0 deletions dev-tools/omdb/tests/successes.out
@@ -380,6 +380,10 @@ task: "service_zone_nat_tracker"
ensures service zone nat records are recorded in NAT RPW table


task: "support_bundle_collector"
Manage support bundle collection and cleanup


task: "switch_port_config_manager"
manages switch port settings for rack switches

@@ -693,6 +697,13 @@ task: "service_zone_nat_tracker"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: inventory collection is None

task: "support_bundle_collector"
configured period: every <REDACTED_DURATION>s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: task disabled

task: "switch_port_config_manager"
configured period: every <REDACTED_DURATION>s
currently executing: no
@@ -1138,6 +1149,13 @@ task: "service_zone_nat_tracker"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: inventory collection is None

task: "support_bundle_collector"
configured period: every <REDACTED_DURATION>s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: task disabled

task: "switch_port_config_manager"
configured period: every <REDACTED_DURATION>s
currently executing: no
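
The "task disabled" error in the output above is what the collector reports when it is switched off in configuration (as the test config does). A minimal sketch of that short-circuit, assuming the usual Nexus background-task shape of an activation that returns a JSON status; the struct and field names below are illustrative, not the PR's exact implementation:

use serde_json::{json, Value};

// Illustrative only: a collector that refuses to run when disabled.
pub struct SupportBundleCollector {
    disable: bool,
}

impl SupportBundleCollector {
    pub async fn activate(&self) -> Value {
        if self.disable {
            // This is the "task disabled" error surfaced by omdb above.
            return json!({ "error": "task disabled" });
        }
        // ... clean up failed bundles, then collect any pending ones ...
        json!({ "cleanup_report": null, "collection_report": null })
    }
}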
23 changes: 23 additions & 0 deletions nexus-config/src/nexus_config.rs
@@ -371,6 +371,8 @@ pub struct BackgroundTaskConfig {
pub nat_cleanup: NatCleanupConfig,
/// configuration for inventory tasks
pub inventory: InventoryConfig,
/// configuration for support bundle collection
pub support_bundle_collector: SupportBundleCollectorConfig,
/// configuration for physical disk adoption tasks
pub physical_disk_adoption: PhysicalDiskAdoptionConfig,
/// configuration for decommissioned disk cleaner task
@@ -458,6 +460,20 @@ pub struct ExternalEndpointsConfig {
// allow/disallow wildcard certs, don't serve expired certs, etc.)
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct SupportBundleCollectorConfig {
/// period (in seconds) for periodic activations of this background task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,

/// A toggle to disable support bundle collection
///
/// Default: Off
#[serde(default)]
pub disable: bool,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct PhysicalDiskAdoptionConfig {
@@ -931,6 +947,7 @@ mod test {
inventory.period_secs = 10
inventory.nkeep = 11
inventory.disable = false
support_bundle_collector.period_secs = 30
physical_disk_adoption.period_secs = 30
decommissioned_disk_cleaner.period_secs = 30
phantom_disks.period_secs = 30
@@ -1069,6 +1086,11 @@ mod test {
nkeep: 11,
disable: false,
},
support_bundle_collector:
SupportBundleCollectorConfig {
period_secs: Duration::from_secs(30),
disable: false,
},
physical_disk_adoption: PhysicalDiskAdoptionConfig {
period_secs: Duration::from_secs(30),
disable: false,
@@ -1203,6 +1225,7 @@ mod test {
inventory.period_secs = 10
inventory.nkeep = 3
inventory.disable = false
support_bundle_collector.period_secs = 30
physical_disk_adoption.period_secs = 30
decommissioned_disk_cleaner.period_secs = 30
phantom_disks.period_secs = 30
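
Because `period_secs` is deserialized through `DurationSeconds<u64>` and `disable` carries `#[serde(default)]`, a config only needs to set the period, as the test TOML above does. A small test-style sketch of that behavior, assuming `SupportBundleCollectorConfig` and the `toml` crate are in scope the way the surrounding tests suggest:

#[cfg(test)]
mod support_bundle_collector_config_sketch {
    use super::SupportBundleCollectorConfig;
    use std::time::Duration;

    #[test]
    fn disable_defaults_to_false() {
        // Mirrors the `support_bundle_collector.period_secs = 30` lines
        // added to the example configs in this PR.
        let config: SupportBundleCollectorConfig =
            toml::from_str("period_secs = 30").expect("valid config");
        assert_eq!(config.period_secs, Duration::from_secs(30));
        assert!(!config.disable);
    }
}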
3 changes: 3 additions & 0 deletions nexus/Cargo.toml
@@ -42,6 +42,7 @@ hickory-resolver.workspace = true
http.workspace = true
http-body-util.workspace = true
hyper.workspace = true
hyper-staticfile.workspace = true
illumos-utils.workspace = true
internal-dns-resolver.workspace = true
internal-dns-types.workspace = true
@@ -81,6 +82,7 @@ serde.workspace = true
serde_json.workspace = true
serde_urlencoded.workspace = true
serde_with.workspace = true
sha2.workspace = true
sled-agent-client.workspace = true
slog.workspace = true
slog-async.workspace = true
@@ -119,6 +121,7 @@ update-common.workspace = true
update-engine.workspace = true
omicron-workspace-hack.workspace = true
omicron-uuid-kinds.workspace = true
zip.workspace = true

[dev-dependencies]
async-bb8-diesel.workspace = true
1 change: 1 addition & 0 deletions nexus/db-queries/src/db/datastore/mod.rs
@@ -123,6 +123,7 @@ pub use region::RegionAllocationParameters;
pub use silo::Discoverability;
pub use sled::SledTransition;
pub use sled::TransitionError;
pub use support_bundle::SupportBundleExpungementReport;
pub use switch_port::SwitchPortSettingsCombinedResult;
pub use virtual_provisioning_collection::StorageType;
pub use vmm::VmmStateUpdateResult;
48 changes: 31 additions & 17 deletions nexus/db-queries/src/db/datastore/support_bundle.rs
@@ -336,15 +336,6 @@ impl DataStore {
.execute_async(conn)
.await?;

let Some(arbitrary_valid_nexus) =
valid_nexus_zones.get(0).cloned()
else {
return Err(external::Error::internal_error(
"No valid Nexuses, we cannot re-assign this support bundle",
)
.into());
};

// Find all bundles on nexuses that no longer exist.
let bundles_with_bad_nexuses = dsl::support_bundle
.filter(dsl::assigned_nexus.eq_any(invalid_nexus_zones))
@@ -363,7 +354,7 @@
}
}).collect::<Vec<_>>();

// Mark these support bundles as failing, and assign then
// Mark these support bundles as failing, and assign them
// to a nexus that should still exist.
//
// This should lead to their storage being freed, if it
@@ -378,18 +369,41 @@
))
.execute_async(conn)
.await?;
let bundles_reassigned = diesel::update(dsl::support_bundle)

let mut report = SupportBundleExpungementReport {
bundles_failed_missing_datasets,
bundles_deleted_missing_datasets,
bundles_failing_missing_nexus,
bundles_reassigned: 0,
};

// Exit a little early if there are no bundles to re-assign.
//
// This is a tiny optimization, but really, it means that
// tests without Nexuses in their blueprints can succeed if
// they also have no support bundles. In practice, this is
// rare, but in our existing test framework, it's fairly
// common.
if bundles_to_reassign.is_empty() {
return Ok(report);
}

let Some(arbitrary_valid_nexus) =
valid_nexus_zones.get(0).cloned()
else {
return Err(external::Error::internal_error(
"No valid Nexuses, we cannot re-assign this support bundle",
)
.into());
};

report.bundles_reassigned = diesel::update(dsl::support_bundle)
.filter(dsl::id.eq_any(bundles_to_reassign))
.set(dsl::assigned_nexus.eq(arbitrary_valid_nexus))
.execute_async(conn)
.await?;

Ok(SupportBundleExpungementReport {
bundles_failed_missing_datasets,
bundles_deleted_missing_datasets,
bundles_failing_missing_nexus,
bundles_reassigned,
})
Ok(report)
}
.boxed()
},
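
One consequence of the early return added above (and of the commit "Support bundle re-assignment is no-op without any bundles to fail"): a blueprint with no in-service Nexus zones no longer produces an error as long as there are no bundles to move. A hedged test-style sketch of that expectation, assuming the relevant Nexus types are in scope; the helper name is hypothetical:

// Hypothetical helper: asserts that expungement is a no-op when the
// system has no support bundles, even if no valid Nexus exists.
async fn expect_noop_expungement(
    datastore: &DataStore,
    opctx: &OpContext,
    blueprint: &Blueprint,
) {
    let report = datastore
        .support_bundle_fail_expunged(opctx, blueprint)
        .await
        .expect("expungement should succeed with zero bundles");
    assert_eq!(report.bundles_reassigned, 0);
    assert_eq!(report.bundles_failing_missing_nexus, 0);
}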
40 changes: 40 additions & 0 deletions nexus/db-queries/src/db/datastore/zpool.rs
@@ -15,6 +15,7 @@ use crate::db::error::public_error_from_diesel;
use crate::db::error::ErrorHandler;
use crate::db::identity::Asset;
use crate::db::model::PhysicalDisk;
use crate::db::model::PhysicalDiskPolicy;
use crate::db::model::PhysicalDiskState;
use crate::db::model::Sled;
use crate::db::model::Zpool;
@@ -31,9 +32,11 @@ use omicron_common::api::external::DataPageParams;
use omicron_common::api::external::DeleteResult;
use omicron_common::api::external::Error;
use omicron_common::api::external::ListResultVec;
use omicron_common::api::external::LookupResult;
use omicron_common::api::external::LookupType;
use omicron_common::api::external::ResourceType;
use omicron_uuid_kinds::GenericUuid;
use omicron_uuid_kinds::SledUuid;
use omicron_uuid_kinds::ZpoolUuid;
use uuid::Uuid;

@@ -270,4 +273,41 @@ impl DataStore {

Ok(())
}

pub async fn zpool_get_sled_if_in_service(
&self,
opctx: &OpContext,
id: ZpoolUuid,
) -> LookupResult<SledUuid> {
opctx.authorize(authz::Action::ListChildren, &authz::FLEET).await?;
use db::schema::physical_disk::dsl as physical_disk_dsl;
use db::schema::zpool::dsl as zpool_dsl;

let conn = self.pool_connection_authorized(opctx).await?;
let id = zpool_dsl::zpool
.filter(zpool_dsl::id.eq(id.into_untyped_uuid()))
.filter(zpool_dsl::time_deleted.is_null())
.inner_join(
physical_disk_dsl::physical_disk
.on(zpool_dsl::physical_disk_id.eq(physical_disk_dsl::id)),
)
.filter(
physical_disk_dsl::disk_policy
.eq(PhysicalDiskPolicy::InService),
)
.select(zpool_dsl::sled_id)
.first_async::<Uuid>(&*conn)
.await
.map_err(|e| {
public_error_from_diesel(
e,
ErrorHandler::NotFoundByLookup(
ResourceType::Zpool,
LookupType::by_id(id.into_untyped_uuid()),
),
)
})?;

Ok(SledUuid::from_untyped_uuid(id))
}
}
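
A brief usage sketch of the new lookup (the caller shape is assumed, not taken from this PR): before talking to a sled agent, the collector can resolve a bundle's zpool to its sled and treat a lookup failure as "this bundle's storage is gone or its disk is expunged."

// Assumed caller shape: decide whether a bundle's storage is reachable.
async fn sled_for_bundle_zpool(
    datastore: &DataStore,
    opctx: &OpContext,
    zpool_id: ZpoolUuid,
) -> Option<SledUuid> {
    match datastore.zpool_get_sled_if_in_service(opctx, zpool_id).await {
        Ok(sled_id) => Some(sled_id),
        // A deleted zpool and a physical disk that is no longer
        // `InService` both surface as a lookup failure here.
        Err(_) => None,
    }
}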
1 change: 1 addition & 0 deletions nexus/examples/config-second.toml
@@ -122,6 +122,7 @@ inventory.nkeep = 5
inventory.disable = false
phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
support_bundle_collector.period_secs = 30
decommissioned_disk_cleaner.period_secs = 60
blueprints.period_secs_load = 10
blueprints.period_secs_execute = 60
1 change: 1 addition & 0 deletions nexus/examples/config.toml
@@ -108,6 +108,7 @@ inventory.nkeep = 5
inventory.disable = false
phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
support_bundle_collector.period_secs = 30
decommissioned_disk_cleaner.period_secs = 60
blueprints.period_secs_load = 10
blueprints.period_secs_execute = 60
31 changes: 31 additions & 0 deletions nexus/reconfigurator/execution/src/lib.rs
@@ -104,6 +104,13 @@ pub async fn realize_blueprint_with_overrides(
blueprint,
);

register_support_bundle_failure_step(
&engine.for_component(ExecutionComponent::SupportBundles),
&opctx,
datastore,
blueprint,
);

let sled_list = register_sled_list_step(
&engine.for_component(ExecutionComponent::SledList),
&opctx,
@@ -244,6 +251,30 @@ fn register_zone_external_networking_step<'a>(
.register();
}

fn register_support_bundle_failure_step<'a>(
registrar: &ComponentRegistrar<'_, 'a>,
opctx: &'a OpContext,
datastore: &'a DataStore,
blueprint: &'a Blueprint,
) {
registrar
.new_step(
ExecutionStepId::Ensure,
"Mark support bundles as failed if they rely on an expunged disk or sled",
move |_cx| async move {
datastore
.support_bundle_fail_expunged(
&opctx, blueprint,
)
.await
.map_err(|err| anyhow!(err))?;

StepSuccess::new(()).into()
},
)
.register();
}

fn register_sled_list_step<'a>(
registrar: &ComponentRegistrar<'_, 'a>,
opctx: &'a OpContext,
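
The `register_support_bundle_failure_step` added above calls `support_bundle_fail_expunged`, which returns the `SupportBundleExpungementReport` exported from the datastore. A sketch of how a caller might surface those counts; the logging helper is hypothetical, while the field names come from the datastore diff earlier in this PR:

// Hypothetical logging helper for the report produced by
// `support_bundle_fail_expunged` during blueprint execution.
fn log_expungement_report(
    log: &slog::Logger,
    report: &SupportBundleExpungementReport,
) {
    slog::info!(
        log,
        "support bundle expungement handled";
        "bundles_failed_missing_datasets" => report.bundles_failed_missing_datasets,
        "bundles_deleted_missing_datasets" => report.bundles_deleted_missing_datasets,
        "bundles_failing_missing_nexus" => report.bundles_failing_missing_nexus,
        "bundles_reassigned" => report.bundles_reassigned,
    );
}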