feat: add maintenance mode for upgrades #2211

Open
wants to merge 22 commits into
base: main
from

iamKunalGupta
Member

@iamKunalGupta iamKunalGupta commented Nov 1, 2024

  • Introduces Maintenance mode (status is available via dynamic config: PEERDB_MAINTENANCE_MODE_ENABLED)
  • Maintenance mode consists of 2 Workflows:
    • StartMaintenance - for pre-upgrade, responsible for
      • Waiting for running snapshots
      • Updating dynamic config to true
      • Pausing and backing up currently running mirrors
    • EndMaintenance - for post-upgrade, responsible for
      • Resuming backed up mirrors
      • Updating dynamic config to false
  • During the upgrade (between Start and End), mirrors cannot be created or mutated in any way.
  • There is also an instance info API that returns Ready or Maintenance, which can be used for UI changes later.

There are 2 ways to trigger these 2 workflows:

  1. API call to flow-api
  2. Running the new maintenance entrypoint with the respective args

A new task queue is added so that the maintenance tasks can be spun up even during pre-upgrade hooks (run from versions earlier than those containing this PR). This also ensures that the latest version of the maintenance flows always runs, irrespective of the old version.

@iamKunalGupta iamKunalGupta marked this pull request as ready for review November 1, 2024 04:16
@iamKunalGupta iamKunalGupta requested a review from a team November 1, 2024 04:24
@iamKunalGupta iamKunalGupta changed the title from "feat: add maintenance mode" to "feat: add maintenance mode for upgrades" Nov 1, 2024
}

func UpdatePeerDBMaintenanceModeEnabled(ctx context.Context, pool *pgxpool.Pool, enabled bool) error {
return UpdateDynamicSetting(ctx, pool, "PEERDB_MAINTENANCE_MODE_ENABLED", ptr.String(strconv.FormatBool(enabled)))
Contributor

@serprex serprex Nov 1, 2024

If this is controlled by the system, I'm not sure we want to expose it to users where they can toggle it, in which case it shouldn't be in dynconf

Member Author

This can be helpful for manual intervention or maintenance. We can hide it later from dynamic config if needed

Contributor

@serprex serprex Nov 2, 2024

I've been thinking it'd be nice to be able to auth into ui with peerdb/clickhouse email & have access to extended functionality. Then this could be in catalog & have ui for us

Member Author

Yes, maybe via oauth where different roles can map to different permissions like superadmin, admin or readonly

flow/activities/maintenance_activity.go (outdated)
logEvery time.Duration,
alertEvery time.Duration,
) (protos.FlowStatus, error) {
// In case a mirror was just kicked off, it shows up in the running state, we wait for a bit before checking for snapshot
Contributor

this really shouldn't happen, seems like a bug

flowStatus, err := RunEveryIntervalUntilFinish(ctx, func() (bool, protos.FlowStatus, error) {
activity.RecordHeartbeat(ctx, fmt.Sprintf("Waiting for mirror %s to finish snapshot", mirror.MirrorName))
mirrorStatus, err = a.getMirrorStatus(ctx, mirror)
Contributor

Suggested change
- mirrorStatus, err = a.getMirrorStatus(ctx, mirror)
+ mirrorStatus, err := a.getMirrorStatus(ctx, mirror)

Comment on lines +92 to +101
mirrorStatus, err := a.getMirrorStatus(ctx, mirror)
if err != nil {
return mirrorStatus, err
}

slog.Info("Checking and waiting if mirror is snapshot", "mirror", mirror.MirrorName, "workflowId", mirror.WorkflowId, "status",
mirrorStatus.String())
if mirrorStatus != protos.FlowStatus_STATUS_SNAPSHOT && mirrorStatus != protos.FlowStatus_STATUS_SETUP {
return mirrorStatus, nil
}
Contributor

Suggested change
mirrorStatus, err := a.getMirrorStatus(ctx, mirror)
if err != nil {
return mirrorStatus, err
}
slog.Info("Checking and waiting if mirror is snapshot", "mirror", mirror.MirrorName, "workflowId", mirror.WorkflowId, "status",
mirrorStatus.String())
if mirrorStatus != protos.FlowStatus_STATUS_SNAPSHOT && mirrorStatus != protos.FlowStatus_STATUS_SETUP {
return mirrorStatus, nil
}

Comment on lines +158 to +159
err = model.FlowSignal.SignalClientWorkflow(ctx, a.TemporalClient, mirror.WorkflowId, "", model.PauseSignal)
if err != nil {
Contributor

Suggested change
- err = model.FlowSignal.SignalClientWorkflow(ctx, a.TemporalClient, mirror.WorkflowId, "", model.PauseSignal)
- if err != nil {
+ if err := model.FlowSignal.SignalClientWorkflow(ctx, a.TemporalClient, mirror.WorkflowId, "", model.PauseSignal); err != nil {

elsewhere too
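
For readers less familiar with the idiom being suggested, here is a small self-contained sketch. The `pause` and `pauseAll` functions are hypothetical stand-ins for the signalling call, not code from this PR. The `if err := ...; err != nil` form scopes `err` to the `if` statement, so a stale error from an earlier call cannot be checked by mistake further down.

```go
package main

import (
	"errors"
	"fmt"
)

// pause is a stand-in for a call such as signalling a workflow to pause.
func pause(name string) error {
	if name == "" {
		return errors.New("empty mirror name")
	}
	return nil
}

// pauseAll uses the scoped form so each err lives only inside its own
// if statement.
func pauseAll(names []string) error {
	for _, name := range names {
		if err := pause(name); err != nil {
			return fmt.Errorf("failed to pause mirror %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(pauseAll([]string{"orders", "users"}))
	fmt.Println(pauseAll([]string{"orders", ""}))
}
```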

Comment on lines +219 to +227
row := pool.QueryRow(ctx, `
select cli_version, api_version, skipped, skipped_reason
from maintenance.start_maintenance_outputs
order by created_at desc
limit 1
`)
var result StartMaintenanceResult
err = row.Scan(&result.CLIVersion, &result.APIVersion, &result.Skipped, &result.SkippedReason)
if err != nil {
Contributor

Suggested change
- row := pool.QueryRow(ctx, `
-     select cli_version, api_version, skipped, skipped_reason
-     from maintenance.start_maintenance_outputs
-     order by created_at desc
-     limit 1
- `)
- var result StartMaintenanceResult
- err = row.Scan(&result.CLIVersion, &result.APIVersion, &result.Skipped, &result.SkippedReason)
- if err != nil {
+ var result StartMaintenanceResult
+ if err := pool.QueryRow(ctx, `
+     select cli_version, api_version, skipped, skipped_reason
+     from maintenance.start_maintenance_outputs
+     order by created_at desc
+     limit 1
+ `).Scan(&result.CLIVersion, &result.APIVersion, &result.Skipped, &result.SkippedReason); err != nil {

var state protos.FlowStatus
err = res.Get(&state)
if err != nil {
slog.Error(fmt.Sprintf("failed to get status in workflow with ID %s: %s", workflowID, err.Error()))
Contributor

Suggested change
- slog.Error(fmt.Sprintf("failed to get status in workflow with ID %s: %s", workflowID, err.Error()))
+ slog.Error("failed to get status in workflow with ID "+workflowID, slog.Any("error", err))

Comment on lines +21 to +22
err = res.Get(&state)
if err != nil {
Contributor

Suggested change
- err = res.Get(&state)
- if err != nil {
+ if err := res.Get(&state); err != nil {

Comment on lines +16 to +18
slog.Error(fmt.Sprintf("failed to get status in workflow with ID %s: %s", workflowID, err.Error()))
return protos.FlowStatus_STATUS_UNKNOWN,
fmt.Errorf("failed to get status in workflow with ID %s: %w", workflowID, err)
Contributor

Suggested change
- slog.Error(fmt.Sprintf("failed to get status in workflow with ID %s: %s", workflowID, err.Error()))
- return protos.FlowStatus_STATUS_UNKNOWN,
-     fmt.Errorf("failed to get status in workflow with ID %s: %w", workflowID, err)
+ slog.Error(fmt.Sprintf("failed to query status in workflow with ID %s: %s", workflowID, err.Error()))
+ return protos.FlowStatus_STATUS_UNKNOWN,
+     fmt.Errorf("failed to query status in workflow with ID %s: %w", workflowID, err)

makes error messages distinct so that when looking up error messages in code we know which line was hit
