Snapshot partition processor state to object store #1807

Open · 7 of 9 tasks · Tracked by #1675
tillrohrmann opened this issue Aug 8, 2024 · 2 comments

Comments

@tillrohrmann (Contributor) commented Aug 8, 2024

To support partition processor bootstrap, to catch up stale processors after downtime (i.e. to handle trim gaps), and to trim the log safely, we need snapshotting support.

Scope and features

How will snapshots be triggered? What is their frequency?

  • PP leaders will create snapshots autonomously, based on accumulated commits (and not strictly on the passage of time); see the sketch after this list
  • we leave the door open for the cluster controller to orchestrate this with a global view of PP state and their regional placement in the future
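
A minimal sketch of the commit-count-based trigger described above, assuming a hypothetical `SnapshotTrigger` helper inside the PP leader; the names and threshold are illustrative, not the actual implementation:

```rust
/// Hypothetical helper the PP leader could consult after applying each record.
struct SnapshotTrigger {
    records_since_last_snapshot: u64,
    /// Illustrative threshold; in practice this would come from configuration.
    records_per_snapshot: u64,
}

impl SnapshotTrigger {
    /// Returns true once enough records have accumulated to warrant a snapshot.
    fn on_record_applied(&mut self) -> bool {
        self.records_since_last_snapshot += 1;
        self.records_since_last_snapshot >= self.records_per_snapshot
    }

    /// Reset the counter once a snapshot has been successfully created and uploaded.
    fn on_snapshot_created(&mut self) {
        self.records_since_last_snapshot = 0;
    }
}
```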

Where do snapshots go?

  • we will use the object_store crate to support various cloud object stores (a construction sketch follows below)
  • this also supports a local filesystem target for testing
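
A rough sketch of building a store client with the object_store crate, covering both a cloud target and the local filesystem target mentioned above. The `SnapshotDestination` config enum and function name are assumptions, and builder methods vary slightly across crate versions:

```rust
use std::sync::Arc;

use object_store::{aws::AmazonS3Builder, local::LocalFileSystem, ObjectStore};

/// Hypothetical snapshot destination configuration.
enum SnapshotDestination {
    S3 { bucket: String, region: String },
    LocalDir { path: String },
}

fn build_store(dest: &SnapshotDestination) -> object_store::Result<Arc<dyn ObjectStore>> {
    match dest {
        SnapshotDestination::S3 { bucket, region } => {
            // Credentials are picked up from the environment in this sketch.
            let store = AmazonS3Builder::from_env()
                .with_bucket_name(bucket.as_str())
                .with_region(region.as_str())
                .build()?;
            Ok(Arc::new(store))
        }
        SnapshotDestination::LocalDir { path } => {
            // Local filesystem target, useful for testing.
            Ok(Arc::new(LocalFileSystem::new_with_prefix(path)?))
        }
    }
}
```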

How are snapshots organized in the target bucket?

  • we will use a {partition_id}/{snapshot_id}_{lsn} key structure, together with a top-level {partition_id}/latest.json pointer that is atomically updated to reflect the most recent snapshot; this lets writers avoid coordination (see the sketch below)
  • if a node is appropriately configured, it should be able to bootstrap and catch up to the log without involving the cluster controller (CC)
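
For illustration, the key construction implied by this layout might look as follows; the exact encoding (padding, separators) and the contents of latest.json are assumptions, not the actual format:

```rust
/// Illustrative only: the prefix under which a snapshot's files are stored,
/// e.g. "17/snap-abc_0000004242".
fn snapshot_prefix(partition_id: u16, snapshot_id: &str, lsn: u64) -> String {
    format!("{partition_id}/{snapshot_id}_{lsn:010}")
}

/// A single object per partition, overwritten atomically (one PUT) so that
/// readers can find the newest snapshot without listing the bucket.
fn latest_pointer(partition_id: u16) -> String {
    format!("{partition_id}/latest.json")
}
```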

How will trimming be driven?

  • The CC will use snapshot information (reported by PPs) to trigger log trimming; a sketch of a possible trim-point derivation follows below
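
A sketch of how the CC might derive per-partition trim points from the snapshot LSNs that PPs report; this is an assumed computation, not the actual controller logic:

```rust
use std::collections::HashMap;

type PartitionId = u16;
type Lsn = u64;

/// For each partition, trim up to the newest LSN known to be covered by a
/// snapshot: any processor behind that point can recover from the snapshot
/// repository instead of the log.
fn trim_points(
    reported_snapshot_lsns: &HashMap<PartitionId, Vec<Lsn>>,
) -> HashMap<PartitionId, Lsn> {
    reported_snapshot_lsns
        .iter()
        .filter_map(|(partition, lsns)| lsns.iter().max().map(|lsn| (*partition, *lsn)))
        .collect()
}
```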

How will the Cluster Controller learn about what valid snapshots exist (and in which locations in the future)?

  • PPs will include the latest known snapshot LSN they are aware of (either because they bootstrapped from it, or because they uploaded it themselves) in the cluster status response message
  • the snapshot repository supports efficient (single-GET) lookup of the latest snapshot metadata per partition (sketched below)
  • this can be extended to support region-awareness for trim decisions in the future (i.e. only trim at a point which is snapshotted across multiple regions)
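
The single-GET lookup could look roughly like this, assuming the hypothetical {partition_id}/latest.json pointer and a made-up metadata shape; the real snapshot metadata format is defined by the snapshot repository:

```rust
use object_store::{path::Path, ObjectStore};
use serde::Deserialize;

/// Assumed shape of the per-partition pointer object; illustrative only.
#[derive(Deserialize)]
struct LatestSnapshot {
    snapshot_id: String,
    lsn: u64,
}

async fn latest_snapshot(
    store: &dyn ObjectStore,
    partition_id: u16,
) -> anyhow::Result<LatestSnapshot> {
    // One GET of the pointer object is enough to learn about the newest snapshot.
    let pointer = Path::from(format!("{partition_id}/latest.json"));
    let bytes = store.get(&pointer).await?.bytes().await?;
    Ok(serde_json::from_slice(&bytes)?)
}
```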

How will PPs be bootstrapped from a snapshot?

  • the bucket location will be known and all nodes are expected to have access to it (even across regional boundaries)
  • on restore, the restored-from snapshot LSN becomes the last-archived-lsn property for that PP on the node (see the sketch below)
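
A minimal sketch of the restore step, with assumed field names; the point is simply that the snapshot's LSN seeds both the applied position and the locally tracked last-archived-lsn, so no object store access is needed on subsequent restarts:

```rust
/// Assumed, simplified view of the per-partition state a node tracks locally.
struct PartitionProcessorState {
    applied_lsn: u64,
    /// Latest LSN known to be covered by a snapshot in the repository.
    last_archived_lsn: Option<u64>,
}

/// After restoring a snapshot taken at `snapshot_lsn`, the PP resumes reading
/// the log from `snapshot_lsn + 1`.
fn state_after_restore(snapshot_lsn: u64) -> PartitionProcessorState {
    PartitionProcessorState {
        applied_lsn: snapshot_lsn,
        last_archived_lsn: Some(snapshot_lsn),
    }
}
```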

How will we handle trim gaps?

  • using the same mechanism as bootstrap: when PPs encounter a trim gap in the log, they will fall back to the snapshot bucket in the object store to find a more recent state snapshot to recover from (sketched below)
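
Sketched in assumed types, the fallback amounts to treating a trim gap like a bootstrap trigger:

```rust
/// Assumed, simplified log record shape.
enum LogRecord {
    Data { lsn: u64 },
    /// The log was trimmed; records up to and including `to_lsn` are gone.
    TrimGap { to_lsn: u64 },
}

enum NextStep {
    Apply { lsn: u64 },
    /// Fetch a snapshot covering at least `min_lsn` from the repository,
    /// restore it, and resume reading the log from there.
    RestoreSnapshotAtOrAbove { min_lsn: u64 },
}

fn on_log_record(record: LogRecord) -> NextStep {
    match record {
        LogRecord::Data { lsn } => NextStep::Apply { lsn },
        LogRecord::TrimGap { to_lsn } => NextStep::RestoreSnapshotAtOrAbove { min_lsn: to_lsn },
    }
}
```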

Who manages the lifecycle of snapshots?

  • we will leave this up to the operator to configure an object store lifecycle policy; all cloud providers support rich automated management with multiple storage/archival tiers

Additional considerations:

  • Avoid interacting with the object store on startup (unless we are bootstrapping)
  • Instead, each PP will track the last known snapshot LSN locally
  • Avoid time-based/periodic snapshots; we can snapshot based on the number of records since the last snapshot, or the number of bytes, etc.
  • How will bootstrapping a region work once other regions are already up? (Likely either by cross-regional initial snapshot ingestion, or by the operator manually seeding a regional snapshot bucket from another region)
  • Should we drop the current SnapshotIds? No; even though we don't intend to use them as keys in the snapshot object store layout, it's useful to have a unique "event id" for each snapshot generation, e.g. to track down logs related to its creation. It only needs to exist within the snapshot metadata uploaded to the store

Consider but don't implement

  • for geo-distributed deployments, we will have independent buckets in each region, and have region-aware snapshot triggering to ensure that we don't rely on any async replication mechanism (that can still be enabled as another layer of defense by the operator)
  • how do we protect the only/last snapshot from being deleted by aggressive object store lifecycle management policies?
@pcholakov (Contributor) commented Nov 7, 2024

Rough notes from chatting with @tillrohrmann:

  • snapshots should be driven by RPC from the cluster controller; pick a random "reasonably fresh" node prioritizing non-leaders
    • snapshotting policy can thus be centrally managed on the admin node(s)
    • the request command to generate a snapshot could include the expected minimum LSN to improve safety (i.e. if we ask a stale PP to make a snapshot, it should decline); see the request/response sketch after this list
  • use a directory/key prefix layout like $base/<partition_id>/<lsn>/... for easy navigation
  • access to object store (if used) is uniform: all nodes have read/write access to the same bucket; this can be fine-tuned down the line
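
Possible request/response shapes for that RPC, with assumed names (not Restate's actual protocol); the key idea is the minimum-LSN guard that lets a stale PP decline:

```rust
/// Assumed shape of the cluster controller to PP request.
struct CreateSnapshotRequest {
    partition_id: u16,
    /// The PP should decline if its applied LSN is below this, so a stale
    /// processor doesn't produce a snapshot older than the controller expects.
    min_expected_lsn: u64,
}

/// Assumed shape of the PP's reply.
enum CreateSnapshotResponse {
    Created { snapshot_id: String, lsn: u64 },
    Declined { applied_lsn: u64 },
}

fn handle_create_snapshot(applied_lsn: u64, req: &CreateSnapshotRequest) -> CreateSnapshotResponse {
    if applied_lsn < req.min_expected_lsn {
        return CreateSnapshotResponse::Declined { applied_lsn };
    }
    // ... produce and upload the snapshot, then report its identity and LSN.
    CreateSnapshotResponse::Created {
        snapshot_id: "generated-snapshot-id".to_owned(),
        lsn: applied_lsn,
    }
}
```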

Out of scope for now:

  • incremental snapshots, reusing prior snapshots' SSTs

Open questions:

  • how will we handle multi-region / geo-replicated support?
  • how will the snapshot lifecycle be managed?
  • how will PPs be bootstrapped from the snapshot store? does the cluster controller get involved?

Some thoughts on the open questions:

how will we handle multi-region / geo-replicated support?

I think we can leave this out of scope for now and only manage it in the object store config; S3 and Azure Blob Storage both support async cross-region replication. For something like snapshots, where picking a slightly older one to bootstrap from is fine, this is completely acceptable. In the worst case, new PPs won't be able to start up in a region whose snapshot bucket replication is running well behind the log tail, and a region in that condition will likely be experiencing other difficulties beyond snapshot staleness.

how will the snapshot lifecycle be managed?

My 2c: we should upload snapshots and not manage them further ourselves, leaving this to be handled via object store policies. For example, S3 supports rich lifecycle policies that migrate objects to cheaper storage classes or delete them after a while. The one exception is local directory snapshots; assuming those are used only for short-lived test clusters, we shouldn't have long-term disk usage problems with them.

@pcholakov (Contributor)

Updated issue description based on our internal discussion yesterday.
