To support partition processor bootstrap, to catch up stale processors after downtime (i.e. to handle trim gaps), and to safely trim the log, we need snapshotting support.
Scope and features
How will snapshots be triggered? What is their frequency?
PP leaders will create snapshots autonomously, based on accumulated commits (and not strictly on time passage)
we leave the door open for the cluster controller to orchestrate this with a global view of PP state and their regional placement in the future
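As a rough sketch of what a commit-count-based trigger could look like (the names and threshold are purely illustrative, not the actual implementation):

```rust
// Purely illustrative; field and method names are hypothetical.
struct SnapshotTrigger {
    /// LSN covered by the last snapshot this PP produced or restored from.
    last_snapshot_lsn: u64,
    /// Take a new snapshot once this many records have been applied since then.
    records_per_snapshot: u64,
}

impl SnapshotTrigger {
    /// Called by the PP leader after applying a batch of log records.
    fn should_snapshot(&self, applied_lsn: u64) -> bool {
        applied_lsn.saturating_sub(self.last_snapshot_lsn) >= self.records_per_snapshot
    }
}
```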
Where do snapshots go?
we will use the object_store crate to support various cloud object stores
this also supports a local filesystem target for testing
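For illustration, constructing the snapshot repository backend with the object_store crate might look roughly like this (a sketch only: it assumes the crate's aws feature is enabled and simplifies destination parsing to two cases):

```rust
use std::sync::Arc;

use object_store::aws::AmazonS3Builder;
use object_store::local::LocalFileSystem;
use object_store::ObjectStore;

/// Build the snapshot repository backend from a destination string.
/// Only two targets are shown; the object_store crate supports more providers.
fn snapshot_store(destination: &str) -> object_store::Result<Arc<dyn ObjectStore>> {
    if let Some(bucket) = destination.strip_prefix("s3://") {
        // Region and credentials are picked up from the environment in this sketch.
        let s3 = AmazonS3Builder::from_env().with_bucket_name(bucket).build()?;
        Ok(Arc::new(s3))
    } else {
        // Local filesystem target, handy for tests and single-node setups.
        Ok(Arc::new(LocalFileSystem::new_with_prefix(destination)?))
    }
}
```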
How are snapshots organized in the target bucket?
we will use a {partition_id}/{snapshot_id}_{lsn} key structure, which lets us avoid coordination between writers, plus a top-level {partition_id}/latest.json pointer that is atomically updated to point at the most recent snapshot
if a node is appropriately configured, it should be able to bootstrap and catch up to the log even without the cluster controller (CC)
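A hypothetical shape for the latest.json pointer, just to illustrate the idea (field names are not a finalized schema; serde is assumed for serialization):

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical contents of `{partition_id}/latest.json`; rewritten in full
/// (a single PUT) after every successful snapshot upload.
#[derive(Serialize, Deserialize)]
struct LatestSnapshotPointer {
    partition_id: u16,
    /// Unique id of the snapshot generation (see the note on SnapshotIds below).
    snapshot_id: String,
    /// LSN up to which this snapshot covers the partition's state.
    min_applied_lsn: u64,
    /// Key prefix of the snapshot data, e.g. "{partition_id}/{snapshot_id}_{lsn}".
    key_prefix: String,
}
```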
How will trimming be driven?
The CC will use snapshot information (reported by PPs) to trigger log trimming
How will the Cluster Controller learn about what valid snapshots exist (and in which locations in the future)?
PPs will include the latest snapshot LSN they are aware of (either because they bootstrapped from it, or because they uploaded it themselves) in the cluster status response message
the snapshot repository supports efficient (single GET) lookup of the latest snapshot metadata (per partition!)
this can be extended to support region-awareness for trim decisions in the future (i.e. only trim at a point which is snapshotted across multiple regions)
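Building on the hypothetical pointer sketched above, the single-GET lookup could look roughly like this (anyhow is used only to keep error handling out of the sketch):

```rust
use object_store::{path::Path, ObjectStore};

/// Look up the latest snapshot pointer for a partition with a single GET,
/// reusing the hypothetical `LatestSnapshotPointer` type sketched above.
async fn latest_snapshot(
    store: &dyn ObjectStore,
    partition_id: u16,
) -> anyhow::Result<LatestSnapshotPointer> {
    let key = Path::from(format!("{partition_id}/latest.json"));
    let bytes = store.get(&key).await?.bytes().await?;
    Ok(serde_json::from_slice(&bytes)?)
}
```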
How will PPs be bootstrapped from a snapshot?
the bucket location will be known and all nodes are expected to have access to it (even across regional boundaries)
on restore, the restored-from snapshot's LSN becomes the last-archived-lsn property for the PP on that node
How will we handle trim gaps?
using the same mechanism as bootstrap: when PPs encounter a trim gap in the log, they fall back to the snapshot bucket in the object store to find a state snapshot recent enough to recover from
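A sketch of the trim-gap fallback decision, with purely illustrative types (the real log and PP interfaces differ):

```rust
/// Hypothetical outcome of the PP reading the log at its next LSN.
enum ReadOutcome {
    /// A regular record to apply.
    Record { lsn: u64 },
    /// The log has been trimmed up to and including this LSN.
    TrimGap { trimmed_up_to: u64 },
}

/// What the PP should do next; variants are illustrative only.
enum NextStep {
    Apply { lsn: u64 },
    /// Restore the latest snapshot, then resume reading the log after its LSN.
    RestoreFromSnapshot,
    /// No snapshot recent enough to bridge the gap exists yet.
    WaitAndRetry,
}

fn on_read(outcome: ReadOutcome, latest_snapshot_lsn: Option<u64>) -> NextStep {
    match outcome {
        ReadOutcome::Record { lsn } => NextStep::Apply { lsn },
        ReadOutcome::TrimGap { trimmed_up_to } => match latest_snapshot_lsn {
            // The snapshot covers the trimmed range: restore and continue from there.
            Some(lsn) if lsn >= trimmed_up_to => NextStep::RestoreFromSnapshot,
            _ => NextStep::WaitAndRetry,
        },
    }
}
```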
Who manages the lifecycle of snapshots?
we will leave this up to the operator to configure an object store lifecycle policy; all cloud providers support rich automated management with multiple storage/archival tiers
Additional considerations:
Avoid interacting with the object store on startup (unless we are bootstrapping)
Instead, each PP will track the last known snapshot LSN locally
Avoid time-based/periodic snapshots; instead, snapshot based on the number of records applied since the last snapshot, the number of bytes written, etc.
How will bootstrapping a new region work once other regions are already up? (Likely either by cross-regional initial snapshot ingestion, or by the operator manually seeding the regional snapshot bucket from another region)
Should we drop the current SnapshotIds? No; even though we don't intend to use them as keys in the snapshot object store layout, it's useful to have a unique "event id" for each snapshot generation, e.g. to track down logs related to its creation. It only needs to exist within the snapshot metadata uploaded to the store
Consider but don't implement
for geo-distributed deployments, we will have independent buckets in each region, with region-aware snapshot triggering, so that we don't rely on any async replication mechanism (which the operator can still enable as another layer of defense)
how do we protect the only/last snapshot from being deleted by aggressive object store lifecycle management policies?
snapshots should be driven by RPC from the cluster controller; pick a random "reasonably fresh" node prioritizing non-leaders
snapshotting policy can thus be centrally managed on the admin node(s)
the request command to generate a snapshot could include the expected minimum LSN to improve safety (i.e. if we ask a stale PP to make a snapshot, it should decline); see the sketch after this list
use a directory/key prefix layout like $base/<partition_id>/<lsn>/... for easy navigation
access to object store (if used) is uniform: all nodes have read/write access to the same bucket; this can be fine-tuned down the line
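To illustrate the minimum-LSN safety check mentioned above (request/response types are hypothetical, not the actual RPC definitions):

```rust
/// Hypothetical RPC payload sent by the cluster controller to the chosen node.
struct CreateSnapshotRequest {
    partition_id: u16,
    /// The snapshot is only useful to the CC if it covers at least this LSN.
    min_expected_lsn: u64,
}

enum CreateSnapshotResponse {
    Accepted { snapshot_lsn: u64 },
    /// The targeted PP has not yet applied the requested LSN and declines.
    TooStale { applied_lsn: u64 },
}

fn handle_create_snapshot(
    req: &CreateSnapshotRequest,
    applied_lsn: u64, // the PP's locally applied LSN
) -> CreateSnapshotResponse {
    if applied_lsn < req.min_expected_lsn {
        return CreateSnapshotResponse::TooStale { applied_lsn };
    }
    // ... produce and upload the snapshot here (elided) ...
    CreateSnapshotResponse::Accepted { snapshot_lsn: applied_lsn }
}
```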
how will we handle multi-region / geo-replicated support?
how will the snapshot lifecycle be managed?
how will PPs be bootstrapped from the snapshot store? does the cluster controller get involved?
Some thoughts on the open questions:
how will we handle multi-region / geo-replicated support?
I think we can leave this out of scope for now and only manage it in the object store config; S3 and Azure Blob Storage both support async cross-region replication. For something like snapshots, where picking a slightly older one to bootstrap from is acceptable, this is fine. In the worst case, new PPs won't be able to start up in a region whose snapshot bucket replication is running well behind the log tail; a region in that condition will likely be experiencing other difficulties beyond snapshot staleness.
how will the snapshot lifecycle be managed?
My 2c: we should upload snapshots and then not touch them again, leaving retention to be managed via object store policies. For example, S3 supports rich lifecycle policies to migrate objects to cheaper storage classes, or delete them after a while. The one exception is local directory snapshots; assuming those are used only for short-lived test clusters, we shouldn't have long-term disk usage problems with them.