Add docs about backwards compat design (#7085)
andrewjstone authored Nov 26, 2024
1 parent 1f83f07 commit 8674bfc
Showing 1 changed file with 33 additions and 0 deletions.
33 changes: 33 additions & 0 deletions docs/control-plane-architecture.adoc
It's essential that components provide visibility into what they're doing for debugging.
* Sagas are potentially long-lived. Without needing any per-saga work, the saga log provides detailed information about which steps have run, which steps are in-progress, and the results of each step that completed.
* Background tasks are continuous processes. They can provide whatever detailed status they want to, including things like: activity counters, error counters, ringbuffers of recent events, data produced by the task, etc. These can be viewed with `omdb`.

=== Backwards Compatibility

==== Rules for data compatibility across versions (read this)

These rules are the most important part of backwards compatibility to focus on, because they cover the ad-hoc steps not handled by our infrastructure. Following these two rules should help make migrations safe and seamless, which allows production and test code to use the latest version of the given data structures at all times.

1. Ensure the code to perform an upgrade / backfill lives in one location. This makes it easier to find and remove once it is no longer needed. It also makes it easier to test in isolation, and to understand the complete change.
2. When doing a migration from an old format to a new format, prefer to do it up front during some kind of startup operation so that the rest of the system can operate only in the new world. The system should not backfill data during normal operation, as that forces code to support both the old and new formats simultaneously and creates more code bifurcation for testing.
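The two rules can be sketched together. This is a minimal, hypothetical example (the `ConfigV1`/`ConfigV2` types and the default port are illustrative, not real omicron types): the entire upgrade path lives in one function that runs once at startup, so nothing downstream ever sees the old format.

```rust
// Hypothetical versioned config types for illustration only.
#[derive(Debug, PartialEq)]
struct ConfigV1 {
    addr: String,
}

#[derive(Debug, PartialEq)]
struct ConfigV2 {
    addr: String,
    port: u16,
}

// Whatever format was found on disk at startup.
enum StoredConfig {
    V1(ConfigV1),
    V2(ConfigV2),
}

/// The single, centralized upgrade path (rule 1), run up front at
/// startup (rule 2). Once this returns, no other code needs to know
/// that V1 ever existed.
fn upgrade_at_startup(stored: StoredConfig) -> ConfigV2 {
    match stored {
        // Backfill the field V1 lacked with an assumed default.
        StoredConfig::V1(old) => ConfigV2 { addr: old.addr, port: 8080 },
        StoredConfig::V2(cfg) => cfg,
    }
}
```

Because the conversion is isolated in `upgrade_at_startup`, it can be unit tested against both input formats and deleted wholesale once V1 support is no longer needed.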

==== Rationale and Background (read this if you care)

===== Network services

In general, backwards compatibility between services will be provided at the API level as described in <<rfd421>>. Most internal control plane service APIs are Dropshot based and can therefore utilize the same strategy. Some other services, such as trust quorum and Crucible, operate over TCP with custom protocols and have their own mechanisms for backwards compatibility. The introduction of new services of this type should be largely unnecessary for the foreseeable future.

===== Database state

While runtime compatibility between services is largely taken care of, we still have to worry about compatibility of data on persistent storage. As a distributed system that cannot be atomically updated, a rack may have different versions of software running on different sleds, with each sled containing persistent state in a slightly different format. Furthermore, the structure of this data may differ across customer racks depending upon when they were first set up. We have various categories of persistent state. Some of it is stored in database management systems (DBMS), where schemas are concrete and well-defined. For these scenarios, we can rely on our schema migration strategy as defined in <<rfd527>>. After much discussion, this is largely a "solved" problem.

===== Ad-hoc persistent state

Slightly more concerning are things like serde-serialized data types stored as JSON on various drives in the system. Most of these are stored in https://github.com/oxidecomputer/omicron/blob/5b865b74208ce0a11b8aec1bca12e2a6ea538bb6/common/src/ledger.rs#L48-L62[Ledgers] across both M.2 drives and are read only by sled-agent. These ledgers are used to store things such as the initial rack plan, key shares for trust quorum, and networking configuration (bootstore data) for early cold boot support. We have largely dealt with these in an ad-hoc manner: in most cases, new code in sled-agent reads the old structure and writes the new version to disk on sled-agent startup. This largely works fine, though in some instances it has caused problems during upgrade when it was not done properly. So far this has been a reliable strategy for this limited set of ledger data, and it is unlikely we will need to change it. We do have to carefully test our upgrade paths, but we should be doing that anyway, and work to improve our support on this front is ongoing. An additional concern is remembering to prune old-version support once all customers are past the point of needing it.

It is also important to note why the previous strategy works well and is largely foolproof. Each of these structures is coupled to a local process and only written and read by that process in a controlled manner. Format modifications are only made during update and are only visible locally. And most importantly, the code to perform those reads and writes is largely centralized in a single method, or at least a single file, per ledger. This makes it easy to reason about and unit test.
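The read-old/write-new pattern can be sketched as follows. Everything here is hypothetical for illustration: the `BootstrapLedger` type and the key=value on-disk format are made up (real ledgers are serde-serialized JSON written to both M.2 drives), but the shape is the same: one function knows about the old format, and it rewrites the file in the new format before anything else runs.

```rust
use std::fs;
use std::path::Path;

// Hypothetical ledger type; not omicron's actual data structure.
#[derive(Debug, PartialEq)]
struct BootstrapLedger {
    version: u32,
    rack_id: String,
}

/// Parse either the old v1 format (which predates the `version` field)
/// or the current v2 format. This is the only place in the process
/// that knows v1 ever existed.
fn parse_ledger(raw: &str) -> Option<BootstrapLedger> {
    let mut version = 1; // v1 files had no version field
    let mut rack_id = None;
    for line in raw.lines() {
        let (key, value) = line.split_once('=')?;
        match key {
            "version" => version = value.parse().ok()?,
            "rack_id" => rack_id = Some(value.to_string()),
            _ => {} // tolerate unknown keys
        }
    }
    Some(BootstrapLedger { version, rack_id: rack_id? })
}

/// On startup, read whichever format is on disk and immediately persist
/// the new one, so the rest of the process operates only on v2.
fn load_and_upgrade(path: &Path) -> std::io::Result<BootstrapLedger> {
    let raw = fs::read_to_string(path)?;
    let ledger = parse_ledger(&raw).expect("unparseable ledger");
    let upgraded = BootstrapLedger { version: 2, rack_id: ledger.rack_id };
    fs::write(path, format!("version=2\nrack_id={}\n", upgraded.rack_id))?;
    Ok(upgraded)
}
```

Because the format change is applied once at startup and is only visible locally, it satisfies both rules from the earlier section.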

===== Migration of state from sled-agent to Nexus

Now we get to what has been the hairiest of the problems for data compatibility across versions. As we add more features and make our system more consistent in its promise that Nexus, not sled-agent, manages state for the control plane, we have realized that Nexus sometimes lacks the information needed to take over this responsibility. In such cases, when performing https://github.com/oxidecomputer/omicron/blob/5b865b74208ce0a11b8aec1bca12e2a6ea538bb6/sled-agent/src/sim/server.rs#L254[RSS handoff to Nexus], we have had to add new state to the handoff message so that Nexus can create a blueprint to drive the rest of the system to its desired state via https://github.com/oxidecomputer/omicron/blob/main/docs/reconfigurator.adoc[Reconfigurator]. However, this only works for new rack deployments, where we actually run RSS. For existing deployments that have already gone through initial rack setup, the new Nexus code does not have enough information to proceed with running Reconfigurator. In this case we must **backfill** that information. This can be, and has been, done in a variety of ways: we sometimes have to add new data to CRDB, sometimes modify a schema and backfill columns, and other times retrieve important data from sled-agent and store it in existing placeholders in blueprints. In any event, doing this is tricky and affects how legible the code is, how testable it is, and how correct it is under all circumstances. It is for this reason that we proposed the rules for data compatibility in the prior section, which largely align with how we do ledger updates.
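One common shape of this backfill is worth sketching, with hypothetical names throughout (`Blueprint` here is a stand-in, and `external_dns_servers` is an invented field, not one of omicron's): a newly added blueprint field is typed as an `Option` so that state from racks set up before the field existed still loads, and a single backfill step fills it in from sled-agent-reported data.

```rust
// Hypothetical, simplified blueprint type for illustration only.
#[derive(Debug, PartialEq)]
struct Blueprint {
    // None on deployments whose blueprints predate this field.
    external_dns_servers: Option<Vec<String>>,
}

/// One centralized backfill step (per the earlier rules): consult
/// sled-agent-reported state only when the blueprint predates the
/// field; otherwise leave the blueprint untouched.
fn backfill_dns(bp: &mut Blueprint, reported: Vec<String>) {
    if bp.external_dns_servers.is_none() {
        bp.external_dns_servers = Some(reported);
    }
}
```

Keeping the backfill in one idempotent function makes it easy to test both the "old rack" and "new rack" paths, and easy to delete once every deployment has the field populated.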

== Cold start

"Cold start" refers to starting the control plane from a rack that's completely powered off. Achieving this requires careful consideration of where configuration is stored and how configuration changes flow through the system.
Unfortunately, most of these RFDs are not yet public.
* [[[rfd210, RFD 210]]] https://rfd.shared.oxide.computer/rfd/210/[RFD 210 Omicron, service processors, and power shelf controllers]
* [[[rfd248, RFD 248]]] https://rfd.shared.oxide.computer/rfd/248/[RFD 248 Omicron service discovery: server side]
* [[[rfd373, RFD 373]]] https://rfd.shared.oxide.computer/rfd/373/[RFD 373 Reliable Persistent Workflows]
* [[[rfd421, RFD 421]]] https://rfd.shared.oxide.computer/rfd/0421[RFD 421 Using OpenAPI as a locus of update compatibility]
* [[[rfd468, RFD 468]]] https://rfd.shared.oxide.computer/rfd/468/[RFD 468 Rolling out replicated ClickHouse to new and existing racks]
* [[[rfd527, RFD 527]]] https://rfd.shared.oxide.computer/rfd/0527[RFD 527 Online database schema updates]
