
v0.4.0

@psFried psFried released this 24 Jun 13:55
· 519 commits to master since this release
4d74515

This release introduces a number of big changes in different areas, including:

  • Schema evolution
  • Inferred schema handling
  • Flowctl
  • Re-using old spec names
  • General control-plane operation

Schema evolution

Schema evolution in streaming systems is hard. When we first released Flow, the approach to schema evolution was one of its "killer features" because we were able to validate type-compatibility of heterogeneous pipelines (e.g. Postgres->BigQuery) end-to-end. But detecting incompatible schema changes is one thing, and deciding what to do about them is another. Real-life data pipelines have many complex requirements, which clearly can't be handled by the one evolveIncompatibleCollections boolean we had on capture specs. People wanted more control over how our automation responds to incompatible schema changes.

So we're introducing a new onIncompatibleSchemaChange field on materialization specs, which allows you to configure how the system responds when incompatible schema changes are detected. You can specify onIncompatibleSchemaChange at the top level of a materialization spec, and/or as part of each binding. The top-level property serves as a default for any binding that does not set its own onIncompatibleSchemaChange. It has four possible values:

  • backfill (default if unspecified): increment the backfill counter of affected bindings, which re-creates the destination resources to fit the new schema and backfills them.
  • disableBinding: disable the affected bindings. A human will need to re-enable them and decide how to resolve the incompatible fields.
  • disableTask: disable the entire materialization. A human will need to re-enable it and decide how to resolve the incompatible fields.
  • abort: don't take any automated action. A human will need to decide what to do.

These behaviors apply only when an automated action observes an incompatible schema change. If you're making changes manually via the UI, onIncompatibleSchemaChange is ignored.
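
To make the fallback order concrete, here is a minimal Python sketch of how the effective value for a binding could be resolved: the binding's own setting first, then the top-level setting, then the backfill default. The function name, the dict layout, and the collection names are illustrative assumptions; this is not Flow's actual implementation.

```python
# The four values listed in these release notes.
VALID_ACTIONS = {"backfill", "disableBinding", "disableTask", "abort"}


def effective_action(spec: dict, binding_index: int) -> str:
    """Return the onIncompatibleSchemaChange value that applies to one binding.

    Hypothetical helper: a binding-level value wins, then the top-level
    value, then the documented default of "backfill".
    """
    binding = spec["bindings"][binding_index]
    action = (
        binding.get("onIncompatibleSchemaChange")
        or spec.get("onIncompatibleSchemaChange")
        or "backfill"
    )
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown onIncompatibleSchemaChange value: {action!r}")
    return action


# Hypothetical materialization spec, trimmed to the fields relevant here.
example_spec = {
    "onIncompatibleSchemaChange": "disableBinding",
    "bindings": [
        {"source": "acmeCo/orders"},
        {"source": "acmeCo/customers", "onIncompatibleSchemaChange": "abort"},
    ],
}
```

In this sketch the first binding inherits disableBinding from the top level, while the second binding's own abort setting takes precedence.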

Note: You won't see onIncompatibleSchemaChange in the main UI yet, but it can now be set using flowctl or the "Advanced specification editor".

Note: With the introduction of onIncompatibleSchemaChange, the behavior of the existing evolveIncompatibleCollections field of captures no longer makes much sense. For the very short term, that behavior will remain unchanged. But soon we will seek to greatly simplify it. Today, that one boolean, on the capture spec, controls how the system responds to incompatible schema changes in any of the captured collections. In the future, evolveIncompatibleCollections will only pertain to collections that need to be re-created entirely. In other words, its meaning will be "re-create collections as necessary in order to publish them". In practice, this would only ever be required if you change either the key of the collection or the logical partitioning configuration.

Inferred schema handling

As a user, it has been hard to see what the inferred schema of a collection is at any given moment. That's changing, because we're moving to an approach where the inferred schema gets added directly to your collection specs. The inferred schema gets added under $defs with a key of flow://inferred-schema, so it's still possible to customize other parts of the read schema, just as you would have before. The difference is that you can now see the inferred schema that's being used for each collection.

But that's not the only difference, because you can now use inferred schemas with derivations, too! To do so, include "$ref": "flow://inferred-schema" in the derived collection's readSchema, as you would for any other collection. Our automation will periodically update the collection spec to inline the actual inferred schema as it notices it changing.
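
As a rough illustration of what "inlining" means here, the Python sketch below places an inferred schema under $defs with the key flow://inferred-schema while leaving the rest of the read schema untouched. The helper name and both example schemas are assumptions made for illustration, and the exact shape that Flow's automation writes may differ.

```python
import copy

INFERRED_REF = "flow://inferred-schema"


def inline_inferred_schema(read_schema: dict, inferred: dict) -> dict:
    """Return a copy of read_schema with the inferred schema stored under $defs.

    Hypothetical helper: the original read_schema is not modified, and any
    other definitions already present under $defs are preserved.
    """
    result = copy.deepcopy(read_schema)
    result.setdefault("$defs", {})[INFERRED_REF] = copy.deepcopy(inferred)
    return result


# Hypothetical readSchema: the inferred schema plus one hand-written constraint.
read_schema = {
    "allOf": [
        {"$ref": INFERRED_REF},
        {"required": ["id"]},
    ]
}

# Hypothetical inferred schema.
inferred = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
}

updated = inline_inferred_schema(read_schema, inferred)
```

The "$ref" entry stays in place, so the hand-written parts of the read schema keep working while the referenced definition becomes visible in the spec itself.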

Lastly, we're introducing a more aggressive heuristic for inferred schema updates. Collections that have more frequent inferred schema updates will be checked much more frequently, and inferred schemas that have gone a while without any updates will be checked somewhat less frequently, up to a maximum interval of every 2 hours.
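
The notes don't spell out the heuristic itself, so the following Python sketch shows only one possible policy of this kind: reset to a short interval when the inferred schema changed, otherwise back off up to the cap. Only the 2-hour maximum comes from the text above; the 5-minute minimum, the doubling factor, and the function name are assumptions.

```python
from datetime import timedelta

MAX_INTERVAL = timedelta(hours=2)    # the cap stated in these release notes
MIN_INTERVAL = timedelta(minutes=5)  # assumption, for illustration only


def next_check_interval(current: timedelta, schema_changed: bool) -> timedelta:
    """Pick the delay before the next inferred-schema check (hypothetical policy)."""
    if schema_changed:
        # A recent change suggests more changes may follow, so check again soon.
        return MIN_INTERVAL
    # No change observed: back off by doubling, but never beyond the cap.
    return min(current * 2, MAX_INTERVAL)
```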

Flowctl changes

All flowctl users will need to upgrade to the latest release in order to maintain compatibility.

In addition, there's some new behavior in flowctl to help prevent accidentally overwriting changes to specs. Flowctl will now set the expectPubId property whenever you run catalog pull-specs. This property contains the ID of the publication that most recently modified the spec. When publishing, we return an error if the spec has been published again since the publication identified by expectPubId. If this happens, you'll need to run catalog pull-specs again in order to get the freshest copy of the spec and try your changes again. This is especially important now that we inline inferred schemas as part of collection specs, as it prevents users from accidentally publishing an outdated inferred schema.
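
The check described above is a form of optimistic concurrency control. The Python sketch below models it with an in-memory stand-in for the control plane; the class, method names, and integer publication IDs are simplifications invented for illustration and do not reflect Flow's real API or ID format.

```python
from typing import Optional


class PublicationConflict(Exception):
    """The spec was published again after the local copy was pulled."""


class FakeControlPlane:
    """In-memory stand-in for the control plane (hypothetical, not Flow's API)."""

    def __init__(self) -> None:
        self._specs = {}  # spec name -> (last publication id, spec body)
        self._next_pub_id = 1

    def publish(self, name: str, spec: dict, expect_pub_id: Optional[int] = None) -> int:
        last_pub_id = self._specs[name][0] if name in self._specs else None
        # Reject the publication if someone else published since the pull.
        if expect_pub_id is not None and expect_pub_id != last_pub_id:
            raise PublicationConflict(
                f"{name}: expected publication {expect_pub_id}, found {last_pub_id}"
            )
        pub_id = self._next_pub_id
        self._next_pub_id += 1
        self._specs[name] = (pub_id, dict(spec))
        return pub_id

    def pull(self, name: str) -> dict:
        # Mirrors the idea of pull-specs recording which publication it saw.
        pub_id, spec = self._specs[name]
        return {**spec, "expectPubId": pub_id}
```

If two users pull the same spec and both try to publish, the first publication succeeds and the second raises PublicationConflict until that user pulls again.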

Re-using old spec names

Previously, our control plane would prevent you from re-using a name that you'd used before, even after deleting the original specs. This was because we used the spec names as the storage prefix in cloud storage buckets, so we couldn't be sure that a new collection would be starting out with an empty storage prefix if it had the same name as a previously deleted one. Now, we add a unique alphanumeric path segment to the cloud storage path for each journal, like acmeCo/my-collection/112233445566abcd/. If you delete acmeCo/my-collection, you can now create another collection with the same name, and it will have a different alphanumeric suffix. The previous naming restriction was a common source of annoyance, so we're glad to finally get this working in a way that's much more in line with user expectations.

Note that cloud storage paths for existing collections and task recovery logs will remain unchanged. The suffix will only be added for new specifications.
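
As a small sketch of the idea, the Python function below builds a storage prefix by appending a fresh 16-character segment to the spec name, matching the shape of the acmeCo/my-collection/112233445566abcd/ example. How Flow actually generates the segment isn't stated in these notes; using random hex from the secrets module is an assumption, as is the function name.

```python
import secrets


def storage_prefix(collection_name: str) -> str:
    """Build a storage prefix with a unique path segment (hypothetical helper).

    Assumption: the segment is 16 random hex characters. The release notes
    only say it is a unique alphanumeric path segment.
    """
    suffix = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{collection_name}/{suffix}/"


# Deleting and re-creating a collection yields a different prefix each time.
old_prefix = storage_prefix("acmeCo/my-collection")
new_prefix = storage_prefix("acmeCo/my-collection")
```

Because each incarnation of a name gets its own segment, a re-created collection starts from an empty prefix even though the name is reused.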

General control-plane operation

These changes are grouped together because they were all enabled by the same fundamental changes to the code that handles publications and background automations.

We've made publications faster and more reliable by minimizing the tasks that get re-validated as part of a given publication. For example, if you publish a materialization, we no longer re-validate other materializations that happen to source from the same collections. And we now update the data-plane shard/journal specs (that represent the actual work/data of your pipelines) asynchronously, after the publication has committed. This keeps the UI faster, and also allows our data-plane updates to be more reliable.

Finally, we introduced a new internal framework for writing background automations. This is what has enabled the changes to inferred schema handling, schema evolution, and our asynchronous shard/journal spec updates. We're looking forward to many more features that are enabled by this framework.