diff --git a/site/docs/concepts/advanced/evolutions.md b/site/docs/concepts/advanced/evolutions.md
index 28c3f882e1..8b8d73fdd1 100644
--- a/site/docs/concepts/advanced/evolutions.md
+++ b/site/docs/concepts/advanced/evolutions.md
@@ -53,12 +53,12 @@ When you attempt to publish a breaking change to a collection in the Flow web ap
 
-Click the **Apply** button to trigger an evolution and update all necessary specification to keep your Data Flow functioning. Then, review and publish your draft.
+Click the **Apply** button to trigger an evolution and update all necessary specifications to keep your Data Flow functioning. Then, review and publish your draft.
 
-If you enabled [AutoDiscover](../captures.md#autodiscover) on a capture, any breaking changes that it introduces will trigger an automatic schema evolution, so long as you selected the **Breaking change re-versions collections** option(`evolveIncompatibleCollections`).
+If you enabled [AutoDiscover](../captures.md#autodiscover) on a capture, any breaking changes that it introduces will trigger an automatic schema evolution, so long as you selected the **Breaking change re-versions collections** option (`evolveIncompatibleCollections`).
 
 ## What do schema evolutions do?
 
 The schema evolution feature is available in the Flow web app when you're editing pre-existing Flow entities.
-It notices when one of your edit would cause other components of the Data Flow to fail, alerts you, and gives you the option to automatically update the specs of these components to prevent failure.
+It notices when one of your edits would cause other components of the Data Flow to fail, alerts you, and gives you the option to automatically update the specs of these components to prevent failure.
 
 In other words, evolutions happen in the *draft* state. Whenever you edit, you create a draft. Evolutions add to the draft so that when it is published and updates the active data flow, operations can continue seamlessly.
diff --git a/site/docs/concepts/collections.md b/site/docs/concepts/collections.md
index dfd160905e..9930c1c534 100644
--- a/site/docs/concepts/collections.md
+++ b/site/docs/concepts/collections.md
@@ -332,7 +332,7 @@ If desired, a derivation could re-key the collection
 on `[/userId, /name]` to materialize the various `/name`s seen for a `/userId`.
 
 This property makes keys less lossy than they might otherwise appear,
-and it is generally good practice to chose a key that reflects how
+and it is generally good practice to choose a key that reflects how
 you wish to _query_ a collection, rather than an exhaustive key
 that's certain to be unique for every document.
diff --git a/site/docs/concepts/connectors.md b/site/docs/concepts/connectors.md
index bfb312f23b..f2ea738dac 100644
--- a/site/docs/concepts/connectors.md
+++ b/site/docs/concepts/connectors.md
@@ -219,7 +219,7 @@ sops:
 ```
 
 You then use this `config.yaml` within your Flow specification.
-The Flow runtime knows that this document is protected by `sops`
+The Flow runtime knows that this document is protected by `sops`,
 will continue to store it in its protected form, and will attempt
 a decryption only when invoking a connector on your behalf.
diff --git a/site/docs/concepts/derivations.md b/site/docs/concepts/derivations.md
index 1714de71e1..de7e5745e3 100644
--- a/site/docs/concepts/derivations.md
+++ b/site/docs/concepts/derivations.md
@@ -218,8 +218,8 @@ into JSON arrays or objects and embeds them into the mapped document:
 `{"greeting": "hello", "items": [1, "two", 3]}`.
 If parsing fails, the raw string is used instead.
 
-If you would like to select all columns of the input collection,
-rather than `select *`, use `select JSON($flow_document)`, e.g. 
-`select JSON($flow_document where $status = open;`.
+If you would like to select all columns of the input collection,
+rather than `select *`, use `select JSON($flow_document)`, e.g.
+`select JSON($flow_document) where $status = 'open';`.
 
-As a special case if your query selects a _single_ column
+As a special case, if your query selects a _single_ column
@@ -608,6 +608,7 @@ Flow read delays are very efficient and scale better than managing very
 large numbers of fine-grain timers.
 
 [See Grouped Windows of Transfers for an example using a read delay](#grouped-windows-of-transfers)
+
 [Learn more from the Citi Bike "idle bikes" example](https://github.com/estuary/flow/blob/master/examples/citi-bike/idle-bikes.flow.yaml)
 
 ### Read priority
@@ -639,7 +640,7 @@ For SQLite derivations,
 the entire SQLite database is the internal state of the task.
 TypeScript derivations can use in-memory states
 with a recovery and checkpoint mechanism.
 
-Estuary intends to offer an additional mechanisms for
+Estuary intends to offer additional mechanisms for
 automatic internal state snapshot and recovery in the future.
 
-The exact nature of internal task states vary,
+The exact nature of internal task states varies,
diff --git a/site/docs/concepts/import.md b/site/docs/concepts/import.md
index c5435b50d0..9645a61232 100644
--- a/site/docs/concepts/import.md
+++ b/site/docs/concepts/import.md
@@ -3,7 +3,7 @@ sidebar_position: 7
 ---
 
 # Imports
 
-When you work on a draft Data Flow [using `flowctl draft`](../concepts/flowctl.md#working-with-drafts),
+When you work on a draft Data Flow [using `flowctl draft`](../guides/flowctl/edit-draft-from-webapp.md),
 your Flow specifications may be spread across multiple files.
 For example, you may have multiple **materializations** that read from collections defined in separate files,
 or you could store a **derivation** separately from its **tests**.
diff --git a/site/docs/concepts/materialization.md b/site/docs/concepts/materialization.md
index e714aeabb8..2a300a3fd9 100644
--- a/site/docs/concepts/materialization.md
+++ b/site/docs/concepts/materialization.md
@@ -26,7 +26,7 @@ You define and configure materializations in **Flow specifications**.
 
 Materializations use real-time [connectors](./connectors.md) to connect to many endpoint types.
 
 When you use a materialization connector in the Flow web app,
-flow helps you configure it through the **discovery** workflow.
+Flow helps you configure it through the **discovery** workflow.
 
 To begin discovery, you tell Flow the connector you'd like to use, basic information about the endpoint,
 and the collection(s) you'd like to materialize there.
@@ -67,7 +67,7 @@ materializations:
           # Name of the collection to be read.
           # Required.
           name: acmeCo/example/collection
-          # Lower bound date-time for documents which should be processed. 
+          # Lower bound date-time for documents which should be processed.
           # Source collection documents published before this date-time are filtered.
           # `notBefore` is *only* a filter. Updating its value will not cause Flow
           # to re-process documents that have already been read.
@@ -93,11 +93,11 @@ materializations:
         # Priority applied to documents processed by this binding.
         # When all bindings are of equal priority, documents are processed
         # in order of their associated publishing time.
-        # 
+        #
         # However, when one binding has a higher priority than others,
         # then *all* ready documents are processed through the binding
         # before *any* documents of other bindings are processed.
-        # 
+        #
         # Optional. Default: 0, integer >= 0
         priority: 0
@@ -362,24 +362,27 @@ field implemented. Consult the individual connector documentation for details.
 
 ### How It Works
 
 1. 
**Source Capture Level:** - - If the source capture provides a schema or namespace, it will be used as the default schema for all bindings in - - the materialization. + + If the source capture provides a schema or namespace, it will be used as the default schema for all bindings in the materialization. 2. **Manual Overrides:** - - You can still manually configure schema names for each binding, overriding the default schema if needed. + + You can still manually configure schema names for each binding, overriding the default schema if needed. 3. **Materialization-Level Configuration:** - - The default schema name can be set at the materialization level, ensuring that all new captures within that - - materialization automatically inherit the default schema name. + + The default schema name can be set at the materialization level, ensuring that all new captures within that materialization automatically inherit the default schema name. ### Configuration Steps 1. **Set Default Schema at Source Capture Level:** - - When defining your source capture, specify the schema or namespace. If no schema is provided, Estuary Flow will - - automatically assign a default schema. - + + When defining your source capture, specify the schema or namespace. If no schema is provided, Estuary Flow will automatically assign a default schema. + 2. **Override Schema at Binding Level:** - - For any binding, you can manually override the default schema by specifying a different schema name. + + For any binding, you can manually override the default schema by specifying a different schema name. 3. **Set Default Schema at Materialization Level:** - - During the materialization configuration, set a default schema name for all captures within the materialization. + + During the materialization configuration, set a default schema name for all captures within the materialization. diff --git a/site/docs/concepts/schemas.md b/site/docs/concepts/schemas.md index b1f92d2a97..812e0c0ca2 100644 --- a/site/docs/concepts/schemas.md +++ b/site/docs/concepts/schemas.md @@ -45,7 +45,7 @@ Flow can usually generate suitable JSON schemas on your behalf. For systems like relational databases, Flow will typically generate a complete JSON schema by introspecting the table definition. -For systems that store unstructured data, Flow will typically generate a very minimal schema, and will rely on schema inferrence to fill in the details. See [continuous schema inferenece](#continuous-schema-inference) for more information. +For systems that store unstructured data, Flow will typically generate a very minimal schema, and will rely on schema inference to fill in the details. See [continuous schema inference](#continuous-schema-inference) for more information. ### Translations @@ -72,7 +72,7 @@ Schema inference is also used to provide translations into other schema flavors: ### Annotations The JSON Schema standard introduces the concept of -[annotations](http://json-schema.org/understanding-json-schema/reference/generic.html#annotations), +[annotations](https://json-schema.org/understanding-json-schema/reference/annotations), which are keywords that attach metadata to a location within a validated JSON document. 
 For example, `title` and `description` can be used to annotate a schema with its meaning:
diff --git a/site/docs/concepts/storage-mappings.md b/site/docs/concepts/storage-mappings.md
index 92d143d6cf..07cd39270b 100644
--- a/site/docs/concepts/storage-mappings.md
+++ b/site/docs/concepts/storage-mappings.md
@@ -22,7 +22,7 @@ Flow tasks — captures, derivations, and materializations — use recovery logs
 Recovery logs are an opaque binary log, but may contain user data.
 
 The recovery logs of a task are always prefixed by `recovery/`,
-so a task named `acmeCo/produce-TNT` would have a recovery log called `recovery/acmeCo/roduce-TNT`
+so a task named `acmeCo/produce-TNT` would have a recovery log called `recovery/acmeCo/produce-TNT`.
 
 Flow prunes data from recovery logs once it is no longer required.
diff --git a/site/docs/guides/flowctl/edit-draft-from-webapp.md b/site/docs/guides/flowctl/edit-draft-from-webapp.md
index a1d08ffa32..cbc4cbfa4f 100644
--- a/site/docs/guides/flowctl/edit-draft-from-webapp.md
+++ b/site/docs/guides/flowctl/edit-draft-from-webapp.md
@@ -41,13 +41,13 @@ Drafts aren't currently visible in the Flow web app, but you can get a list with
 
 2. Run `flowctl draft list`
 
-   flowctl outputs a table of all the drafts to which you have access, from oldest to newest. 
+   flowctl outputs a table of all the drafts to which you have access, from oldest to newest.
 
 3. Use the name and timestamp to find the draft you're looking for.
 
-   Each draft has an **ID**, and most have a name in the **Details** column. Note the **# of Specs** column. 
-   For drafts created in the web app, materialization drafts will always contain one specification. 
-   A number higher than 1 indicates a capture with its associated collections. 
+   Each draft has an **ID**, and most have a name in the **Details** column. Note the **# of Specs** column.
+   For drafts created in the web app, materialization drafts will always contain one specification.
+   A number higher than 1 indicates a capture with its associated collections.
 
 4. Copy the draft ID.
 
@@ -57,10 +57,10 @@ Drafts aren't currently visible in the Flow web app, but you can get a list with
 
 7. Browse the source files.
 
-   The source files and their directory structure will look slightly different depending on the draft. 
-   Regardless, there will always be a top-level file called `flow.yaml` that *imports* all other YAML files, 
-   which you'll find in a subdirectory named for your catalog prefix. 
-   These, in turn, contain the specifications you'll want to edit. 
+   The source files and their directory structure will look slightly different depending on the draft.
+   Regardless, there will always be a top-level file called `flow.yaml` that *imports* all other YAML files,
+   which you'll find in a subdirectory named for your catalog prefix.
+   These, in turn, contain the specifications you'll want to edit.
 
 ## Edit the draft and publish
 
@@ -76,7 +76,7 @@ Next, you'll make changes to the specification(s), test, and publish the draft.
 
 3. When you're done, sync the local work to the global draft: `flowctl draft author --source flow.yaml`.
 
-   Specifying the top-level `flow.yaml` file as the source ensures that all entities in the draft are imported. 
+   Specifying the top-level `flow.yaml` file as the source ensures that all entities in the draft are imported.
 
 4. 
Publish the draft: `flowctl draft publish` diff --git a/site/docs/guides/flowctl/edit-specification-locally.md b/site/docs/guides/flowctl/edit-specification-locally.md index 8c95b612d8..f91cd64109 100644 --- a/site/docs/guides/flowctl/edit-specification-locally.md +++ b/site/docs/guides/flowctl/edit-specification-locally.md @@ -79,7 +79,7 @@ Using these names, you'll identify and pull the relevant specifications for edit * Pull a group of specifications by prefix or type filter, for example: `flowctl catalog pull-specs --prefix myOrg/marketing --collections` - The source files are written to your current working directory. + The source files are written to your current working directory. 4. Browse the source files. @@ -106,15 +106,15 @@ Next, you'll complete your edits, test that they were performed correctly, and r 3. When you're done, you can test your changes: `flowctl catalog test --source flow.yaml` - You'll almost always use the top-level `flow.yaml` file as the source here because it imports all other Flow specifications - in your working directory. + You'll almost always use the top-level `flow.yaml` file as the source here because it imports all other Flow specifications + in your working directory. - Once the test has passed, you can publish your specifications. + Once the test has passed, you can publish your specifications. 4. Re-publish all the specifications you pulled: `flowctl catalog publish --source flow.yaml` - Again you'll almost always want to use the top-level `flow.yaml` file. If you want to publish only certain specifications, - you can provide a path to a different file. + Again you'll almost always want to use the top-level `flow.yaml` file. If you want to publish only certain specifications, + you can provide a path to a different file. 5. Return to the web app or use `flowctl catalog list` to check the status of the entities you just published. Their publication time will be updated to reflect the work you just did. diff --git a/site/docs/guides/schema-evolution.md b/site/docs/guides/schema-evolution.md index ccf2f5925f..2976b69b56 100644 --- a/site/docs/guides/schema-evolution.md +++ b/site/docs/guides/schema-evolution.md @@ -173,7 +173,7 @@ Regardless of whether the field is materialized or not, it must still pass schem Database and data warehouse materializations tend to be somewhat restrictive about changing column types. They typically only allow dropping `NOT NULL` constraints. This means that you can safely change a schema to make a required field optional, or to add `null` as a possible type, and the materialization will continue to work normally. Most other types of changes will require materializing into a new table. -The best way to find out whether a change is acceptable to a given connector is to run test or attempt to re-publish. Failed attempts to publish won't affect any tasks that are already running. +The best way to find out whether a change is acceptable to a given connector is to run a test or attempt to re-publish. Failed attempts to publish won't affect any tasks that are already running. **Web app workflow** diff --git a/site/docs/guides/system-specific-dataflows/s3-to-snowflake.md b/site/docs/guides/system-specific-dataflows/s3-to-snowflake.md index de9e8b891f..738a36a157 100644 --- a/site/docs/guides/system-specific-dataflows/s3-to-snowflake.md +++ b/site/docs/guides/system-specific-dataflows/s3-to-snowflake.md @@ -52,7 +52,7 @@ credentials provided by your Estuary account manager. 3. Find the **Amazon S3** tile and click **Capture**. 
- A form appears with the properties required for an S3 capture. + A form appears with the properties required for an S3 capture. 4. Type a name for your capture. @@ -69,23 +69,23 @@ credentials provided by your Estuary account manager. * **Prefix**: You might organize your S3 bucket using [prefixes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html), which emulate a directory structure. To capture *only* from a specific prefix, add it here. - * **Match Keys**: Filters to apply to the objects in the S3 bucket. If provided, only data whose absolute path matches the filter will be captured. For example, `*\.json` will only capture JSON file. + * **Match Keys**: Filters to apply to the objects in the S3 bucket. If provided, only data whose absolute path matches the filter will be captured. For example, `*\.json` will only capture JSON files. See the S3 connector documentation for information on [advanced fields](../../reference/Connectors/capture-connectors/amazon-s3.md#endpoint) and [parser settings](../../reference/Connectors/capture-connectors/amazon-s3.md#advanced-parsing-cloud-storage-data). (You're unlikely to need these for most use cases.) 6. Click **Next**. - Flow uses the provided configuration to initiate a connection to S3. + Flow uses the provided configuration to initiate a connection to S3. - It generates a permissive schema and details of the Flow collection that will store the data from S3. + It generates a permissive schema and details of the Flow collection that will store the data from S3. - You'll have the chance to tighten up each collection's JSON schema later, when you materialize to Snowflake. + You'll have the chance to tighten up each collection's JSON schema later, when you materialize to Snowflake. 7. Click **Save and publish**. - You'll see a notification when the capture publishes successfully. + You'll see a notification when the capture publishes successfully. - The data currently in your S3 bucket has been captured, and future updates to it will be captured continuously. + The data currently in your S3 bucket has been captured, and future updates to it will be captured continuously. 8. Click **Materialize Collections** to continue. @@ -95,7 +95,7 @@ Next, you'll add a Snowflake materialization to connect the captured data to its 1. Locate the **Snowflake** tile and click **Materialization**. - A form appears with the properties required for a Snowflake materialization. + A form appears with the properties required for a Snowflake materialization. 2. Choose a unique name for your materialization like you did when naming your capture; for example, `acmeCo/mySnowflakeMaterialization`. @@ -112,12 +112,12 @@ Next, you'll add a Snowflake materialization to connect the captured data to its 4. Click **Next**. - Flow uses the provided configuration to initiate a connection to Snowflake. + Flow uses the provided configuration to initiate a connection to Snowflake. - You'll be notified if there's an error. In that case, fix the configuration form or Snowflake setup as needed and click **Next** to try again. + You'll be notified if there's an error. In that case, fix the configuration form or Snowflake setup as needed and click **Next** to try again. - Once the connection is successful, the Endpoint Config collapses and the **Source Collections** browser becomes prominent. - It shows the collection you captured previously, which will be mapped to a Snowflake table. 
+   Once the connection is successful, the Endpoint Config collapses and the **Source Collections** browser becomes prominent.
+   It shows the collection you captured previously, which will be mapped to a Snowflake table.
 
 5. In the **Collection Selector**, optionally change the name in the **Table** field.
 
@@ -127,9 +127,9 @@ Next, you'll add a Snowflake materialization to connect the captured data to its
 
 7. Apply a stricter schema to the collection for the materialization.
 
-   S3 has a flat data structure. 
-   To materialize this data effectively to Snowflake, you should apply a schema that can translate to a table structure. 
-   Flow's **Schema Inference** tool can help. 
+   S3 has a flat data structure.
+   To materialize this data effectively to Snowflake, you should apply a schema that can translate to a table structure.
+   Flow's **Schema Inference** tool can help.
 
    1. In the **Source Collections** browser, click the collection's **Collection** tab.
diff --git a/site/docs/guides/transform_data_using_typescript.md b/site/docs/guides/transform_data_using_typescript.md
index 3df0a292a4..53950c6d0a 100644
--- a/site/docs/guides/transform_data_using_typescript.md
+++ b/site/docs/guides/transform_data_using_typescript.md
@@ -273,7 +273,8 @@ You can use `flowctl` to quickly verify your derivation before publishing it. Us
 As you can see, the output format matches the defined schema.
 
 The last step would be to publish your derivation to Flow, which you can also do using `flowctl`.
-:::warning Publishing the derivation will initialize the transformation on the live, real-time Wikipedia stream, make sure to delete it after completing the tutorial.
+:::warning
+Publishing the derivation will initialize the transformation on the live, real-time Wikipedia stream. Make sure to delete it after completing the tutorial.
 :::
 
 ```shell
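 # The diff is truncated mid-hunk here. As a hedged sketch only: the elided
 # line is assumed to be the publish command shown earlier in these docs
 # (edit-specification-locally.md); it is not confirmed by this hunk.
 flowctl catalog publish --source flow.yaml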