From ecf301bf7a5df2a340b4e770c6f1d920aa14b804 Mon Sep 17 00:00:00 2001 From: Ryan Kuo Date: Wed, 18 Dec 2024 17:01:18 -0500 Subject: [PATCH] MOLT Fetch 1.2.1 updates --- .../_includes/molt/replicator-flags.md | 86 +++++----- src/current/molt/molt-fetch.md | 158 ++++++++++-------- src/current/releases/molt.md | 12 ++ 3 files changed, 147 insertions(+), 109 deletions(-) diff --git a/src/current/_includes/molt/replicator-flags.md b/src/current/_includes/molt/replicator-flags.md index bb8cd855d6f..18f04714a35 100644 --- a/src/current/_includes/molt/replicator-flags.md +++ b/src/current/_includes/molt/replicator-flags.md @@ -2,44 +2,44 @@ The following flags are set with [`--replicator-flags`](#global-flags) and can be used in any [Fetch mode](#fetch-mode) that involves replication. -| Flag | Type | Description | -|------------------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------| -| `--applyTimeout` | `DURATION` | The maximum amount of time to wait for an update to be applied.

**Default:** `30s` | -| `--dlqTableName` | `IDENT` | The name of a table in the target schema for storing dead-letter entries.

**Default:** `replicator_dlq` | -| `--flushPeriod` | `DURATION` | Flush queued mutations after this duration.

**Default:** `1s` | -| `--flushSize` | `INT` | Ideal batch size to determine when to flush mutations.

**Default:** `1000` | -| `--gracePeriod` | `DURATION` | Allow background processes to exit.

**Default:** `30s` | -| `--logDestination` | `STRING` | Write logs to a file. If not specified, write logs to `stdout`. | -| `--logFormat` | `STRING` | Choose log output format: `"fluent"`, `"text"`.

**Default:** `"text"` | -| `--metricsAddr` | `STRING` | A `host:port` on which to serve metrics and diagnostics. | -| `--parallelism` | `INT` | The number of concurrent database transactions to use.

**Default:** `16` | -| `--quiescentPeriod` | `DURATION` | How often to retry deferred mutations.

**Default:** `10s` | -| `--retireOffset` | `DURATION` | How long to delay removal of applied mutations.

**Default:** `24h0m0s` | -| `--scanSize` | `INT` | The number of rows to retrieve from the staging database used to store metadata for [replication modes](#fetch-mode).

**Default:** `10000` | -| `--schemaRefresh` | `DURATION` | How often a watcher will refresh its schema. If this value is zero or negative, refresh behavior will be disabled.

**Default:** `1m0s` | -| `--sourceConn` | `STRING` | The source database's connection string. | -| `--stageMarkAppliedLimit` | `INT` | Limit the number of mutations to be marked applied in a single statement.

**Default:** `100000` | -| `--stageSanityCheckPeriod` | `DURATION` | How often to validate staging table apply order (`-1` to disable).

**Default:** `10m0s` | -| `--stageSanityCheckWindow` | `DURATION` | How far back to look when validating staging table apply order.

**Default:** `1h0m0s` | -| `--stageUnappliedPeriod` | `DURATION` | How often to report the number of unapplied mutations in staging tables (`-1` to disable).

**Default:** `1m0s` | -| `--stagingConn` | `STRING` | The staging database's connection string. | -| `--stagingCreateSchema` | | Automatically create the staging schema if it does not exist. | -| `--stagingIdleTime` | `DURATION` | Maximum lifetime of an idle connection.

**Default:** `1m0s` | -| `--stagingJitterTime` | `DURATION` | The time over which to jitter database pool disconnections.

**Default:** `15s` | -| `--stagingMaxLifetime` | `DURATION` | The maximum lifetime of a database connection.

**Default:** `5m0s` | -| `--stagingMaxPoolSize` | `INT` | The maximum number of staging database connections.

**Default:** `128` | -| `--stagingSchema` | `ATOM` | A SQL database schema to store metadata in.

**Default:** `_replicator.public` | -| `--targetConn` | `STRING` | The target database's connection string. | -| `--targetIdleTime` | `DURATION` | Maximum lifetime of an idle connection.

**Default:** `1m0s` | -| `--targetJitterTime` | `DURATION` | The time over which to jitter database pool disconnections.

**Default:** `15s` | -| `--targetMaxLifetime` | `DURATION` | The maximum lifetime of a database connection.

**Default:** `5m0s` | -| `--targetMaxPoolSize` | `INT` | The maximum number of target database connections.

**Default:** `128` | -| `--targetSchema` | `ATOM` | The SQL database schema in the target cluster to update. | -| `--targetStatementCacheSize` | `INT` | The maximum number of prepared statements to retain.

**Default:** `128` | -| `--taskGracePeriod` | `DURATION` | How long to allow for task cleanup when recovering from errors.

**Default:** `1m0s` | -| `--timestampLimit` | `INT` | The maximum number of source timestamps to coalesce into a target transaction.

**Default:** `1000` | -| `--userscript` | `STRING` | The path to a configuration script, see `userscript` subcommand. | -| `-v`, `--verbose` | `COUNT` | Increase logging verbosity to `debug`; repeat for `trace`. | +| Flag | Type | Description | +|------------------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `--applyTimeout` | `DURATION` | The maximum amount of time to wait for an update to be applied.

**Default:** `30s` | +| `--dlqTableName` | `IDENT` | The name of a table in the target schema for storing dead-letter entries.

**Default:** `replicator_dlq` | +| `--flushPeriod` | `DURATION` | Flush queued mutations after this duration.

**Default:** `1s` | +| `--flushSize` | `INT` | Ideal batch size to determine when to flush mutations.

**Default:** `1000` |
+| `--gracePeriod` | `DURATION` | The time to allow for background processes to exit.

**Default:** `30s` | +| `--logDestination` | `STRING` | Write logs to a file. If not specified, write logs to `stdout`. | +| `--logFormat` | `STRING` | Choose log output format: `"fluent"`, `"text"`.

**Default:** `"text"` | +| `--metricsAddr` | `STRING` | A `host:port` on which to serve metrics and diagnostics. | +| `--parallelism` | `INT` | The number of concurrent database transactions to use.

**Default:** `16` | +| `--quiescentPeriod` | `DURATION` | How often to retry deferred mutations.

**Default:** `10s` | +| `--retireOffset` | `DURATION` | How long to delay removal of applied mutations.

**Default:** `24h0m0s` | +| `--scanSize` | `INT` | The number of rows to retrieve from the staging database used to store metadata for [replication modes](#fetch-mode).

**Default:** `10000` | +| `--schemaRefresh` | `DURATION` | How often a watcher will refresh its schema. If this value is zero or negative, refresh behavior will be disabled.

**Default:** `1m0s` | +| `--sourceConn` | `STRING` | The source database's connection string. | +| `--stageMarkAppliedLimit` | `INT` | Limit the number of mutations to be marked applied in a single statement.

**Default:** `100000` | +| `--stageSanityCheckPeriod` | `DURATION` | How often to validate staging table apply order (`-1` to disable).

**Default:** `10m0s` | +| `--stageSanityCheckWindow` | `DURATION` | How far back to look when validating staging table apply order.

**Default:** `1h0m0s` | +| `--stageUnappliedPeriod` | `DURATION` | How often to report the number of unapplied mutations in staging tables (`-1` to disable).

**Default:** `1m0s` | +| `--stagingConn` | `STRING` | The staging database's connection string. | +| `--stagingCreateSchema` | | Automatically create the staging schema if it does not exist. | +| `--stagingIdleTime` | `DURATION` | Maximum lifetime of an idle connection.

**Default:** `1m0s` | +| `--stagingJitterTime` | `DURATION` | The time over which to jitter database pool disconnections.

**Default:** `15s` | +| `--stagingMaxLifetime` | `DURATION` | The maximum lifetime of a database connection.

**Default:** `5m0s` | +| `--stagingMaxPoolSize` | `INT` | The maximum number of staging database connections.

**Default:** `128` | +| `--stagingSchema` | `ATOM` | Name of the SQL database schema that stores replication metadata. **Required** each time [`--mode replication-only`](#replicate-changes) is rerun after being interrupted, as the schema provides a replication marker for streaming changes. For details, refer to [Replicate changes](#replicate-changes).

**Default:** `_replicator.public` | +| `--targetConn` | `STRING` | The target database's connection string. | +| `--targetIdleTime` | `DURATION` | Maximum lifetime of an idle connection.

**Default:** `1m0s` | +| `--targetJitterTime` | `DURATION` | The time over which to jitter database pool disconnections.

**Default:** `15s` | +| `--targetMaxLifetime` | `DURATION` | The maximum lifetime of a database connection.

**Default:** `5m0s` | +| `--targetMaxPoolSize` | `INT` | The maximum number of target database connections.

**Default:** `128` | +| `--targetSchema` | `ATOM` | The SQL database schema in the target cluster to update. | +| `--targetStatementCacheSize` | `INT` | The maximum number of prepared statements to retain.

**Default:** `128` | +| `--taskGracePeriod` | `DURATION` | How long to allow for task cleanup when recovering from errors.

**Default:** `1m0s` | +| `--timestampLimit` | `INT` | The maximum number of source timestamps to coalesce into a target transaction.

**Default:** `1000` | +| `--userscript` | `STRING` | The path to a configuration script, see `userscript` subcommand. | +| `-v`, `--verbose` | `COUNT` | Increase logging verbosity to `debug`; repeat for `trace`. | ##### PostgreSQL replication flags @@ -55,11 +55,11 @@ The following flags are set with [`--replicator-flags`](#global-flags) and can b The following flags are set with [`--replicator-flags`](#global-flags) and can be used in any [Fetch mode](#fetch-mode) that involves replication from a [MySQL source database](#source-and-target-databases). -| Flag | Type | Description | -|--------------------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `--defaultGTIDSet` | `STRING` | Default GTID set, in the format `source_uuid:min(interval_start)-max(interval_end)`. **Required** the first time [`--mode replication-only`](#replicate-changes) is run, as this provides a replication marker for streaming changes. | -| `--fetchMetadata` | | Fetch column metadata explicitly, for older versions of MySQL that don't support `binlog_row_metadata`. | -| `--replicationProcessID` | `UINT32` | The replication process ID to report to the source database.

**Default:** `10` | +| Flag | Type | Description | +|--------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `--defaultGTIDSet` | `STRING` | Default GTID set, in the format `source_uuid:min(interval_start)-max(interval_end)`. **Required** the first time [`--mode replication-only`](#replicate-changes) is run, as the GTID set provides a replication marker for streaming changes. For details, refer to [Replicate changes](#replicate-changes). | +| `--fetchMetadata` | | Fetch column metadata explicitly, for older versions of MySQL that don't support `binlog_row_metadata`. | +| `--replicationProcessID` | `UINT32` | The replication process ID to report to the source database.

**Default:** `10` | ##### Failback replication flags diff --git a/src/current/molt/molt-fetch.md b/src/current/molt/molt-fetch.md index ab498f491c5..f52af203078 100644 --- a/src/current/molt/molt-fetch.md +++ b/src/current/molt/molt-fetch.md @@ -37,7 +37,7 @@ Complete the following items before using MOLT Fetch: - For PostgreSQL sources, enable logical replication. In `postgresql.conf` or in the SQL shell, set [`wal_level`](https://www.postgresql.org/docs/current/runtime-config-wal.html) to `logical`. - - For MySQL sources, enable [GTID](https://dev.mysql.com/doc/refman/8.0/en/replication-options-gtids.html) consistency. In `mysql.cnf`, in the SQL shell, or as flags in the `mysql` start command, set `gtid-mode` and `enforce-gtid-consistency` to `ON` and set `binlog_row_metadata` to `full`. + - For MySQL **8.0 and later** sources, enable [GTID](https://dev.mysql.com/doc/refman/8.0/en/replication-options-gtids.html) consistency. In `mysql.cnf`, in the SQL shell, or as flags in the `mysql` start command, set `gtid-mode` and `enforce-gtid-consistency` to `ON` and set `binlog_row_metadata` to `full`. For MySQL **5.7** sources, in addition to the preceding settings, also set `log-bin` to `log-bin` and `server-id` to a unique integer that differs from any other MySQL server you have in your cluster (e.g., `3`). - Percent-encode the connection strings for the source database and [CockroachDB]({% link {{site.current_cloud_version}}/connect-to-the-database.md %}). This ensures that the MOLT tools can parse special characters in your password. @@ -73,7 +73,7 @@ Complete the following items before using MOLT Fetch: - If a MySQL database is set as a [source](#source-and-target-databases), the [`--table-concurrency`](#global-flags) and [`--export-concurrency`](#global-flags) flags **cannot** be set above `1`. If these values are changed, MOLT Fetch returns an error. This guarantees consistency when moving data from MySQL, due to MySQL limitations. MySQL data is migrated to CockroachDB one table and shard at a time, using [`WITH CONSISTENT SNAPSHOT`](https://dev.mysql.com/doc/refman/8.0/en/commit.html) transactions. -- To prevent memory outages during data export of tables with large rows, estimate the amount of memory used to export a table: +- To prevent memory outages during `READ COMMITTED` data export of tables with large rows, estimate the amount of memory used to export a table: ~~~ --row-batch-size * --export-concurrency * average size of the table rows @@ -185,49 +185,52 @@ To verify that your connections and configuration work properly, run MOLT Fetch ### Global flags -| Flag | Description | -|-----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `--source` | (Required) Connection string for the source database. 
For details, see [Source and target databases](#source-and-target-databases). | -| `--target` | (Required) Connection string for the target database. For details, see [Source and target databases](#source-and-target-databases). | -| `--allow-tls-mode-disable` | Allow insecure connections to databases. Secure SSL/TLS connections should be used by default. This should be enabled **only** if secure SSL/TLS connections to the source or target database are not possible. | -| `--bucket-path` | The path within the [cloud storage](#cloud-storage) bucket where intermediate files are written (e.g., `'s3://bucket/path'` or `'gs://bucket/path'`). Only the path is used; query parameters (e.g., credentials) are ignored. | -| `--changefeeds-path` | Path to a JSON file that contains changefeed override settings for [failback](#fail-back-to-source-database), when enabled with `--mode failback`. If not specified, an insecure default configuration is used, and `--allow-tls-mode-disable` must be included. For details, see [Fail back to source database](#fail-back-to-source-database). | -| `--cleanup` | Whether to delete intermediate files after moving data using [cloud or local storage](#data-path). **Note:** Cleanup does not occur on [continuation](#fetch-continuation). | -| `--compression` | Compression method for data when using [`IMPORT INTO`](#data-movement) (`gzip`/`none`).

**Default:** `gzip` | -| `--continuation-file-name` | Restart fetch at the specified filename if the task encounters an error. `--fetch-id` must be specified. For details, see [Fetch continuation](#fetch-continuation). | -| `--continuation-token` | Restart fetch at a specific table, using the specified continuation token, if the task encounters an error. `--fetch-id` must be specified. For details, see [Fetch continuation](#fetch-continuation). | -| `--crdb-pts-duration` | The duration for which each timestamp used in data export from a CockroachDB source is protected from garbage collection. This ensures that the data snapshot remains consistent. For example, if set to `24h`, each timestamp is protected for 24 hours from the initiation of the export job. This duration is extended at regular intervals specified in `--crdb-pts-refresh-interval`.

**Default:** `24h0m0s` | -| `--crdb-pts-refresh-interval` | The frequency at which the protected timestamp's validity is extended. This interval maintains protection of the data snapshot until data export from a CockroachDB source is completed. For example, if set to `10m`, the protected timestamp's expiration will be extended by the duration specified in `--crdb-pts-duration` (e.g., `24h`) every 10 minutes while export is not complete.

**Default:** `10m0s` | -| `--direct-copy` | Enables [direct copy](#direct-copy), which copies data directly from source to target without using an intermediate store. | -| `--export-concurrency` | Number of shards to export at a time, each on a dedicated thread. This only applies when exporting data from the source database, not when loading data into the target database. Only tables with [primary key]({% link {{ site.current_cloud_version }}/primary-key.md %}) types of [`INT`]({% link {{ site.current_cloud_version }}/int.md %}), [`FLOAT`]({% link {{ site.current_cloud_version }}/float.md %}), or [`UUID`]({% link {{ site.current_cloud_version }}/uuid.md %}) can be sharded. The number of concurrent threads is the product of `--export-concurrency` and `--table-concurrency`.

This value **cannot** be set higher than `1` when moving data from MySQL. Refer to [Best practices](#best-practices).

**Default:** `4` with a PostgreSQL source; `1` with a MySQL source | -| `--fetch-id` | Restart fetch task corresponding to the specified ID. If `--continuation-file-name` or `--continuation-token` are not specified, fetch restarts for all failed tables. | -| `--flush-rows` | Number of rows before the source data is flushed to intermediate files. **Note:** If `--flush-size` is also specified, the fetch behavior is based on the flag whose criterion is met first. | -| `--flush-size` | Size (in bytes) before the source data is flushed to intermediate files. **Note:** If `--flush-rows` is also specified, the fetch behavior is based on the flag whose criterion is met first. | -| `--import-batch-size` | The number of files to be imported at a time to the target database. This applies only when using [`IMPORT INTO`](#data-movement) to load data into the target. **Note:** Increasing this value can improve the performance of full-scan queries on the target database shortly after fetch completes, but very high values are not recommended. If any individual file in the import batch fails, you must [retry](#fetch-continuation) the entire batch.

**Default:** `1000` | -| `--local-path` | The path within the [local file server](#local-file-server) where intermediate files are written (e.g., `data/migration/cockroach`). `--local-path-listen-addr` must be specified. | -| `--local-path-crdb-access-addr` | Address of a [local file server](#local-file-server) that is **publicly accessible**. This flag is only necessary if CockroachDB cannot reach the local address specified with `--local-path-listen-addr` (e.g., when moving data to a CockroachDB {{ site.data.products.cloud }} deployment). `--local-path` and `--local-path-listen-addr` must be specified.

**Default:** Value of `--local-path-listen-addr`. | -| `--local-path-listen-addr` | Write intermediate files to a [local file server](#local-file-server) at the specified address (e.g., `'localhost:3000'`). `--local-path` must be specified. | -| `--log-file` | Write messages to the specified log filename. If no filename is provided, messages write to `fetch-{datetime}.log`. If `"stdout"` is provided, messages write to `stdout`. | -| `--logging` | Level at which to log messages (`trace`/`debug`/`info`/`warn`/`error`/`fatal`/`panic`).

**Default:** `info` | -| `--metrics-listen-addr` | Address of the Prometheus metrics endpoint, which has the path `{address}/metrics`. For details on important metrics to monitor, see [Metrics](#metrics).

**Default:** `'127.0.0.1:3030'` | -| `--mode` | Configure the MOLT Fetch behavior: `data-load`, `data-load-and-replication`, `replication-only`, `export-only`, or `import-only`. For details, refer to [Fetch mode](#fetch-mode).

**Default:** `data-load` | -| `--non-interactive` | Run the fetch task without interactive prompts. This is recommended **only** when running `molt fetch` in an automated process (i.e., a job or continuous integration). | -| `--pglogical-replication-slot-drop-if-exists` | Drop the replication slot, if specified with `--pglogical-replication-slot-name`. Otherwise, the default replication slot is not dropped. | -| `--pglogical-replication-slot-name` | The name of a replication slot to create before taking a snapshot of data (e.g., `'fetch'`). **Required** in order to perform continuous [replication](#load-data-and-replicate-changes) from a source PostgreSQL database. | -| `--pglogical-replication-slot-plugin` | The output plugin used for logical replication under `--pglogical-replication-slot-name`.

**Default:** `pgoutput` | -| `--pprof-listen-addr` | Address of the pprof endpoint.

**Default:** `'127.0.0.1:3031'` | -| `--replicator-flags` | If continuous [replication](#load-data-and-replicate-changes) is enabled with `--mode data-load-and-replication`, `--mode replication-only`, or `--mode failback`, specify [replication flags](#replication-flags) to override. For example: `--replicator-flags "--tlsCertificate ./certs/server.crt --tlsPrivateKey ./certs/server.key"` | -| `--row-batch-size` | Number of rows per shard to export at a time. See [Best practices](#best-practices).

**Default:** `100000` | -| `--schema-filter` | Move schemas that match a specified [regular expression](https://wikipedia.org/wiki/Regular_expression).

**Default:** `'.*'` | -| `--table-concurrency` | Number of tables to export at a time. The number of concurrent threads is the product of `--export-concurrency` and `--table-concurrency`.

This value **cannot** be set higher than `1` when moving data from MySQL. Refer to [Best practices](#best-practices).

**Default:** `4` with a PostgreSQL source; `1` with a MySQL source | -| `--table-exclusion-filter` | Exclude tables that match a specified [POSIX regular expression](https://wikipedia.org/wiki/Regular_expression).

This value **cannot** be set to `'.*'`, which would cause every table to be excluded.

**Default:** Empty string | -| `--table-filter` | Move tables that match a specified [POSIX regular expression](https://wikipedia.org/wiki/Regular_expression).

**Default:** `'.*'` | -| `--table-handling` | How tables are initialized on the target database (`none`/`drop-on-target-and-recreate`/`truncate-if-exists`). For details, see [Target table handling](#target-table-handling).

**Default:** `none` | -| `--transformations-file` | Path to a JSON file that defines transformations to be performed on the target schema during the fetch task. Refer to [Transformations](#transformations). | -| `--type-map-file` | Path to a JSON file that contains explicit type mappings for automatic schema creation, when enabled with `--table-handling drop-on-target-and-recreate`. For details on the JSON format and valid type mappings, see [type mapping](#type-mapping). | -| `--use-console-writer` | Use the console writer, which has cleaner log output but introduces more latency.

**Default:** `false` (log as structured JSON) | -| `--use-copy` | Use [`COPY FROM`](#data-movement) to move data. This makes tables queryable during data load, but is slower than using `IMPORT INTO`. For details, refer to [Data movement](#data-movement). | -| `--use-implicit-auth` | Use [implicit authentication]({% link {{ site.current_cloud_version }}/cloud-storage-authentication.md %}) for [cloud storage](#cloud-storage) URIs. | +| Flag | Description | +|------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `--source` | (Required) Connection string for the source database. For details, see [Source and target databases](#source-and-target-databases). | +| `--target` | (Required) Connection string for the target database. For details, see [Source and target databases](#source-and-target-databases). | +| `--allow-tls-mode-disable` | Allow insecure connections to databases. Secure SSL/TLS connections should be used by default. This should be enabled **only** if secure SSL/TLS connections to the source or target database are not possible. | +| `--assume-role` | Service account to use for assume role authentication. `--use-implicit-auth` must be included. For example, `--assume-role='user-test@cluster-ephemeral.iam.gserviceaccount.com' --use-implicit-auth`. For details, refer to [Cloud Storage Authentication]({% link {{ site.current_cloud_version }}/cloud-storage-authentication.md %}). | +| `--bucket-path` | The path within the [cloud storage](#cloud-storage) bucket where intermediate files are written (e.g., `'s3://bucket/path'` or `'gs://bucket/path'`). Only the path is used; query parameters (e.g., credentials) are ignored. | +| `--changefeeds-path` | Path to a JSON file that contains changefeed override settings for [failback](#fail-back-to-source-database), when enabled with `--mode failback`. If not specified, an insecure default configuration is used, and `--allow-tls-mode-disable` must be included. For details, see [Fail back to source database](#fail-back-to-source-database). | +| `--cleanup` | Whether to delete intermediate files after moving data using [cloud or local storage](#data-path). **Note:** Cleanup does not occur on [continuation](#fetch-continuation). | +| `--compression` | Compression method for data when using [`IMPORT INTO`](#data-movement) (`gzip`/`none`).

**Default:** `gzip` | +| `--continuation-file-name` | Restart fetch at the specified filename if the task encounters an error. `--fetch-id` must be specified. For details, see [Fetch continuation](#fetch-continuation). | +| `--continuation-token` | Restart fetch at a specific table, using the specified continuation token, if the task encounters an error. `--fetch-id` must be specified. For details, see [Fetch continuation](#fetch-continuation). | +| `--crdb-pts-duration` | The duration for which each timestamp used in data export from a CockroachDB source is protected from garbage collection. This ensures that the data snapshot remains consistent. For example, if set to `24h`, each timestamp is protected for 24 hours from the initiation of the export job. This duration is extended at regular intervals specified in `--crdb-pts-refresh-interval`.

**Default:** `24h0m0s` | +| `--crdb-pts-refresh-interval` | The frequency at which the protected timestamp's validity is extended. This interval maintains protection of the data snapshot until data export from a CockroachDB source is completed. For example, if set to `10m`, the protected timestamp's expiration will be extended by the duration specified in `--crdb-pts-duration` (e.g., `24h`) every 10 minutes while export is not complete.

**Default:** `10m0s` | +| `--direct-copy` | Enables [direct copy](#direct-copy), which copies data directly from source to target without using an intermediate store. | +| `--export-concurrency` | Number of shards to export at a time, each on a dedicated thread. This only applies when exporting data from the source database, not when loading data into the target database. Only tables with [primary key]({% link {{ site.current_cloud_version }}/primary-key.md %}) types of [`INT`]({% link {{ site.current_cloud_version }}/int.md %}), [`FLOAT`]({% link {{ site.current_cloud_version }}/float.md %}), or [`UUID`]({% link {{ site.current_cloud_version }}/uuid.md %}) can be sharded. The number of concurrent threads is the product of `--export-concurrency` and `--table-concurrency`.

This value **cannot** be set higher than `1` when moving data from MySQL. Refer to [Best practices](#best-practices).

**Default:** `4` with a PostgreSQL source; `1` with a MySQL source |
+| `--fetch-id` | Restart the fetch task corresponding to the specified ID. If neither `--continuation-file-name` nor `--continuation-token` is specified, fetch restarts for all failed tables. |
+| `--flush-rows` | Number of rows before the source data is flushed to intermediate files. **Note:** If `--flush-size` is also specified, the fetch behavior is based on the flag whose criterion is met first. |
+| `--flush-size` | Size (in bytes) before the source data is flushed to intermediate files. **Note:** If `--flush-rows` is also specified, the fetch behavior is based on the flag whose criterion is met first. |
+| `--import-batch-size` | The number of files to be imported at a time to the target database. This applies only when using [`IMPORT INTO`](#data-movement) to load data into the target. **Note:** Increasing this value can improve the performance of full-scan queries on the target database shortly after fetch completes, but very high values are not recommended. If any individual file in the import batch fails, you must [retry](#fetch-continuation) the entire batch.

**Default:** `1000` | +| `--local-path` | The path within the [local file server](#local-file-server) where intermediate files are written (e.g., `data/migration/cockroach`). `--local-path-listen-addr` must be specified. | +| `--local-path-crdb-access-addr` | Address of a [local file server](#local-file-server) that is **publicly accessible**. This flag is only necessary if CockroachDB cannot reach the local address specified with `--local-path-listen-addr` (e.g., when moving data to a CockroachDB {{ site.data.products.cloud }} deployment). `--local-path` and `--local-path-listen-addr` must be specified.

**Default:** Value of `--local-path-listen-addr`. | +| `--local-path-listen-addr` | Write intermediate files to a [local file server](#local-file-server) at the specified address (e.g., `'localhost:3000'`). `--local-path` must be specified. | +| `--log-file` | Write messages to the specified log filename. If no filename is provided, messages write to `fetch-{datetime}.log`. If `"stdout"` is provided, messages write to `stdout`. | +| `--logging` | Level at which to log messages (`trace`/`debug`/`info`/`warn`/`error`/`fatal`/`panic`).

**Default:** `info` | +| `--metrics-listen-addr` | Address of the Prometheus metrics endpoint, which has the path `{address}/metrics`. For details on important metrics to monitor, see [Metrics](#metrics).

**Default:** `'127.0.0.1:3030'` | +| `--mode` | Configure the MOLT Fetch behavior: `data-load`, `data-load-and-replication`, `replication-only`, `export-only`, or `import-only`. For details, refer to [Fetch mode](#fetch-mode).

**Default:** `data-load` |
+| `--non-interactive` | Run the fetch task without interactive prompts. This is recommended **only** when running `molt fetch` in an automated process (e.g., a job or continuous integration). |
+| `--pglogical-publication-name` | If set, the name of the [publication](https://www.postgresql.org/docs/current/logical-replication-publication.html) that will be created or used for replication. Used in [`replication-only`](#replicate-changes) mode.

**Default:** `molt_fetch` |
+| `--pglogical-publication-and-slot-drop-and-recreate` | If set, drops the [publication](https://www.postgresql.org/docs/current/logical-replication-publication.html) and replication slot if they exist and then recreates them. Used in [`replication-only`](#replicate-changes) mode. |
+| `--pglogical-replication-slot-name` | The name of a replication slot to create before taking a snapshot of data (e.g., `'fetch'`). **Required** in order to perform continuous [replication](#load-data-and-replicate-changes) from a source PostgreSQL database. |
+| `--pglogical-replication-slot-plugin` | The output plugin used for logical replication under `--pglogical-replication-slot-name`.

**Default:** `pgoutput` | +| `--pprof-listen-addr` | Address of the pprof endpoint.

**Default:** `'127.0.0.1:3031'` | +| `--replicator-flags` | If continuous [replication](#load-data-and-replicate-changes) is enabled with `--mode data-load-and-replication`, `--mode replication-only`, or `--mode failback`, specify [replication flags](#replication-flags) to override. For example: `--replicator-flags "--tlsCertificate ./certs/server.crt --tlsPrivateKey ./certs/server.key"` | +| `--row-batch-size` | Number of rows per shard to export at a time. See [Best practices](#best-practices).

**Default:** `100000` | +| `--schema-filter` | Move schemas that match a specified [regular expression](https://wikipedia.org/wiki/Regular_expression).

**Default:** `'.*'` | +| `--table-concurrency` | Number of tables to export at a time. The number of concurrent threads is the product of `--export-concurrency` and `--table-concurrency`.

This value **cannot** be set higher than `1` when moving data from MySQL. Refer to [Best practices](#best-practices).

**Default:** `4` with a PostgreSQL source; `1` with a MySQL source | +| `--table-exclusion-filter` | Exclude tables that match a specified [POSIX regular expression](https://wikipedia.org/wiki/Regular_expression).

This value **cannot** be set to `'.*'`, which would cause every table to be excluded.

**Default:** Empty string | +| `--table-filter` | Move tables that match a specified [POSIX regular expression](https://wikipedia.org/wiki/Regular_expression).

**Default:** `'.*'` | +| `--table-handling` | How tables are initialized on the target database (`none`/`drop-on-target-and-recreate`/`truncate-if-exists`). For details, see [Target table handling](#target-table-handling).

**Default:** `none` | +| `--transformations-file` | Path to a JSON file that defines transformations to be performed on the target schema during the fetch task. Refer to [Transformations](#transformations). | +| `--type-map-file` | Path to a JSON file that contains explicit type mappings for automatic schema creation, when enabled with `--table-handling drop-on-target-and-recreate`. For details on the JSON format and valid type mappings, see [type mapping](#type-mapping). | +| `--use-console-writer` | Use the console writer, which has cleaner log output but introduces more latency.

**Default:** `false` (log as structured JSON) |
+| `--use-copy` | Use [`COPY FROM`](#data-movement) to move data. This makes tables queryable during data load, but is slower than using `IMPORT INTO`. For details, refer to [Data movement](#data-movement). |
+| `--use-implicit-auth` | Use [implicit authentication]({% link {{ site.current_cloud_version }}/cloud-storage-authentication.md %}) for [cloud storage](#cloud-storage) URIs. |
+
 
 ### `tokens list` flags
 
@@ -290,10 +293,22 @@ MySQL:
 --mode data-load
 ~~~
 
+If the source is a PostgreSQL database and you intend to [replicate changes](#replicate-changes) afterward, **also** specify a replication slot name with `--pglogical-replication-slot-name`. MOLT Fetch will create a replication slot with this name. For example, the following snippet instructs MOLT Fetch to create a slot named `replication_slot` to use for replication:
+
+{% include_cached copy-clipboard.html %}
+~~~
+--mode data-load
+--pglogical-replication-slot-name 'replication_slot'
+~~~
+
+{{site.data.alerts.callout_success}}
+If you need to rename your [publication](https://www.postgresql.org/docs/current/logical-replication-publication.html), also include `--pglogical-publication-name` to specify the new publication name and `--pglogical-publication-and-slot-drop-and-recreate` to ensure that the publication and replication slot are created in the correct order. For details on these flags, refer to [Global flags](#global-flags).
+{{site.data.alerts.end}}
+
 #### Load data and replicate changes
 
 {{site.data.alerts.callout_info}}
-Before using this option, the source PostgreSQL or MySQL database **must** be configured for continuous replication, as described in [Setup](#replication-setup). MySQL 8.0 and later are supported.
+Before using this option, the source PostgreSQL or MySQL database **must** be configured for continuous replication, as described in [Setup](#replication-setup). MySQL 5.7 and later are supported.
 {{site.data.alerts.end}}
 
 `data-load-and-replication` instructs MOLT Fetch to load the source data into CockroachDB, and replicate any subsequent changes on the source.
 
@@ -303,7 +318,7 @@ Before using this option, the source PostgreSQL or MySQL database **must** be co
 --mode data-load-and-replication
 ~~~
 
-If the source is a PostgreSQL database, you must also specify a replication slot name. For example, the following snippet instructs MOLT Fetch to create a slot named `replication_slot` to use for replication:
+If the source is a PostgreSQL database, you **must** also specify a replication slot name with `--pglogical-replication-slot-name`. MOLT Fetch will create a replication slot with this name. For example, the following snippet instructs MOLT Fetch to create a slot named `replication_slot` to use for replication:
 
 {% include_cached copy-clipboard.html %}
 ~~~
 --mode data-load-and-replication
 --pglogical-replication-slot-name 'replication_slot'
 ~~~
 
+{{site.data.alerts.callout_success}}
+If you need to rename your [publication](https://www.postgresql.org/docs/current/logical-replication-publication.html), also include `--pglogical-publication-name` to specify the new publication name and `--pglogical-publication-and-slot-drop-and-recreate` to ensure that the publication and replication slot are created in the correct order. For details on these flags, refer to [Global flags](#global-flags).
+{{site.data.alerts.end}}
+
 Continuous replication begins once the initial load is complete, as indicated by a `fetch complete` message in the output.
 
 To cancel replication, enter `ctrl-c` to issue a `SIGTERM` signal. This returns an exit code `0`. If replication fails, a non-zero exit code is returned.
 
@@ -326,19 +345,12 @@ To customize the replication behavior (an advanced use case), use `--replicator-
 #### Replicate changes
 
 {{site.data.alerts.callout_info}}
-Before using this option, the source PostgreSQL or MySQL database **must** be configured for continuous replication, as described in [Setup](#replication-setup). MySQL 8.0 and later are supported.
+Before using this option, the source PostgreSQL or MySQL database **must** be configured for continuous replication, as described in [Setup](#replication-setup). MySQL 5.7 and later are supported.
 {{site.data.alerts.end}}
 
-`replication-only` instructs MOLT Fetch to replicate ongoing changes on the source to CockroachDB, using the specified replication marker.
+`replication-only` instructs MOLT Fetch to replicate ongoing changes on the source to CockroachDB, using the specified replication marker. This assumes you have already run [`--mode data-load`](#load-data) to load the source data into CockroachDB.
 
-- For a PostgreSQL source, first create a logical replication slot. For example, to create a replication slot named `replication_slot`:
-
-  {% include_cached copy-clipboard.html %}
-  ~~~ sql
-  SELECT * FROM pg_create_logical_replication_slot('replication_slot', 'pgoutput');
-  ~~~
-
-  In the `molt fetch` command, specify the replication slot name using `--pglogical-replication-slot-name`. For example:
+- For a PostgreSQL source, you should have already created a replication slot when [loading data](#load-data). Specify the same replication slot name using `--pglogical-replication-slot-name`. For example:
 
   {% include_cached copy-clipboard.html %}
   ~~~
@@ -346,6 +358,10 @@ Before using this option, the source PostgreSQL or MySQL database **must** be co
   --pglogical-replication-slot-name 'replication_slot'
   ~~~
 
+  {{site.data.alerts.callout_success}}
+  If you want to run `replication-only` without having already loaded data (e.g., for testing), also include `--pglogical-publication-name` to specify a [publication](https://www.postgresql.org/docs/current/logical-replication-publication.html) name and `--pglogical-publication-and-slot-drop-and-recreate` to ensure that the publication and replication slot are created in the correct order. For details on these flags, refer to [Global flags](#global-flags).
+  {{site.data.alerts.end}}
+
 - For a MySQL source, first get your GTID record:
 
   {% include_cached copy-clipboard.html %}
   ~~~ sql
   SELECT source_uuid, min(interval_start), max(interval_end)
   FROM mysql.gtid_executed
   GROUP BY source_uuid;
   ~~~
 
-   In the `molt fetch` command, [specify a GTID set](#mysql-replication-flags) using the format `source_uuid:min(interval_start)-max(interval_end)`. For example:
+   In the `molt fetch` command, specify a GTID set using the [`--defaultGTIDSet` replication flag](#mysql-replication-flags) and the format `source_uuid:min(interval_start)-max(interval_end)`.
For example: {% include_cached copy-clipboard.html %} ~~~ @@ -363,6 +379,16 @@ Before using this option, the source PostgreSQL or MySQL database **must** be co --replicator-flags "--defaultGTIDSet 'b7f9e0fa-2753-1e1f-5d9b-2402ac810003:3-21'" ~~~ +If replication is interrupted, specify the staging schema with the [`--stagingSchema` replication flag](#replication-flags). MOLT Fetch outputs the schema name as `staging database name: {schema_name}` after the initial run of `--mode replication-only`. + +{% include_cached copy-clipboard.html %} +~~~ +--mode replication-only +--replicator-flags "--stagingSchema {schema_name}" +~~~ + +You **must** include the `--stagingSchema` replication flag when resuming replication, as the schema provides a replication marker for streaming changes. + To cancel replication, enter `ctrl-c` to issue a `SIGTERM` signal. This returns an exit code `0`. If replication fails, a non-zero exit code is returned. #### Export data to storage @@ -409,7 +435,7 @@ When running `molt fetch --mode failback`, `--source` is the CockroachDB connect ~~~ {{site.data.alerts.callout_info}} -MySQL 8.0 and later are supported as MySQL failback targets. +MySQL 5.7 and later are supported as MySQL failback targets. {{site.data.alerts.end}} ##### Changefeed override settings @@ -894,13 +920,13 @@ By default, MOLT Fetch exports [Prometheus](https://prometheus.io/) metrics at ` Cockroach Labs recommends monitoring the following metrics: -| Metric Name | Description | -|---------------------------------------|--------------------------------------------------------------------------------------------------------------------| -| `molt_fetch_num_tables` | Number of tables that will be moved from the source. | -| `molt_fetch_num_task_errors` | Number of errors encountered by the fetch task. | -| `molt_fetch_overall_duration` | Duration (in seconds) of the fetch task. | -| `molt_fetch_rows_exported` | Number of rows that have been exported from a table. For example:
`molt_fetch_rows_exported{table="public.users"}` | -| `molt_fetch_rows_imported` | Number of rows that have been imported from a table. For example:
`molt_fetch_rows_imported{table="public.users"}` | +| Metric Name | Description | +|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------| +| `molt_fetch_num_tables` | Number of tables that will be moved from the source. | +| `molt_fetch_num_task_errors` | Number of errors encountered by the fetch task. | +| `molt_fetch_overall_duration` | Duration (in seconds) of the fetch task. | +| `molt_fetch_rows_exported` | Number of rows that have been exported from a table. For example:
`molt_fetch_rows_exported{table="public.users"}` | +| `molt_fetch_rows_imported` | Number of rows that have been imported from a table. For example:
`molt_fetch_rows_imported{table="public.users"}` | | `molt_fetch_table_export_duration_ms` | Duration (in milliseconds) of a table's export. For example:
`molt_fetch_table_export_duration_ms{table="public.users"}` | | `molt_fetch_table_import_duration_ms` | Duration (in milliseconds) of a table's import. For example:
`molt_fetch_table_import_duration_ms{table="public.users"}` |
 
diff --git a/src/current/releases/molt.md b/src/current/releases/molt.md
index a3e729318cc..1a1a52e189a 100644
--- a/src/current/releases/molt.md
+++ b/src/current/releases/molt.md
@@ -18,6 +18,18 @@ To download the latest MOLT Fetch/Verify binary:
 
 {% include molt/molt-install.md %}
 
+## December 13, 2024
+
+MOLT Fetch/Verify 1.2.1 is [available](#installation).
+
+- MOLT Fetch users can now use [`--assume-role`]({% link molt/molt-fetch.md %}#global-flags) to specify a service account for assume role authentication to cloud storage. `--assume-role` must be used with `--use-implicit-auth`, or it will be ignored.
+- MySQL 5.7 and later are now supported with MOLT Fetch replication modes. For details on setup, refer to the [MOLT Fetch documentation]({% link molt/molt-fetch.md %}#replication-setup).
+- Fetch replication mode now defaults to a less verbose `INFO` logging level. To specify `DEBUG` logging, pass `--replicator-flags '-v'`, or `--replicator-flags '-vv'` for trace logging.
+- MySQL columns of type `BIGINT UNSIGNED` or `SERIAL` are now auto-mapped to the [`DECIMAL`]({% link {{ site.current_cloud_version }}/decimal.md %}) type in CockroachDB. Regular MySQL `BIGINT` types are mapped to the [`INT`]({% link {{ site.current_cloud_version }}/int.md %}) type in CockroachDB.
+- The `pglogical` replication workflow was modified to enforce safer and simpler defaults for the [`data-load`]({% link molt/molt-fetch.md %}#load-data), [`data-load-and-replication`]({% link molt/molt-fetch.md %}#load-data-and-replicate-changes), and [`replication-only`]({% link molt/molt-fetch.md %}#replicate-changes) workflows for PostgreSQL sources. Fetch now ensures that the publication is created before the slot, and that `replication-only` defaults to using publications and slots created either in previous Fetch runs or manually.
+- Fixed scan iterator query ordering for `BINARY` and `TEXT` primary keys (of the same collation) so that the generated queries and their ordering are correct.
+- For a MySQL source in [`replication-only`]({% link molt/molt-fetch.md %}#replicate-changes) mode, the [`--stagingSchema` replicator flag]({% link molt/molt-fetch.md %}#replication-flags) can now be used to resume streaming replication after an interruption. Otherwise, the [`--defaultGTIDSet` replicator flag]({% link molt/molt-fetch.md %}#mysql-replication-flags) is used to start initial replication after a previous Fetch run in [`data-load`]({% link molt/molt-fetch.md %}#load-data) mode, or as an override to the current replication stream.
+
 ## October 29, 2024
 
 MOLT Fetch/Verify 1.2.0 is [available](#installation).