overview and monitoring pages

cockroachdb · Nov 5, 2024 · 629ac83 · 629ac83
1 parent fdb7802
commit 629ac83

Showing 7 changed files with 1,385 additions and 1 deletion.
diff --git a/src/current/_includes/v24.3/ldr/multiple-tables.md b/src/current/_includes/v24.3/ldr/multiple-tables.md
@@ -1 +1 @@
-There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level observability will be at the job level. One table in one LDR job will allow for table-level observability.
+There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all the tables. However, the most granular level observability will be at the job level. One table in one LDR job will allow for table-level observability.
diff --git a/src/current/_includes/v24.3/ldr/show-logical-replication-responses.md b/src/current/_includes/v24.3/ldr/show-logical-replication-responses.md
@@ -0,0 +1,8 @@
+Field    | Response
+---------+----------
+`job_id` | The job's ID. Use with [`CANCEL JOB`]({% link {{ page.version.version }}/cancel-job.md %}), [`PAUSE JOB`]({% link {{ page.version.version }}/pause-job.md %}), [`RESUME JOB`]({% link {{ page.version.version }}/resume-job.md %}), or [`SHOW JOBS`]({% link {{ page.version.version }}/show-jobs.md %}).
+`status` | The status of the job: `running`, `paused`, or `canceled`. {% comment  %}check these{% endcomment %}
+`targets` | The fully qualified name of the table(s) that are part of the LDR job.
+`replicated_time` | The latest timestamp at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. `replicated_time` is updated periodically (every 30s). {% comment %}To confirm this line is accurate{% endcomment %}
+`replication_start_time` | The start time of the LDR job.
+`conflict_resolution_type` | The type of [conflict resolution]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#conflict-resolution): `LWW` (last write wins).
diff --git a/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json b/src/current/_includes/v24.3/sidebar-data/cross-cluster-replication.json
@@ -5,6 +5,12 @@
       {
         "title": "Logical Data Replication",
         "items": [
+            {
+                "title": "Overview",
+                "urls": [
+                  "/${VERSION}/logical-data-replication-overview.html"
+                ]
+            },
             {
                 "title": "Set Up Logical Data Replication",
                 "urls": [
@@ -16,6 +22,12 @@
                 "urls": [
                     "/${VERSION}/manage-logical-data-replication.html"
                 ]
+            },
+            {
+                "title": "Monitor Logical Data Replication",
+                "urls": [
+                    "/${VERSION}/logical-data-replication-monitoring.html"
+                ]
             }
         ]
       }

diff --git a/src/current/images/v24.3/east-west-region.svg b/src/current/images/v24.3/east-west-region.svg
diff --git a/src/current/images/v24.3/unidirectional.svg b/src/current/images/v24.3/unidirectional.svg
diff --git a/src/current/v24.3/logical-data-replication-monitoring.md b/src/current/v24.3/logical-data-replication-monitoring.md
@@ -0,0 +1,149 @@
+---
+title: Logical Data Replication Monitoring
+summary: Monitor and observe LDR jobs between a source and destination table.
+toc: true
+docs_area: manage
+---
+
+{{site.data.alerts.callout_info}}
+{% include feature-phases/preview.md %}
+{{site.data.alerts.end}}
+
+You can monitor [**logical data replication (LDR)**]({% link {{ page.version.version }}/logical-data-replication-overview.md %}) using:
+
+- [`SHOW LOGICAL REPLICATION JOBS`](#sql-shell) in the SQL shell to view a list of LDR jobs on the cluster.
+- The **Logical Data Replication** dashboard on the [DB Console](#db-console) to view metrics at the cluster level. {% comment %}To add link later to dashboard page{% endcomment %}
+- [Prometheus and Alertmanager](#prometheus) to track and alert on LDR metrics.
+- Metrics export with [Datadog](#datadog).
+- [Metrics labels](#metrics-labels) to view metrics at the job level.
+
+{{site.data.alerts.callout_info}}
+{% include {{ page.version.version }}/ldr/multiple-tables.md %}
+{{site.data.alerts.end}}
+
+{% comment  %}To add to an include{% endcomment %}
+When you start LDR, one job is created on each cluster:
+
+- The _history retention job_ on the source cluster, which runs while the LDR job is active to protect changes in the table from [garbage collection]({% link {{ page.version.version }}/architecture/storage-layer.md %}#garbage-collection) until they have been applied to the destination cluster. The history retention job is viewable in the [DB Console](#db-console) or with [`SHOW JOBS`]({% link {{ page.version.version }}/show-jobs.md %}). Any manual changes to the history retention job could disrupt the LDR job.
+- The `logical replication` job on the destination cluster. You can view the status of this job in the SQL shell with `SHOW LOGICAL REPLICATION JOBS` and the DB Console [**Jobs** page](#jobs-page).
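+
+As a minimal sketch, you could list the replication-related jobs on either cluster by filtering the output of `SHOW JOBS`. The `ILIKE` pattern is an assumption for illustration, because this page does not give the exact `job_type` strings; adjust the filter to match the output on your cluster:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+-- Run on the source cluster to look for the history retention job, or on
+-- the destination cluster to look for the logical replication job.
+-- The '%REPLICATION%' pattern is an illustrative assumption.
+SELECT job_id, job_type, status, description
+FROM [SHOW JOBS]
+WHERE job_type ILIKE '%REPLICATION%';
+~~~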
+
+## SQL Shell
+
+In the destination cluster's SQL shell, you can query `SHOW LOGICAL REPLICATION JOBS` to view the LDR jobs running on the cluster:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+SHOW LOGICAL REPLICATION JOBS;
+~~~
+~~~
+        job_id        | status  |          targets          | replicated_time
+----------------------+---------+---------------------------+------------------
+1012877040439033857   | running | {database.public.table}   | NULL
+(1 row)
+~~~
+
+For additional detail on each LDR job, use the `WITH details` option:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+SHOW LOGICAL REPLICATION JOBS WITH details;
+~~~
+~~~
+        job_id        |  status  |            targets             |        replicated_time        |    replication_start_time     | conflict_resolution_type |                                      description
+----------------------+----------+--------------------------------+-------------------------------+-------------------------------+--------------------------+-----------------------------------------------------------------------------------------
+  1010959260799270913 | running  | {movr.public.promo_codes}      | 2024-10-24 17:50:05+00        | 2024-10-10 20:04:42.196982+00 | LWW                      | LOGICAL REPLICATION STREAM into movr.public.promo_codes from external://cluster_a
+  1014047902397333505 | canceled | {defaultdb.public.office_dogs} | 2024-10-24 17:30:25+00        | 2024-10-21 17:54:20.797643+00 | LWW                      | LOGICAL REPLICATION STREAM into defaultdb.public.office_dogs from external://cluster_a
+~~~
+
+### Responses
+
+{% include {{ page.version.version }}/ldr/show-logical-replication-responses.md %}
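+
+For example, the `job_id` value can be passed to the job control statements. The following sketch reuses the ID of the running job from the sample output above; substitute the ID returned on your own destination cluster:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+-- Pause the LDR job on the destination cluster.
+PAUSE JOB 1010959260799270913;
+
+-- Resume the LDR job once you are ready for replication to continue.
+RESUME JOB 1010959260799270913;
+~~~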
+
+## Recommended LDR metrics to track
+
+- Replication latency: The commit-to-commit replication latency. A _commit_ is when the LDR job either adds a row to the [dead letter queue (DLQ)]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) or applies a row successfully to the destination cluster.
+    - `logical_replication.commit_latency-p50`
+    - `logical_replication.commit_latency-p99`
+- Replication lag: How far behind the destination cluster is from the source cluster at a specific point in time. The replication lag is equivalent to [RPO]({% link {{ page.version.version }}/disaster-recovery-overview.md %}) during a disaster.
+    - `logical_replication.replicated_time_seconds`
+- Row updates applied: These metrics indicate whether the destination cluster is actively receiving and applying data from the source cluster.
+    - `logical_replication.events_ingested`
+    - `logical_replication.events_dlqed`
+- Events dead letter queued: How often the LDR job is putting writes in the DLQ because they cannot be applied successfully on the destination cluster.
+    - `logical_replication.events_dlqed_age`
+    - `logical_replication.events_dlqed_space`
+    - `logical_replication.events_dlqed_errtype`
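+
+As a rough cross-check of replication lag from the SQL shell, you could compare `replicated_time` with the current time on the destination cluster. This is a sketch only: it assumes that `SHOW LOGICAL REPLICATION JOBS` can be used as a data source in square brackets like other `SHOW` statements, and the `approximate_lag` alias is hypothetical. The result is `NULL` for any job that has not yet reported a `replicated_time`.
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+-- Approximate lag per LDR job: current time minus the latest consistent timestamp.
+-- Assumes the SHOW statement is usable inside [...] as a table expression.
+SELECT job_id, targets, now() - replicated_time AS approximate_lag
+FROM [SHOW LOGICAL REPLICATION JOBS];
+~~~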
+
+## DB Console
+
+In the DB Console, you can use:
+
+- The [**Metrics** dashboard]({% link {{ page.version.version }}/ui-overview-dashboard.md %}) for LDR to view metrics for the job on the destination cluster.
+- The [**Jobs** page]({% link {{ page.version.version }}/ui-jobs-page.md %}) to view the history retention job on the source cluster and the LDR job on the destination cluster.
+
+The metrics for LDR in the DB Console are at the **cluster** level. This means that if multiple LDR jobs are running on a cluster, the DB Console will show the average metrics across jobs.
+
+### Metrics dashboard
+
+You can use the **Logical Data Replication** dashboard of the destination cluster to monitor the following metric graphs at the **cluster** level:
+
+- Replication latency
+- Replication lag
+- Row updates applied
+- Logical bytes received
+- Batch application processing time: 50th percentile
+- Batch application processing time: 99th percentile
+- DLQ causes
+- Retry queue size
+
+{% comment  %}Dashboard page in the DB Console docs to be added with more information per other dashboards. Link to there from this section.{% endcomment %}
+
+To track replicated time, ingested events, and events added to the DLQ at the **job** level, refer to [Metrics labels](#metrics-labels).
+
+### Jobs page
+
+On the **Jobs** page, select:
+
+- The **Replication Producer** job in the source cluster's DB Console to view the _history retention job_.
+- The **Logical Replication Ingestion** job in the destination cluster's DB Console. When you start LDR, the **Logical Replication Ingestion** job will show a bar that tracks the initial scan progress of the source table's existing data.
+
+## Monitoring and alerting
+
+### Prometheus
+
+You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer to the [Monitor CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}) tutorial for steps to set up Prometheus.
+
+#### Metrics labels
+
+To view metrics at the job level, you can use the `label` option when you start LDR to add a metrics label to the LDR job. This enables the export of [child metrics]({% link {{ page.version.version }}/child-metrics.md %}), which are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:
+
+- `logical_replication.replicated_time_seconds`
+- `logical_replication.events_ingested`
+- `logical_replication.events_dlqed`
+
+To use metrics labels, ensure you have enabled the child metrics cluster setting:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+SET CLUSTER SETTING server.child_metrics.enabled = true;
+~~~
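+
+To confirm that the setting has taken effect before you start the LDR job, you could read it back. This is a small sketch that uses the standard `SHOW CLUSTER SETTING` statement:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+-- Expect a true value once child metrics export is enabled.
+SHOW CLUSTER SETTING server.child_metrics.enabled;
+~~~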
+
+When you start LDR, include the `label` option:
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name} 
+ON 'external://{source_external_connection}' 
+INTO TABLE {database.public.table_name} WITH label=ldr_job;
+~~~
+
+For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}) page.
+
+### Datadog
+
+You can export metrics to Datadog for LDR jobs. For steps to set up metrics export, refer to the [Monitor CockroachDB Self-Hosted with Datadog]({% link {{ page.version.version }}/datadog.md %}) tutorial.
+
+## See also
+
+- [Set Up Logical Data Replication]({% link {{ page.version.version }}/set-up-logical-data-replication.md %})
+- [Manage Logical Data Replication]({% link {{ page.version.version }}/manage-logical-data-replication.md %})
diff --git a/src/current/v24.3/logical-data-replication-overview.md b/src/current/v24.3/logical-data-replication-overview.md
@@ -0,0 +1,47 @@
+---
+title: Logical Data Replication
+summary: An overview of CockroachDB logical data replication (LDR).
+toc: true
+---
+
+{{site.data.alerts.callout_info}}
+{% include feature-phases/preview.md %}
+{{site.data.alerts.end}}
+
+{% include_cached new-in.html version="v24.3" %} **Logical data replication (LDR)** continuously replicates tables between active CockroachDB clusters. Application traffic can occur concurrently on both the source and destination clusters with the LDR job achieving eventual consistency in the replicating tables. The active-active setup can provide protection against cluster, datacenter, or region failure while still achieving single-region low latency reads and writes in the individual CockroachDB clusters. Each cluster in an LDR job still benefits individually from multi-active availability with CockroachDB's built-in Raft replication providing data consistency across nodes, zones, and regions.
+
+{{site.data.alerts.callout_success}}
+Cockroach Labs also has a [physical cluster replication]({% link {{ page.version.version }}/physical-cluster-replication-overview.md %}) tool that continuously sends data at the byte level from a primary cluster to an independent standby cluster.
+{{site.data.alerts.end}}
+
+## Use cases
+
+You can run LDR in a _unidirectional_ or _bidirectional_ setup to meet different use cases.
+
+### Bidirectional LDR
+
+Maintain high availability with a two-datacenter topology. You can run bidirectional LDR to ensure [data resilience]({% link {{ page.version.version }}/data-resilience.md %}) in your deployment, particularly in the event of a region failure. Both clusters can receive application reads and writes with low, single-region write latency. In a datacenter or cluster outage, you can redirect application traffic to the surviving cluster with [low downtime]({% link {{ page.version.version }}/data-resilience.md %}#high-availability). In the following diagram, the clusters are deployed in US East and US West to provide low latency for each region. The two LDR jobs ensure that the tables on both clusters will reach eventual consistency.
+
+<img src="{{ 'images/v24.3/east-west-region.svg' | relative_url }}" alt="Diagram showing bidirectional LDR from cluster A to B and back again from cluster B to A." style="width:50%" />
+
+### Unidirectional LDR
+
+Isolate critical application workloads from non-critical application workloads in a unidirectional setup. For example, you may want to run jobs like [changefeeds]({% link {{ page.version.version }}/change-data-capture-overview.md %}) or [backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}) from one cluster to isolate these jobs from the cluster receiving the principal application traffic.
+
+<img src="{{ 'images/v24.3/unidirectional.svg' | relative_url }}" alt="Diagram showing unidirectional LDR from a source cluster to a destination cluster with the destination cluster supporting secondary workloads plus jobs and the source cluster accepting the main application traffic." style="width:80%" />
+
+## Features
+
+- **Table-level replication**: When you initiate LDR, it will replicate all of the source table's existing data to the destination table. From then on, LDR will replicate the source table's data to the destination table to achieve eventual consistency.
+- **Last write wins conflict resolution**: LDR uses [_last write wins (LWW)_ conflict resolution]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#conflict-resolution), which will use the latest MVCC timestamp to resolve a conflict in row insertion.
+- **Dead letter queue (DLQ)**: When LDR starts, the job will create a [DLQ table]({% link {{ page.version.version }}/manage-logical-data-replication.md %}#dead-letter-queue-dlq) with each replicating table in order to track unresolved conflicts. You can interact with and manage this table like any other SQL table.
+- **Replication modes**: In LDR, you can use different _modes_ to apply different configurations to the replication. The modes will allow you to configure for throughput.
+- **Monitoring**: To monitor LDR's initial progress, current status, and performance, you can use metrics available in the DB Console, Prometheus, and Metrics Export.
+
+## Get started
+
+- To set up unidirectional or bidirectional LDR, follow the [Set Up Logical Data Replication]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}) tutorial.
+- Once you've set up LDR, use the [Manage Logical Data Replication]({% link {{ page.version.version }}/manage-logical-data-replication.md %}) page to coordinate and manage different parts of the job.
+- For an overview of metrics to track and monitoring tools, refer to the [Monitor Logical Data Replication]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}) page.
+
+{% comment  %}move known limitations to here after PR 1 merges{% endcomment %}