Add monitor latency section to group recommendations on changefeed latency (#18185)
cockroachdb · Jan 8, 2024 · 1a5547f · 1a5547f
1 parent f59d073
commit 1a5547f
Showing 6 changed files with 96 additions and 20 deletions.
diff --git a/src/current/_includes/v23.1/cdc/lagging-ranges.md b/src/current/_includes/v23.1/cdc/lagging-ranges.md
@@ -0,0 +1,10 @@
+{% include_cached new-in.html version="v23.1.12" %} Use the `changefeed.lagging_ranges` metric to track the number of ranges that are behind in a changefeed. The metric is calculated based on the following [cluster settings]({% link {{ page.version.version }}/cluster-settings.md %}):
+
+- `changefeed.lagging_ranges_threshold` sets a duration from the present: a range that is behind by more than this duration is considered to be lagging and is counted in the [`lagging_ranges`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#using-changefeed-metrics-labels) metric. Note that ranges undergoing an [initial scan]({% link {{ page.version.version }}/create-changefeed.md %}#initial-scan) for longer than the threshold duration are considered to be lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric for each range in the table. As ranges complete the initial scan, the number of lagging ranges will decrease.
+    - **Default:** `3m`
+- `changefeed.lagging_ranges_polling_interval` sets how often lagging ranges are checked and the `lagging_ranges` metric is updated. Because the metric is refreshed only at each poll, it can trail the actual state of the ranges. For example, if a range falls behind by 3 minutes, the metric may not update until an additional minute afterward.
+    - **Default:** `1m`
+
+{{site.data.alerts.callout_success}}
+You can use the [`metrics_label`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#using-changefeed-metrics-labels) option to track the `lagging_ranges` metric per changefeed.
+{{site.data.alerts.end}}
diff --git a/src/current/_includes/v23.2/cdc/lagging-ranges.md b/src/current/_includes/v23.2/cdc/lagging-ranges.md
@@ -0,0 +1,10 @@
+{% include_cached new-in.html version="v23.2" %} Use the `changefeed.lagging_ranges` metric to track the number of ranges that are behind in a changefeed. The metric is calculated based on the following [changefeed options]({% link {{ page.version.version }}/create-changefeed.md %}#options):
+
+- `lagging_ranges_threshold` sets a duration from the present: a range that is behind by more than this duration is considered to be lagging and is counted in the [`lagging_ranges`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#lagging-ranges-metric) metric. Note that ranges undergoing an [initial scan]({% link {{ page.version.version }}/create-changefeed.md %}#initial-scan) for longer than the threshold duration are considered to be lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric for each range in the table. As ranges complete the initial scan, the number of lagging ranges will decrease.
+    - **Default:** `3m`
+- `lagging_ranges_polling_interval` sets how often lagging ranges are checked and the `lagging_ranges` metric is updated. Because the metric is refreshed only at each poll, it can trail the actual state of the ranges. For example, if a range falls behind by 3 minutes, the metric may not update until an additional minute afterward.
+    - **Default:** `1m`
+
+{{site.data.alerts.callout_success}}
+You can use the [`metrics_label`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#using-changefeed-metrics-labels) option to track the `lagging_ranges` metric per changefeed.
+{{site.data.alerts.end}}
diff --git a/src/current/v23.1/advanced-changefeed-configuration.md b/src/current/v23.1/advanced-changefeed-configuration.md
@@ -71,16 +71,7 @@ For example, if you have a large table, and one of the nodes in the cluster is h
 
 ### Lagging ranges
 
-{% include_cached new-in.html version="v23.1.12" %} Use the `changefeed.lagging_ranges` metric to track the number of ranges that are behind in a changefeed. This is calculated based on the [cluster settings]({% link {{ page.version.version }}/cluster-settings.md %}):
-
-- `changefeed.lagging_ranges_threshold` sets a duration from the present that determines the length of time a range is considered to be lagging behind, which will then track in the [`lagging_ranges`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#using-changefeed-metrics-labels) metric. Note that ranges undergoing an [initial scan]({% link {{ page.version.version }}/create-changefeed.md %}#initial-scan) for longer than the threshold duration are considered to be lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric for each range in the table. As ranges complete the initial scan, the number of ranges lagging behind will decrease.
-    - **Default:** `3m`
-- `changefeed.lagging_ranges_polling_interval` sets the interval rate for when lagging ranges are checked and the `lagging_ranges` metric is updated. Polling adds latency to the `lagging_ranges` metric being updated. For example, if a range falls behind by 3 minutes, the metric may not update until an additional minute afterward.
-    - **Default:** `1m`
-
-{{site.data.alerts.callout_success}}
-You can use the [`metrics_label`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#using-changefeed-metrics-labels) option to track the `lagging_ranges` metric per changefeed.
-{{site.data.alerts.end}}
+{% include {{ page.version.version }}/cdc/lagging-ranges.md %}
 
 ## Tuning for high durability delivery
 

diff --git a/src/current/v23.1/monitor-and-debug-changefeeds.md b/src/current/v23.1/monitor-and-debug-changefeeds.md
@@ -157,6 +157,43 @@ changefeed_emitted_bytes{scope="vehicles"} 183557
 `backfill_pending_ranges` | Number of [ranges]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range) in an ongoing backfill that are yet to be fully emitted. | Ranges
 `message_size_hist` | Distribution in the size of emitted messages. | Bytes
 
+### Monitoring and measuring changefeed latency
+
+Changefeeds can encounter latency in event processing. This latency is the total time CockroachDB takes to:
+
+- Commit writes to the database.
+- Encode [changefeed messages]({% link {{ page.version.version }}/changefeed-messages.md %}).
+- Deliver the message to the [sink]({% link {{ page.version.version }}/changefeed-sinks.md %}).
+
+There are two ways to measure whether changefeeds are encountering latency or falling behind:
+
+- [Event latency](#event-latency): Measure the difference between an event's MVCC timestamp and when it is put into the memory buffer or acknowledged at the sink.
+- [Lagging ranges](#lagging-ranges): Track the number of [ranges]({% link {{ page.version.version }}/architecture/overview.md %}#range) that are behind in a changefeed.
+
+#### Event latency
+
+To monitor changefeeds for latency in how events are emitted, track the following metrics:
+
+- `admit_latency`: The difference between the event's MVCC timestamp and the time the event is put into the memory buffer.
+- `commit_latency`: The difference between the event's MVCC timestamp and the time it is acknowledged by the [downstream sink]({% link {{ page.version.version }}/changefeed-sinks.md %}). If the sink is batching events, the difference is between the oldest event in the batch and when the acknowledgment is recorded.
+
+{{site.data.alerts.callout_info}}
+The `admit_latency` and `commit_latency` metrics do **not** update for backfills during [initial scans]({% link {{ page.version.version }}/create-changefeed.md %}#initial-scan) or [backfills for schema changes]({% link {{ page.version.version }}/changefeed-messages.md %}#schema-changes-with-column-backfill). This is because a full table scan may contain rows that were written far in the past, which would lead to inaccurate changefeed latency measurements if the events from these scans were included in the metrics.
+{{site.data.alerts.end}}
+
+Both of these metrics support [metrics labels](#using-changefeed-metrics-labels). You can set the `metrics_label` option when starting a changefeed to differentiate metrics per changefeed.
+
+We recommend using the p99 `commit_latency` aggregation for alerting and for setting SLAs for your changefeeds. You can add these metrics (e.g., `changefeed.admit_latency-p90`) to a custom chart in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}); refer to the [Custom Chart debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}). Alternatively, you can track them with [Prometheus]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#prometheus-endpoint).
+
+If your changefeed is experiencing elevated latency, you can use these metrics to:
+
+- Review `admit_latency` versus `commit_latency` to estimate how long events take to move from the memory buffer to the downstream sink.
+- Compare the `commit_latency` P99, P90, and P50 latency percentiles to investigate performance over time.
+
+#### Lagging ranges
+
+{% include {{ page.version.version }}/cdc/lagging-ranges.md %}
+
 ## Debug a changefeed
 
 ### Using logs

diff --git a/src/current/v23.2/advanced-changefeed-configuration.md b/src/current/v23.2/advanced-changefeed-configuration.md
@@ -71,16 +71,7 @@ For example, if you have a large table, and one of the nodes in the cluster is h
 
 ### Lagging ranges
 
-{% include_cached new-in.html version="v23.2" %} Use the `changefeed.lagging_ranges` metric to track the number of ranges that are behind in a changefeed. This is calculated based on the [changefeed options]({% link {{ page.version.version }}/create-changefeed.md %}#options):
-
-- `lagging_ranges_threshold` sets a duration from the present that determines the length of time a range is considered to be lagging behind, which will then track in the [`lagging_ranges`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#lagging-ranges-metric) metric. Note that ranges undergoing an [initial scan]({% link {{ page.version.version }}/create-changefeed.md %}#initial-scan) for longer than the threshold duration are considered to be lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric for each range in the table. As ranges complete the initial scan, the number of ranges lagging behind will decrease.
-    - **Default:** `3m`
-- `lagging_ranges_polling_interval` sets the interval rate for when lagging ranges are checked and the `lagging_ranges` metric is updated. Polling adds latency to the `lagging_ranges` metric being updated. For example, if a range falls behind by 3 minutes, the metric may not update until an additional minute afterward.
-    - **Default:** `1m`
-
-{{site.data.alerts.callout_success}}
-You can use the [`metrics_label`]({% link {{ page.version.version }}/monitor-and-debug-changefeeds.md %}#using-changefeed-metrics-labels) option to track the `lagging_ranges` metric per changefeed.
-{{site.data.alerts.end}}
+{% include {{ page.version.version }}/cdc/lagging-ranges.md %}
 
 ## Tuning for high durability delivery
 

diff --git a/src/current/v23.2/monitor-and-debug-changefeeds.md b/src/current/v23.2/monitor-and-debug-changefeeds.md
@@ -157,6 +157,43 @@ changefeed_emitted_bytes{scope="vehicles"} 183557
 `backfill_pending_ranges` | Number of [ranges]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range) in an ongoing backfill that are yet to be fully emitted. | Ranges
 `message_size_hist` | Distribution in the size of emitted messages. | Bytes
 
+### Monitoring and measuring changefeed latency
+
+Changefeeds can encounter latency in event processing. This latency is the total time CockroachDB takes to:
+
+- Commit writes to the database.
+- Encode [changefeed messages]({% link {{ page.version.version }}/changefeed-messages.md %}).
+- Deliver the message to the [sink]({% link {{ page.version.version }}/changefeed-sinks.md %}).
+
+There are two ways to measure whether changefeeds are encountering latency or falling behind:
+
+- [Event latency](#event-latency): Measure the difference between an event's MVCC timestamp and when it is put into the memory buffer or acknowledged at the sink.
+- [Lagging ranges](#lagging-ranges): Track the number of [ranges]({% link {{ page.version.version }}/architecture/overview.md %}#range) that are behind in a changefeed.
+
+#### Event latency
+
+To monitor changefeeds for latency in how events are emitted, track the following metrics:
+
+- `admit_latency`: The difference between the event's MVCC timestamp and the time the event is put into the memory buffer.
+- `commit_latency`: The difference between the event's MVCC timestamp and the time it is acknowledged by the [downstream sink]({% link {{ page.version.version }}/changefeed-sinks.md %}). If the sink is batching events, the difference is between the oldest event in the batch and when the acknowledgment is recorded.
+
+{{site.data.alerts.callout_info}}
+The `admit_latency` and `commit_latency` metrics do **not** update for backfills during [initial scans]({% link {{ page.version.version }}/create-changefeed.md %}#initial-scan) or [backfills for schema changes]({% link {{ page.version.version }}/changefeed-messages.md %}#schema-changes-with-column-backfill). This is because a full table scan may contain rows that were written far in the past, which would lead to inaccurate changefeed latency measurements if the events from these scans were included in `admit_latency` and `commit_latency`.
+{{site.data.alerts.end}}
+
+Both of these metrics support [metrics labels](#using-changefeed-metrics-labels). You can set the `metrics_label` option when starting a changefeed to differentiate metrics per changefeed.
+
+We recommend using the p99 `commit_latency` aggregation for alerting and for setting SLAs for your changefeeds. Refer to the [Changefeed Dashboard]({% link {{ page.version.version }}/ui-cdc-dashboard.md %}) **Commit Latency** graph to track this metric in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}).
+
+If your changefeed is experiencing elevated latency, you can use these metrics to:
+
+- Review `admit_latency` versus `commit_latency` to estimate how long events take to move from the memory buffer to the downstream sink.
+- Compare the `commit_latency` P99, P90, and P50 latency percentiles to investigate performance over time.
+
+#### Lagging ranges
+
+{% include {{ page.version.version }}/cdc/lagging-ranges.md %}
+
 ## Debug a changefeed
 
 ### Using logs