DOC-9513 Serverless Canned Metrics
DOC-9632 Serverless Essential Metrics
florence-crl committed Feb 12, 2024
1 parent 79869e4 commit cb48172
Showing 71 changed files with 3,472 additions and 22 deletions.
2,334 changes: 2,334 additions & 0 deletions src/current/_data/metrics-list.csv

Large diffs are not rendered by default.

730 changes: 730 additions & 0 deletions src/current/_data/metrics.yml

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric provides useful context when assessing the state of changefeeds. It characterizes the end-to-end lag between when a change is committed and when that change is applied at the destination.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric provides useful context when assessing the state of changefeeds. It characterizes the throughput, in bytes, of data being streamed from the CockroachDB cluster.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric provides useful context when assessing the state of changefeeds. It characterizes the rate of changes being streamed from the CockroachDB cluster.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric tracks transient changefeed errors. Alert on "too many" errors, such as 50 retries in 15 minutes. Some increase is expected: for example, during a rolling upgrade this counter increases because changefeed jobs restart following node restarts. Retries use an exponential backoff of up to 10 minutes. However, if no rolling upgrade or other cluster maintenance is in progress and the error rate is high, investigate the changefeed job.
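The "50 retries in 15 minutes" guideline can be sketched as a check on the increase of a cumulative counter over a trailing window. This is a minimal illustration, not a monitoring integration: the function names, the `(timestamp, value)` sample shape, and the `maintenance_in_progress` flag are all hypothetical, and the sketch assumes samples are sorted by time and that the counter does not reset inside the window.

```python
from typing import Sequence, Tuple

# Hypothetical sample shape: (unix_seconds, cumulative counter value).
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample], window_s: float, now: float) -> float:
    """Increase of a cumulative counter over the trailing window.

    Assumes samples are sorted by time and the counter did not reset.
    """
    in_window = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def should_alert_on_retries(
    samples: Sequence[Sample],
    now: float,
    maintenance_in_progress: bool,
    threshold: float = 50,       # example threshold from the text
    window_s: float = 15 * 60,   # example window from the text
) -> bool:
    """Alert only when the retry rate is high outside of planned maintenance."""
    if maintenance_in_progress:
        # Restarts during a rolling upgrade are expected to bump the counter.
        return False
    return counter_increase(samples, window_s, now) >= threshold
```

Suppressing the alert during maintenance mirrors the guidance above; the threshold and window are only the example values and would need tuning per workload.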
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/changefeed.failures.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric tracks the permanent changefeed job failures that the jobs system will not try to restart. Any increase in this counter should be investigated. An alert on this metric is recommended.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/changefeed.running.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric tracks the total number of all running changefeeds.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric is a high-level indicator that automatically generated statistics jobs are paused, which can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected, leading to poor query performance.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric tracks the number of active automatically generated statistics jobs that could also be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric is a high-level indicator that the automatic generation of <a href="https://www.cockroachlabs.com/docs/stable/cost-based-optimizer#table-statistics">table statistics</a> is failing. Failed statistics creation can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected, leading to poor query performance.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Monitor and alert on this metric to safeguard against an inadvertent operational error of leaving a changefeed job in a paused state for an extended period of time. Changefeed jobs should not be paused for a long time because the <a href="https://www.cockroachlabs.com/docs/stable/monitor-and-debug-changefeeds#protected-timestamp-and-garbage-collection-monitoring">protected timestamp prevents garbage collection</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<a href="https://www.cockroachlabs.com/docs/stable/monitor-and-debug-changefeeds#protected-timestamp-and-garbage-collection-monitoring">Changefeeds use protected timestamps to protect the data from being garbage collected</a>. Ensure the protected timestamp age does not significantly exceed the <a href="https://www.cockroachlabs.com/docs/stable/configure-replication-zones#replication-zone-variables">GC TTL zone configuration</a>. Alert on this metric if the protected timestamp age is greater than 3 times the GC TTL.
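The alert condition above (protected timestamp age greater than 3 times the GC TTL) can be sketched as a single comparison. The function name and parameters are hypothetical; the sketch assumes both values are expressed in seconds.

```python
def protected_ts_age_exceeds(age_s: float, gc_ttl_s: float, multiple: float = 3.0) -> bool:
    """Return True when the protected timestamp age exceeds `multiple` x GC TTL.

    Both arguments are assumed to be in seconds; `multiple` defaults to the
    factor of 3 suggested in the text.
    """
    if gc_ttl_s <= 0:
        raise ValueError("gc_ttl_s must be positive")
    return age_s > multiple * gc_ttl_s
```

For example, with a GC TTL of 4 hours (14,400 seconds), this check would fire once the protected timestamp age passes 43,200 seconds.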
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric tracks the number of active create statistics jobs that may be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Monitor this metric to ensure the Row-Level TTL job does not remain paused inadvertently for an extended period.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Monitor this metric to ensure there are not too many Row-Level TTL jobs running at the same time. Generally, this metric should be in the low single digits.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
See Description.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
If Row-Level TTL is enabled, this metric should be nonzero and correspond to the `ttl_cron` setting that was chosen. If this metric is zero, it means the job is not running.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric should remain at zero. Repeated errors mean the Row-Level TTL job is not deleting data.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Correlate this metric with the metric `jobs.row_level_ttl.rows_selected` to ensure all the rows that should be deleted are actually getting deleted.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Correlate this metric with the metric `jobs.row_level_ttl.rows_deleted` to ensure all the rows that should be deleted are actually getting deleted.
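One way to correlate `jobs.row_level_ttl.rows_selected` with `jobs.row_level_ttl.rows_deleted` is to compare their increases over the same time range. The sketch below is illustrative only; the helper names are hypothetical and the inputs are assumed to be counts taken over an identical window.

```python
def ttl_delete_gap(rows_selected: int, rows_deleted: int) -> int:
    """Rows selected for deletion that were not deleted in the same window."""
    return max(rows_selected - rows_deleted, 0)


def ttl_delete_ratio(rows_selected: int, rows_deleted: int) -> float:
    """Fraction of selected rows that were deleted (1.0 when nothing was selected)."""
    if rows_selected == 0:
        return 1.0
    return rows_deleted / rows_selected
```

A gap that stays above zero across several windows would suggest that rows eligible for deletion are not being removed.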
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
See Description.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
See Description.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
See Description.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/livebytes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The amount of data being stored in the cluster. This is the logical number of live bytes and does not account for compression or replication.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Monitor this metric to ensure the Row-Level TTL job is running. If it is non-zero, it means the job could not be created.
3 changes: 3 additions & 0 deletions src/current/_includes/metrics-usage/sql.conn.latency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Connection latency is calculated as the time in nanoseconds between when the cluster receives a connection request and establishes the connection to the client, including <a href="https://www.cockroachlabs.com/docs/cockroachcloud/authentication">authentication</a>. This graph shows the p90 and p99 latencies for <a href="https://www.cockroachlabs.com/docs/stable/show-sessions">SQL connections</a> to the cluster.

These metrics characterize the database connection latency which can affect the application performance, for example, by having slow startup times.
5 changes: 5 additions & 0 deletions src/current/_includes/metrics-usage/sql.conns.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
This metric shows the total number of SQL <a href="https://www.cockroachlabs.com/docs/stable/show-sessions">client connections</a> across the cluster.

Refer to the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/sessions-page"><b>Sessions</b> page</a> for more details on the sessions.

This metric also shows the distribution, or balancing, of connections across the cluster. Review <a href="https://www.cockroachlabs.com/docs/stable/connection-pooling">Connection Pooling</a>.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.ddl.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the <a href="https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#sql-activity-pages"><b>SQL Activity</b> pages</a> to investigate interesting outliers or patterns. For example, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/transactions-page"><b>Transactions</b> page</a> and the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>Statements</b> page</a>, sort on the Execution Count column. To find problematic sessions, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/sessions-page"><b>Sessions</b> page</a>, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.delete.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the <a href="https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#sql-activity-pages"><b>SQL Activity</b> pages</a> to investigate interesting outliers or patterns. For example, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/transactions-page"><b>Transactions</b> page</a> and the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>Statements</b> page</a>, sort on the Execution Count column. To find problematic sessions, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/sessions-page"><b>Sessions</b> page</a>, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The total number of SQL statements that experienced contention.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.failure.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric is a high-level indicator of workload and application degradation with query failures. Use the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/insights-page"><b>Insights</b> page</a> to find failed executions with their error code to troubleshoot or use application-level logs, if instrumented, to determine the cause of error.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.full.scan.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric is a high-level indicator of potentially suboptimal query plans in the workload that may require index tuning and maintenance. To identify the <a href="https://www.cockroachlabs.com/docs/stable/performance-recipes#statements-with-full-table-scans">statements with a full table scan</a>, use `SHOW FULL TABLE SCANS` or the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>SQL Activity Statements</b> page</a> with the corresponding metric time frame. The <b>Statements</b> page also includes <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page#explain-plans">explain plans</a> and <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page#insights">index recommendations</a>. Not all full scans are necessarily bad, especially over smaller tables.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.insert.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the <a href="https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#sql-activity-pages"><b>SQL Activity</b> pages</a> to investigate interesting outliers or patterns. For example, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/transactions-page"><b>Transactions</b> page</a> and the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>Statements</b> page</a>, sort on the Execution Count column. To find problematic sessions, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/sessions-page"><b>Sessions</b> page</a>, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.new_conns.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The rate of this metric shows how frequently new connections are being established. This can be useful in determining if a high rate of incoming new connections is causing additional load on the server due to a misconfigured application.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.select.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the <a href="https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#sql-activity-pages"><b>SQL Activity</b> pages</a> to investigate interesting outliers or patterns. For example, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/transactions-page"><b>Transactions</b> page</a> and the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>Statements</b> page</a>, sort on the Execution Count column. To find problematic sessions, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/sessions-page"><b>Sessions</b> page</a>, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.service.latency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
These high-level metrics reflect workload performance. Monitor these metrics to understand latency over time. If abnormal patterns emerge, apply the metric's time range to the <a href="https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#sql-activity-pages"><b>SQL Activity</b> pages</a> to investigate interesting outliers or patterns. The <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>Statements</b> page</a> has P90 Latency and P99 Latency columns to enable correlation with these metrics.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload volume.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.txn.abort.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload performance. A persistently high number of SQL transaction abort errors may negatively impact the workload performance and needs to be investigated.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.txn.begin.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric reflects workload volume by counting explicit <a href="https://www.cockroachlabs.com/docs/stable/transactions">transactions</a>. Use this metric to determine whether explicit transactions can be refactored as implicit transactions (individual statements).
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric shows the number of <a href="https://www.cockroachlabs.com/docs/stable/transactions">transactions</a> that completed successfully. This metric can be used as a proxy to measure the number of successful explicit transactions.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.txn.latency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Over the last minute, this cluster executed 90% or 99% of transactions within this time. This time does not include network latency between the cluster and client. These metrics provide an overview of the current SQL workload.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric shows the number of orderly transaction <a href="https://www.cockroachlabs.com/docs/stable/rollback-transaction">rollbacks</a>. A persistently high number of rollbacks may negatively impact the workload performance and needs to be investigated.
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.txns.open.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric should roughly correspond to four times the number of cores. If this metric is consistently larger, scale out the cluster.
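The rule of thumb above can be sketched as a comparison of sampled open-transaction counts against four times the core count. The helper names are hypothetical, and reading "consistently larger" as "at least 90% of samples above the guideline" is an assumption made only for this illustration.

```python
from typing import Sequence


def open_txn_guideline(total_cores: int) -> int:
    """Rule-of-thumb ceiling for open transactions: 4 x number of cores."""
    return 4 * total_cores


def consistently_above_guideline(
    open_txn_samples: Sequence[float],
    total_cores: int,
    min_fraction: float = 0.9,  # assumed meaning of "consistently"
) -> bool:
    """True when most samples of open transactions exceed the guideline."""
    if not open_txn_samples:
        return False
    limit = open_txn_guideline(total_cores)
    above = sum(1 for sample in open_txn_samples if sample > limit)
    return above / len(open_txn_samples) >= min_fraction
```

With 8 cores the guideline is 32 open transactions; a brief spike above that value would not trip the check, but a sustained run of samples above it would.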
1 change: 1 addition & 0 deletions src/current/_includes/metrics-usage/sql.update.count.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the <a href="https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#sql-activity-pages"><b>SQL Activity</b> pages</a> to investigate interesting outliers or patterns. For example, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/transactions-page"><b>Transactions</b> page</a> and the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/statements-page"><b>Statements</b> page</a>, sort on the Execution Count column. To find problematic sessions, on the <a href="https://www.cockroachlabs.com/docs/cockroachcloud/sessions-page"><b>Sessions</b> page</a>, sort on the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to cross-region networking. Correlate this metric with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/plan-your-cluster-serverless#multi-region-clusters">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed because of byte traffic to the client and cluster bulk I/O operations (e.g., CDC). Correlate this metric with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed because of byte traffic to the client and cluster bulk I/O operations (e.g., CDC). Correlate this metric with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to KV reads, broken down by requests, batches, and bytes. SQL statements are translated into lower-level KV read requests that are sent in batches. Correlate these metrics with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to KV reads, broken down by requests, batches, and bytes. SQL statements are translated into lower-level KV read requests that are sent in batches. Correlate these metrics with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to KV reads, broken down by requests, batches, and bytes. SQL statements are translated into lower-level KV read requests that are sent in batches. Correlate these metrics with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The CPU and I/O resources being used by queries in the cluster. Simple queries consume few RUs, while complicated queries with many reads and writes consume more RUs. <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more.</a>
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed because of SQL CPU usage. Correlate this metric with Request Units (RUs) and determine if your workload is CPU bound. <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more.</a>
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to KV writes, broken down by requests, batches, and bytes. SQL statements are translated into lower-level KV write requests that are sent in batches. Correlate these metrics with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to KV writes, broken down by requests, batches, and bytes. SQL statements are translated into lower-level KV write requests that are sent in batches. Correlate these metrics with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The number of RUs that were consumed due to KV writes, broken down by requests, batches, and bytes. SQL statements are translated into lower-level KV write requests that are sent in batches. Correlate these metrics with Request Units (RUs). <a href="https://www.cockroachlabs.com/docs/cockroachcloud/serverless-resource-usage">Learn more</a>.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review <a href="https://www.cockroachlabs.com/docs/stable/performance-best-practices-overview#transaction-contention">transaction contention best practices</a> and <a href="https://www.cockroachlabs.com/docs/stable/performance-recipes#transaction-contention">performance tuning recipes</a>. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For details on a specific error, refer to the <a href="https://www.cockroachlabs.com/docs/stable/transaction-retry-error-reference">transaction retry error reference</a>.
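To compare against the "tens of restarts per minute" guideline, a cumulative restart counter can be converted to a per-minute rate from two samples. This is a sketch under assumptions: the `(unix_seconds, value)` sample shape and the function name are hypothetical, and the counter is assumed not to reset between the two samples.

```python
from typing import Tuple

# Hypothetical sample shape: (unix_seconds, cumulative restart count).
Sample = Tuple[float, float]


def restarts_per_minute(first: Sample, last: Sample) -> float:
    """Average restarts per minute between two samples of a cumulative counter."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("last sample must be later than first sample")
    return (v1 - v0) / ((t1 - t0) / 60.0)
```

For instance, a counter that rises from 100 to 250 over five minutes averages 30 restarts per minute, which falls in the range the text flags for investigation.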