DOC-8878 Enhance Essential Metrics with Alert guidance #18537
…lerts from private repo and alertmanager alerts from public docs. (2) In essential-metrics.md, added anchors to metrics used in essential-alerts-self-hosted.md.
… sections as essential-metrics.md.
@andyyang890 OR @wenyihu6: please review the alert for Changefeed experiencing high latency. @dikshant: please review the alerts for:
…alerts on storage.write-stalls and schedules.BACKUP.failed. (b) fixed case of headings to sentence case.
TFTR!
The "Changefeed experiencing high latency" section LGTM! Will defer to others for final approval.
Looks great Florence! Excited to get this published.
## SQL

### Node not executing SQL

Send an alert when a node is not executing SQL despite having connections. `sql.conns` shows the number of connections as well as the distribution, or balancing, of connections across cluster nodes. An imbalance can lead to nodes becoming overloaded.

**Metric**
<br>[`sql.conns`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-conns)
<br>`sql.query.count`

**Rule**
<br>Set alerts for each node:
<br>WARNING: `sql.conns` greater than `0` while `sql.query.count` equals `0`

**Action**

- Refer to [Connection Pooling]({% link {{ page.version.version }}/connection-pooling.md %}).

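
The "Node not executing SQL" rule can be sketched in Python as a per-node check. This is only an illustration, not how CockroachDB or any alerting tool evaluates it: the `NodeSample` type, the `nodes_not_executing_sql` function, and the sample values are hypothetical, and `sql.query.count` is assumed to be read as the number of queries observed during the evaluation window.

```python
from typing import Dict, List, NamedTuple


class NodeSample(NamedTuple):
    """Hypothetical per-node reading for one evaluation window."""
    sql_conns: float        # value of sql.conns
    sql_query_count: float  # queries seen in the window (assumed interpretation)


def nodes_not_executing_sql(samples: Dict[str, NodeSample]) -> List[str]:
    """Return nodes that hold connections but executed no queries."""
    return sorted(
        node
        for node, sample in samples.items()
        if sample.sql_conns > 0 and sample.sql_query_count == 0
    )


# Made-up readings for a three-node cluster.
samples = {
    "n1": NodeSample(sql_conns=12, sql_query_count=340),
    "n2": NodeSample(sql_conns=8, sql_query_count=0),
    "n3": NodeSample(sql_conns=0, sql_query_count=0),
}
print(nodes_not_executing_sql(samples))  # ['n2']
```

Node `n3` is not flagged because it has no connections, which matches the "greater than `0`" half of the rule.
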
### SQL query failure

Send an alert when the query failure count exceeds a user-determined threshold based on their application's SLA.

**Metric**
<br>[`sql.failure.count`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-failure-count)

**Rule**
<br>WARNING: `sql.failure.count` is greater than a threshold (based on the user’s application SLA)

**Action**

- Use the [**Insights** page]({% link {{ page.version.version }}/ui-insights-page.md %}) to find failed executions with their error codes for troubleshooting, or use application-level logs, if instrumented, to determine the cause of the error.

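
One way to picture the "SQL query failure" rule is as a comparison between two successive readings of the failure count. The sketch below is an assumption-laden illustration: the function name, the idea of comparing two readings per evaluation window, the handling of a counter that restarts from zero, and the threshold value are all hypothetical and should be replaced with whatever fits the application's SLA.

```python
def sql_failure_alert(previous_count: int, current_count: int, threshold: int) -> bool:
    """Return True when failures in the window exceed the threshold.

    Assumes sql.failure.count only increases, except when it restarts
    from zero (for example, after a node restart).
    """
    if current_count < previous_count:
        # Assumed reset: count only the failures seen since the restart.
        failures_in_window = current_count
    else:
        failures_in_window = current_count - previous_count
    return failures_in_window > threshold


# Hypothetical readings taken at the start and end of one window.
print(sql_failure_alert(previous_count=100, current_count=130, threshold=25))  # True
print(sql_failure_alert(previous_count=100, current_count=110, threshold=25))  # False
```
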
### SQL queries experiencing high latency

Send an alert when the query latency exceeds a user-determined threshold based on their application’s SLA.

**Metric**
<br>[`sql.service.latency`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-service-latency)
<br>[`sql.conn.latency`]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#sql-conn-latency)

**Rule**
<br>WARNING: (p99 or p90 of `sql.service.latency` plus average of `sql.conn.latency`) is greater than a threshold (based on the user’s application SLA)

**Action**

- Apply the time range of the alert to the [**SQL Activity** pages]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#sql-activity-pages) to investigate. Use the [**Statements** page]({% link {{ page.version.version }}/ui-statements-page.md %}) P90 Latency and P99 latency columns to correlate [statement fingerprints]({% link {{ page.version.version }}/ui-statements-page.md %}#sql-statement-fingerprints) with this alert.

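
The arithmetic in the "SQL queries experiencing high latency" rule (a high percentile of `sql.service.latency` plus the average of `sql.conn.latency`, compared with a threshold) can be sketched as follows. This assumes raw latency samples in milliseconds are available and uses a simple nearest-rank percentile; a real monitoring system would more likely derive percentiles from histogram buckets. The function names, sample data, and thresholds are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sql_latency_alert(
    service_latencies_ms: Sequence[float],
    conn_latencies_ms: Sequence[float],
    threshold_ms: float,
    pct: int = 99,
) -> bool:
    """Return True when percentile(service) + mean(conn) exceeds the threshold."""
    combined = percentile(service_latencies_ms, pct) + mean(conn_latencies_ms)
    return combined > threshold_ms


# Made-up samples: service latencies of 1..100 ms, connection latencies of 2, 4, 6 ms.
service = list(range(1, 101))
conn = [2, 4, 6]
print(sql_latency_alert(service, conn, threshold_ms=100))          # True  (99 + 4 > 100)
print(sql_latency_alert(service, conn, threshold_ms=100, pct=90))  # False (90 + 4 <= 100)
```

Switching `pct` between 99 and 90 mirrors the "p99 or p90" choice in the rule.
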
{% if include.deployment == 'self-hosted' %}
LGTM! @mgartner do you have any thoughts/concerns on highlighting these as essential metrics to monitor for SQL?
Hi @mgartner, when you have a chance, would you be able to review this section: SQL queries experiencing high latency?
In essential-metrics.md, took cloud-2.0 version and manually added links to metrics used by essential-alerts.md. Renamed essential-alerts-dedicated.md to essential-alerts-advanced.md. In essential-alerts.md, (a) replaced dedicated with advanced, (b) replaced links to essential-metrics-self-hosted.md with essential-metrics-{{ include.deployment }}.md. In cloud-deployments.json, added link to essential-alerts-advanced.md.
…th link to Essential Alerts. copied v24.1 changed files to v24.2
Hi @kathancox, I have made the |
Really great Florence! Pending your review of the comments, it looks good to me.
TFTRs!
Fixes DOC-8878
(1) Added essential-alerts.md include file, a compilation of alerts from (a) a-entin's repo, (b) alertmanager alerts in public docs, and (c) Skills Taxonomy private doc.
(2) In essential-metrics.md, (a) added anchors to metrics used in essential-alerts.md, (b) added link to essential alerts.
(3) Added essential-alerts-self-hosted.md to display all the essential alerts.
(4) Added essential-alerts-advanced.md to display a subset of the essential alerts applicable to advanced clusters.
(5) In self-hosted-deployments.json and cloud-deployments.json, added links to the corresponding essential alerts pages.
Rendered preview: