Improve nav between cloud and self-hosted backup and restore docs
kathancox committed Feb 15, 2024
1 parent 79869e4 commit bfa3c7a
Showing 21 changed files with 669 additions and 128 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
<table>
<thead>
<tr>
<th>Backup type</th>
<th>Tier</th>
<th>Frequency</th>
<th>Retention</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Full cluster</td>
<td>Dedicated</td>
<td>Daily</td>
<td>30 days</td>
</tr>
<tr>
<td>Serverless</td>
<td>Hourly</td>
<td>30 days</td>
</tr>
<tr>
<td rowspan="2">Incremental cluster</td>
<td>Dedicated</td>
<td>Hourly</td>
<td>7 days</td>
</tr>
<tr>
<td>Serverless</td>
<td>None</td>
<td>Not applicable</td>
</tr>
</tbody>
</table>
@@ -0,0 +1,3 @@
{{site.data.alerts.callout_info}}
Metrics are reported per node, so you must retrieve metrics from every node in the cluster. For example, to monitor whether a backup fails, track `scheduled_backup_failed` on each node.
{{site.data.alerts.end}}
@@ -0,0 +1,3 @@
{{site.data.alerts.callout_success}}
We recommend using scheduled backups to automate daily backups of your cluster.
{{site.data.alerts.end}}
@@ -0,0 +1,3 @@
{{site.data.alerts.callout_info}}
Cockroach Labs recommends enabling Egress Perimeter Controls on CockroachDB {{ site.data.products.dedicated }} clusters to mitigate the risk of data exfiltration when accessing external resources, such as cloud storage for change data capture or backup and restore operations. See [Egress Perimeter Controls](https://www.cockroachlabs.com/docs/cockroachcloud/egress-perimeter-controls) for detail and setup instructions.
{{site.data.alerts.end}}
@@ -4,8 +4,6 @@ We recommend starting backups from a time at least 10 seconds in the past using
Only database and table-level backups are possible when using `userfile` as storage. Restoring cluster-level backups will not work because `userfile` data is stored in the `defaultdb` database, and you cannot restore a cluster with existing table data.
{{site.data.alerts.end}}

#### Database and table

When working on the same cluster, `userfile` storage allows for database and table-level backups.

First, run the following statement to back up a database to a directory in the default `userfile` space:
2 changes: 1 addition & 1 deletion src/current/_includes/v23.2/backups/support-products.md
@@ -1,3 +1,3 @@
## Supported products

The feature described on this page is available in **CockroachDB {{ site.data.products.dedicated }}**, **CockroachDB {{ site.data.products.serverless }}**, and **CockroachDB {{ site.data.products.core }}** clusters when you are running [customer-owned backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}#cockroachdb-backup-types). For a full list of features, see [Backup and restore product support]({% link {{ page.version.version }}/backup-and-restore-overview.md %}#backup-and-restore-product-support).
The feature described on this page is available in **CockroachDB {{ site.data.products.dedicated }}**, **CockroachDB {{ site.data.products.serverless }}**, and **CockroachDB {{ site.data.products.core }}** clusters when you are running [customer-owned backups](https://www.cockroachlabs.com/docs/cockroachcloud/take-and-restore-customer-owned-backups). For a full list of features, refer to [Backup and restore product support]({% link {{ page.version.version }}/backup-and-restore-overview.md %}#backup-and-restore-support).
14 changes: 13 additions & 1 deletion src/current/_includes/v23.2/sidebar-data/cloud-deployments.json
@@ -396,17 +396,29 @@
{
"title": "Backups and Restores",
"items": [
{
"title": "Overview",
"urls": [
"/cockroachcloud/backup-and-restore-overview.html"
]
},
{
"title": "Use Managed-Service Backups",
"urls": [
"/cockroachcloud/use-managed-service-backups.html"
]
},
{
"title": "Take and Restore Customer-Owned Backups on CockroachDB Cloud",
"title": "Take and Restore Customer-Owned Backups",
"urls": [
"/cockroachcloud/take-and-restore-customer-owned-backups.html"
]
},
{
"title": "Monitoring",
"urls": [
"/cockroachcloud/backup-and-restore-monitoring.html"
]
}
]
},
77 changes: 77 additions & 0 deletions src/current/cockroachcloud/backup-and-restore-monitoring.md
@@ -0,0 +1,77 @@
---
title: Backup and Restore Monitoring
summary: An overview of backup and restore monitoring features for CockroachDB Cloud deployments.
toc: true
---

CockroachDB includes metrics to monitor [backup](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/backup), [restore](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/restore), and [scheduled backup](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup) jobs. We recommend setting up a monitoring integration to alert on anomalies, such as failed backups or restore jobs that encounter a retryable error.

Depending on whether you are using a CockroachDB {{ site.data.products.dedicated }} or CockroachDB {{ site.data.products.serverless }} cluster, you can use the following to monitor backup and restore metrics for your cluster:

- [Cloud Console **Metrics** page]({% link cockroachcloud/metrics-page.md %}): CockroachDB {{ site.data.products.dedicated }}, CockroachDB {{ site.data.products.serverless }}
- [Prometheus](#prometheus): CockroachDB {{ site.data.products.dedicated }}
- [Datadog](#datadog): CockroachDB {{ site.data.products.dedicated }}

You can then use the following SQL statements to inspect details relating to schedules, jobs, and backups:

- [`SHOW SCHEDULES`](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/show-schedules)
- [`SHOW JOBS`](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/show-jobs)
- [`SHOW BACKUP`](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/show-backup)

For details on the [managed-service backups]({% link cockroachcloud/use-managed-service-backups.md %}) that Cockroach Labs stores for your CockroachDB {{ site.data.products.cloud }} cluster, see the **Backup and Restore** page in the Cloud Console.

{% include cockroachcloud/backups/metrics-per-node.md %}
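
Because metrics are reported per node, a cluster-wide view requires combining the values scraped from each node. The following Python sketch shows one way to do that for a counter, assuming you have already fetched each node's metrics in the Prometheus text format. The underscore-delimited metric name `schedules_BACKUP_failed`, the `node_id` label, and the sample scrape text are illustrative assumptions, not output from a real cluster.

```python
import re
from typing import Dict


def metric_value(scrape_text: str, metric_name: str) -> float:
    """Sum every sample of metric_name found in one node's scrape text."""
    # Match "name value" or "name{labels} value"; comment lines start with "#"
    # and longer metric names with the same prefix do not match.
    pattern = re.compile(
        r"^" + re.escape(metric_name) + r"(?:\{[^}]*\})?\s+(\S+)"
    )
    total = 0.0
    for line in scrape_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            total += float(match.group(1))
    return total


def sum_across_nodes(scrapes: Dict[str, str], metric_name: str) -> float:
    """Add up one metric across the scrape text of every node."""
    return sum(metric_value(text, metric_name) for text in scrapes.values())


# Hypothetical scrape output for a two-node cluster (assumed names and labels).
SAMPLE_SCRAPES = {
    "node1": '# TYPE schedules_BACKUP_failed counter\n'
             'schedules_BACKUP_failed{node_id="1"} 0\n',
    "node2": 'schedules_BACKUP_failed{node_id="2"} 2\n',
}

if __name__ == "__main__":
    failed = sum_across_nodes(SAMPLE_SCRAPES, "schedules_BACKUP_failed")
    if failed > 0:
        print(f"scheduled backup failures across the cluster: {failed}")
```

If you use a Prometheus server, an aggregation in the query layer would normally replace this kind of manual summing; the sketch only illustrates why reading a single node is not enough.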

## Prometheus

This section outlines the backup and restore job metrics available with Prometheus. For instructions on accessing the `metricexport` endpoint for Prometheus, refer to [Export Metrics From a CockroachDB Dedicated Cluster]({% link cockroachcloud/export-metrics.md %}).

We recommend the following guidelines:

- Use the `schedules.BACKUP.last_completed_time` metric to monitor the specific backup job or jobs you would use to recover from a disaster.
- Configure alerting on the `schedules.BACKUP.last_completed_time` metric to watch for cases where the timestamp has not moved forward as expected.
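
The guidelines above amount to a staleness check on a Unix timestamp. The following Python sketch shows the comparison, assuming you have already read the value of `schedules.BACKUP.last_completed_time` from your monitoring system. The function name, the 26-hour threshold, and the treatment of a zero value as "no completed backup" are assumptions for illustration; choose a threshold that matches your own schedule's frequency.

```python
import time
from typing import Optional


def backup_is_stale(last_completed_unix: float,
                    max_age_seconds: float,
                    now_unix: Optional[float] = None) -> bool:
    """Return True when the last completed backup is older than allowed."""
    if now_unix is None:
        now_unix = time.time()
    # Assumption: a value of 0 (or less) means no backup has completed yet.
    if last_completed_unix <= 0:
        return True
    return (now_unix - last_completed_unix) > max_age_seconds


# Hypothetical threshold: a daily schedule plus a 2-hour grace period.
MAX_AGE_SECONDS = 26 * 60 * 60

if __name__ == "__main__":
    last_completed = 1_700_000_000  # example timestamp, not real data
    if backup_is_stale(last_completed, MAX_AGE_SECONDS):
        print("alert: last completed backup timestamp has not advanced")
```

An equivalent rule in your alerting tool would compare the current time against the metric value in the same way.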

Metric | Description
-------+-------------
`schedules.BACKUP.failed` | The number of scheduled backup jobs that have failed. **Note:** A stuck scheduled job will not increment this metric.
`schedules.BACKUP.last_completed_time` | The Unix timestamp of the most recently completed scheduled backup specified as maintaining this metric. **Note:** This metric only updates if the schedule was created with the [`updates_cluster_last_backup_time_metric` option](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#schedule-options).
`schedules.BACKUP.protected_age_sec` | The age of the oldest [protected timestamp record](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#protected-timestamps-and-scheduled-backups) protected by backup schedules.
`schedules.BACKUP.protected_record_count` | The number of [protected timestamp records](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#protected-timestamps-and-scheduled-backups) held by backup schedules.
`schedules.BACKUP.started` | The number of scheduled backup jobs that have started.
`schedules.BACKUP.succeeded` | The number of scheduled backup jobs that have succeeded.
`schedules.round.reschedule_skip` | The number of schedules that were skipped due to a currently running job. A value greater than 0 indicates that a previous backup was still running when a new scheduled backup was supposed to start. This corresponds to the [`on_previous_running=skip`](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#on-previous-running-option) schedule option.
`schedules.round.reschedule_wait` | The number of schedules that were rescheduled due to a currently running job. A value greater than 0 indicates that a previous backup was still running when a new scheduled backup was supposed to start. This corresponds to the [`on_previous_running=wait`](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#on-previous-running-option) schedule option.
`jobs.backup.currently_paused` | The number of backup jobs currently considered [paused](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/pause-job).
`jobs.backup.currently_running` | The number of backup jobs currently running in `Resume` or `OnFailOrCancel` state.
`jobs.backup.fail_or_cancel_retry_error` | The number of backup jobs that failed with a retryable error on their failure or cancelation process.
`jobs.backup.fail_or_cancel_completed` | The number of backup jobs that successfully completed their failure or cancelation process.
`jobs.backup.fail_or_cancel_failed` | The number of backup jobs that failed with a non-retryable error on their failure or cancelation process.
`jobs.backup.protected_age_sec` | The age of the oldest [protected timestamp record](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#protected-timestamps-and-scheduled-backups) protected by backup jobs.
`jobs.backup.protected_record_count` | The number of [protected timestamp records](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/create-schedule-for-backup#protected-timestamps-and-scheduled-backups) held by backup jobs.
`jobs.backup.resume_failed` | The number of backup jobs that failed with a non-retryable error.
`jobs.backup.resume_retry_error` | The number of backup jobs that failed with a retryable error.
`jobs.restore.currently_paused` | The number of restore jobs currently considered [paused](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/pause-job).
`jobs.restore.currently_running` | The number of restore jobs currently running in `Resume` or `OnFailOrCancel` state.
`jobs.restore.fail_or_cancel_failed` | The number of restore jobs that failed with a non-retryable error on their failure or cancelation process.
`jobs.restore.fail_or_cancel_retry_error` | The number of restore jobs that failed with a retryable error on their failure or cancelation process.
`jobs.restore.protected_age_sec` | The age of the oldest [protected timestamp record](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/architecture/storage-layer#protected-timestamps) protected by restore jobs.
`jobs.restore.protected_record_count` | The number of [protected timestamp records](https://www.cockroachlabs.com/docs/{{site.current_cloud_version}}/architecture/storage-layer#protected-timestamps) held by restore jobs.
`jobs.restore.resume_completed` | The number of restore jobs that successfully resumed to completion.
`jobs.restore.resume_failed` | The number of restore jobs that failed with a non-retryable error.
`jobs.restore.resume_retry_error` | The number of restore jobs that failed with a retryable error.

## Datadog

To use the Datadog integration with your CockroachDB {{ site.data.products.dedicated }} cluster, you can:

- Export the following scheduled backup metrics to Datadog using the [Cloud API]({% link cockroachcloud/cloud-api.md %}). To set this up, refer to [Export Metrics From a CockroachDB Dedicated Cluster]({% link cockroachcloud/export-metrics.md %}).
- Access the Cloud Console **Monitoring** page to enable the integration. To set this up, refer to [Monitor CockroachDB Dedicated with Datadog]({% link cockroachcloud/tools-page.md %}#monitor-cockroachdb-dedicated-with-datadog).

### Available metrics in Datadog

Metric | Description
-------+-------------
`schedules.BACKUP.succeeded` | The number of scheduled backup jobs that have succeeded.
`schedules.BACKUP.started` | The number of scheduled backup jobs that have started.
`schedules.BACKUP.last_completed_time` | The Unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric.
`schedules.BACKUP.failed` | The number of scheduled backup jobs that have failed.