From 254337e19b38dbcf5a1205c75f62efc24d705c05 Mon Sep 17 00:00:00 2001 From: Tiffany Hrabusa <30397949+tiffany76@users.noreply.github.com> Date: Fri, 14 Jun 2024 03:45:36 -0700 Subject: [PATCH] Unify internal observability documentation - 3 of 3 (#4529) Co-authored-by: Fabrizio Ferri-Benedetti Co-authored-by: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com> --- .../en/docs/collector/internal-telemetry.md | 75 ++++- content/en/docs/collector/troubleshooting.md | 266 ++++++++++++++++-- 2 files changed, 310 insertions(+), 31 deletions(-) diff --git a/content/en/docs/collector/internal-telemetry.md b/content/en/docs/collector/internal-telemetry.md index 0b300553fb2e..b54a555eca02 100644 --- a/content/en/docs/collector/internal-telemetry.md +++ b/content/en/docs/collector/internal-telemetry.md @@ -5,10 +5,11 @@ weight: 25 cSpell:ignore: alloc journalctl kube otecol pprof tracez underperforming zpages --- -You can monitor the health of any OpenTelemetry Collector instance by checking +You can inspect the health of any OpenTelemetry Collector instance by checking its own internal telemetry. Read on to learn about this telemetry and how to -configure it to help you [troubleshoot](/docs/collector/troubleshooting/) -Collector issues. +configure it to help you +[monitor](#use-internal-telemetry-to-monitor-the-collector) and +[troubleshoot](/docs/collector/troubleshooting/) the Collector. ## Activate internal telemetry in the Collector @@ -97,9 +98,9 @@ critical analysis. ### Configure internal logs Log output is found in `stderr`. You can configure logs in the config -`service::telemetry::logs`. The [configuration -options](https://github.com/open-telemetry/opentelemetry-collector/blob/v{{% param -vers %}}/service/telemetry/config.go) are: +`service::telemetry::logs`. 
The +[configuration options](https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/telemetry/config.go) +are: | Field name | Default value | Description | | ---------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -133,7 +134,7 @@ journalctl | grep otelcol | grep Error {{% /tab %}} {{< /tabpane >}} -## Types of internal observability +## Types of internal telemetry The OpenTelemetry Collector aims to be a model of observable service by clearly exposing its own operational metrics. Additionally, it collects host resource @@ -272,3 +273,63 @@ The Collector logs the following internal events: - Data dropping due to invalid data stops. - A crash is detected, differentiated from a clean stop. Crash data is included if available. + +## Use internal telemetry to monitor the Collector + +This section recommends best practices for monitoring the Collector using its +own telemetry. + +### Critical monitoring + +#### Data loss + +Use the rate of `otelcol_processor_dropped_spans > 0` and +`otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on +your project's requirements, select a narrow time window before alerting begins +to avoid notifications for small losses that are within the desired reliability +range and not considered outages. + +### Secondary monitoring + +#### Queue length + +Most exporters provide a +[queue or retry mechanism](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md) +that is recommended for use in any production deployment of the Collector. + +The `otelcol_exporter_queue_capacity` metric indicates the capacity, in batches, +of the retry queue. 
The `otelcol_exporter_queue_size` metric indicates the
+current size of the retry queue. Use these two metrics to check if the queue
+capacity can support your workload.
+
+Using the following three metrics, you can identify the number of spans, metric
+points, and log records that failed to reach the sending queue:
+
+- `otelcol_exporter_enqueue_failed_spans`
+- `otelcol_exporter_enqueue_failed_metric_points`
+- `otelcol_exporter_enqueue_failed_log_records`
+
+These failures could be caused by a queue filled with unsettled elements. You
+might need to decrease your sending rate or horizontally scale Collectors.
+
+The queue or retry mechanism also supports logging for monitoring. Check the
+logs for messages such as `Dropping data because sending_queue is full`.
+
+#### Receive failures
+
+Sustained rates of `otelcol_receiver_refused_spans` and
+`otelcol_receiver_refused_metric_points` indicate that too many errors were
+returned to clients. Depending on the deployment and the clients' resilience,
+this might indicate clients' data loss.
+
+Sustained rates of `otelcol_exporter_send_failed_spans` and
+`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not
+able to export data as expected. These metrics do not inherently imply data loss
+since there could be retries. But a high rate of failures could indicate issues
+with the network or backend receiving the data.
+
+#### Data flow
+
+You can monitor data ingress with the `otelcol_receiver_accepted_spans` and
+`otelcol_receiver_accepted_metric_points` metrics and data egress with the
+`otelcol_exporter_sent_spans` and `otelcol_exporter_sent_metric_points` metrics.
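Reviewer sketch, not part of the commit: the data loss, queue length, and send failure checks described in the monitoring section added to `internal-telemetry.md` could be expressed as Prometheus alerting rules. This assumes the Collector's internal metrics are scraped by Prometheus under the names used in the text (some Collector versions expose counters with a `_total` suffix, so the names may need adjusting). The group name, alert names, `5m` rate windows, `for` durations, and the 80% queue threshold are illustrative assumptions, not recommended values.

```yaml
# Hypothetical Prometheus rule file. Every name, window, and threshold
# below is an assumption for illustration only.
groups:
  - name: otelcol-internal-telemetry
    rules:
      # Critical: the text suggests alerting on a non-zero drop rate.
      - alert: OtelcolProcessorDroppedSpans
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 10m # delay before firing, so small losses do not notify
        labels:
          severity: critical
      # Secondary: queue size approaching the configured capacity.
      - alert: OtelcolExporterQueueNearCapacity
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
      # Secondary: sustained export failures (retries may still succeed).
      - alert: OtelcolExporterSendFailedSpans
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 15m
        labels:
          severity: warning
```

The `for` clause plays the role of the time window that the Data loss guidance asks you to choose before alerting begins. Equivalent rules for metric points would swap in the `*_metric_points` metric names.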
diff --git a/content/en/docs/collector/troubleshooting.md b/content/en/docs/collector/troubleshooting.md index 8278d00b678b..e48030b648fb 100644 --- a/content/en/docs/collector/troubleshooting.md +++ b/content/en/docs/collector/troubleshooting.md @@ -1,37 +1,135 @@ --- title: Troubleshooting -description: Recommendations for troubleshooting the collector +description: Recommendations for troubleshooting the Collector weight: 25 +cSpell:ignore: pprof tracez zpages --- -This page describes some options when troubleshooting the health or performance -of the OpenTelemetry Collector. The Collector provides a variety of metrics, -logs, and extensions for debugging issues. +On this page, you can learn how to troubleshoot the health and performance of +the OpenTelemetry Collector. -## Internal telemetry +## Troubleshooting tools + +The Collector provides a variety of metrics, logs, and extensions for debugging +issues. + +### Internal telemetry You can configure and use the Collector's own [internal telemetry](/docs/collector/internal-telemetry/) to monitor its performance. -## Sending test data +### Local exporters + +For certain types of issues, such as configuration verification and network +debugging, you can send a small amount of test data to a Collector configured to +output to local logs. Using a +[local exporter](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#general-information), +you can inspect the data being processed by the Collector. + +For live troubleshooting, consider using the +[`debug` exporter](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/debugexporter/README.md), +which can confirm that the Collector is receiving, processing, and exporting +data. For example: + +```yaml +receivers: + zipkin: +exporters: + debug: +service: + pipelines: + traces: + receivers: [zipkin] + processors: [] + exporters: [debug] +``` + +To begin testing, generate a Zipkin payload. 
For example, you can create a file +called `trace.json` that contains: + +```json +[ + { + "traceId": "5982fe77008310cc80f1da5e10147519", + "parentId": "90394f6bcffb5d13", + "id": "67fae42571535f60", + "kind": "SERVER", + "name": "/m/n/2.6.1", + "timestamp": 1516781775726000, + "duration": 26000, + "localEndpoint": { + "serviceName": "api" + }, + "remoteEndpoint": { + "serviceName": "apip" + }, + "tags": { + "data.http_response_code": "201" + } + } +] +``` + +With the Collector running, send this payload to the Collector: + +```shell +curl -X POST localhost:9411/api/v2/spans -H'Content-Type: application/json' -d @trace.json +``` + +You should see a log entry like the following: -For certain types of issues, particularly verifying configuration and debugging -network issues, it can be helpful to send a small amount of data to a collector -configured to output to local logs. For details, see -[Local exporters](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md#local-exporters). 
+```shell +2023-09-07T09:57:43.468-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} +``` + +You can also configure the `debug` exporter so the entire payload is printed: + +```yaml +exporters: + debug: + verbosity: detailed +``` + +If you re-run the previous test with the modified configuration, the log output +looks like this: + +```shell +2023-09-07T09:57:12.820-0700 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2} +2023-09-07T09:57:12.821-0700 info ResourceSpans #0 +Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0 +Resource attributes: + -> service.name: Str(telemetrygen) +ScopeSpans #0 +ScopeSpans SchemaURL: +InstrumentationScope telemetrygen +Span #0 + Trace ID : 0c636f29e29816ea76e6a5b8cd6601cf + Parent ID : 1a08eba9395c5243 + ID : 10cebe4b63d47cae + Name : okey-dokey + Kind : Internal + Start time : 2023-09-07 16:57:12.045933 +0000 UTC + End time : 2023-09-07 16:57:12.046058 +0000 UTC + Status code : Unset + Status message : +Attributes: + -> span.kind: Str(server) + -> net.peer.ip: Str(1.2.3.4) + -> peer.service: Str(telemetrygen) +``` -## Check available components in the Collector +### Check Collector components Use the following sub-command to list the available components in a Collector distribution, including their stability levels. Please note that the output -format may change across versions. +format might change across versions. -```sh +```shell otelcol components ``` -Sample output +Sample output: ```yaml buildinfo: @@ -120,24 +218,144 @@ extensions: extension: Beta ``` +### Extensions + +Here is a list of extensions you can enable for debugging the Collector. 
+
+#### Performance Profiler (pprof)
+
+The
+[pprof extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/pprofextension/README.md),
+which is available locally on port `1777`, allows you to profile the Collector
+as it runs. This is an advanced use-case that should not be needed in most
+circumstances.
+
+#### zPages
+
+The
+[zPages extension](https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension/README.md),
+which is exposed locally on port `55679`, can be used to inspect live data from
+the Collector's receivers and exporters.
+
+The TraceZ page, exposed at `/debug/tracez`, is useful for debugging trace
+operations, such as:
+
+- Latency issues. Find the slow parts of an application.
+- Deadlocks and instrumentation problems. Identify running spans that don't end.
+- Errors. Determine what types of errors are occurring and where they happen.
+
+Note that `zpages` might contain error logs that the Collector does not emit
+itself.
+
+For containerized environments, you might want to expose this port on a public
+interface instead of just locally. The `endpoint` can be configured using the
+`extensions` configuration section:
+
+```yaml
+extensions:
+  zpages:
+    endpoint: 0.0.0.0:55679
+```
+
 ## Checklist for debugging complex pipelines
 
 It can be difficult to isolate problems when telemetry flows through multiple
-collectors and networks. For each "hop" of telemetry data through a collector or
-other component in your telemetry pipeline, it’s important to verify the
-following:
+Collectors and networks. For each "hop" of telemetry through a Collector or
+other component in your pipeline, it’s important to verify the following:
 
-- Are there error messages in the logs of the collector?
+- Are there error messages in the logs of the Collector?
 - How is the telemetry being ingested into this component?
-- How is the telemetry being modified (i.e. sampling, redacting) by this
-  component?
+- How is the telemetry being modified (for example, sampling or redacting) by
+  this component?
 - How is the telemetry being exported from this component?
 - What format is the telemetry in?
 - How is the next hop configured?
 - Are there any network policies that prevent data from getting in or out?
 
-### More
+## Common Collector issues
+
+This section covers how to resolve common Collector issues.
+
+### Collector is experiencing data issues
+
+The Collector and its components might experience data issues.
+
+#### Collector is dropping data
+
+The Collector might drop data for a variety of reasons, but the most common are:
+
+- The Collector is improperly sized, resulting in an inability to process and
+  export the data as fast as it is received.
+- The exporter destination is unavailable or accepting the data too slowly.
+
+To mitigate drops, configure the
+[`batch` processor](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/batchprocessor/README.md).
+In addition, it might be necessary to configure the
+[queued retry options](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#configuration)
+on enabled exporters.
+
+#### Collector is not receiving data
+
+The Collector might not receive data for the following reasons:
+
+- A network configuration issue.
+- An incorrect receiver configuration.
+- An incorrect client configuration.
+- The receiver is defined in the `receivers` section but not enabled in any
+  `pipelines`.
+
+Check the Collector's
+[logs](/docs/collector/internal-telemetry/#configure-internal-logs) as well as
+[zPages](https://github.com/open-telemetry/opentelemetry-collector/blob/main/extension/zpagesextension/README.md)
+for potential issues.
+
+#### Collector is not processing data
+
+Most processing issues result from a misunderstanding of how the processor
+works or a misconfiguration of the processor.
For example:
+
+- The attributes processor works only for "tags" on spans. The span name is
+  handled by the span processor.
+- Processors for trace data (except tail sampling) work only on individual
+  spans.
+
+#### Collector is not exporting data
+
+The Collector might not export data for the following reasons:
+
+- A network configuration issue.
+- An incorrect exporter configuration.
+- The destination is unavailable.
+
+Check the Collector's
+[logs](/docs/collector/internal-telemetry/#configure-internal-logs) as well as
+[zPages](https://github.com/open-telemetry/opentelemetry-collector/blob/main/extension/zpagesextension/README.md)
+for potential issues.
+
+Exporting data often does not work because of a network configuration issue,
+such as a firewall, DNS, or proxy issue. Note that the Collector does have
+[proxy support](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#proxy-support).
+
+### Collector is experiencing control issues
+
+The Collector might experience failed startups or unexpected exits or restarts.
+
+#### Collector exits or restarts
+
+The Collector might exit or restart due to:
+
+- Memory pressure from a missing or misconfigured
+  [`memory_limiter` processor](https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md).
+- Improper sizing for load.
+- Improper configuration. For example, a queue size configured higher than
+  available memory.
+- Infrastructure resource limits. For example, Kubernetes.
+
+#### Collector fails to start in Windows Docker containers
 
-For detailed recommendations, including common problems, see
-[Troubleshooting](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md)
-from the Collector repository.
+With v0.90.1 and earlier, the Collector might fail to start in a Windows Docker
+container, producing the error message
+`The service process could not connect to the service controller`.
In this case,
+the `NO_WINDOWS_SERVICE=1` environment variable must be set to force the
+Collector to start as if it were running in an interactive terminal, without
+attempting to run as a Windows service.
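
Reviewer sketch, not part of the commit: one way to set the `NO_WINDOWS_SERVICE=1` variable described in the Windows Docker section is through a Docker Compose service definition. The patch does not mention Compose, so treat this as an assumption about the deployment. The service name and image reference are hypothetical placeholders.

```yaml
# Hypothetical Compose file. The service name and image are placeholders,
# not values taken from the patch.
services:
  otelcol:
    image: example.registry/otelcol-windows:0.90.1 # placeholder image
    environment:
      # Forces the Collector to start as if in an interactive terminal
      # instead of attempting to run as a Windows service.
      NO_WINDOWS_SERVICE: "1"
```

Quoting the value keeps it a string, which is what environment variables expect. Any other mechanism that sets the variable in the container's environment should have the same effect.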