Commit 1e55e89
Add TODOs and copy content from Collector repo
tiffany76 committed Apr 18, 2024
1 parent 342ac3b commit 1e55e89
Showing 1 changed file with 112 additions and 0 deletions.
112 changes: 112 additions & 0 deletions content/en/docs/collector/internal-telemetry.md
@@ -113,3 +113,115 @@ journalctl | grep otelcol | grep Error
```

{{% /tab %}} {{< /tabpane >}}

## Types of internal observability

<!--- TODO: Add intro sentence. --->

<!--- TODO: Figure out which of these values are available now and which are still on the roadmap. --->

### Current values that need observation

- Resource consumption: CPU, RAM (and, if we implement persistent queues in the
  future, also IO), as well as any other metrics that may be available to Go
  apps (e.g. garbage collection statistics).

- Receiving data rate, broken down by receivers and by data type
(traces/metrics).

- Exporting data rate, broken down by exporters and by data type
(traces/metrics).

- Data drop rate due to throttling, broken down by data type.

- Data drop rate due to invalid data received, broken down by data type.

- Current throttling state: Not Throttled/Throttled by Downstream/Internally
Saturated.

- Incoming connection count, broken down by receiver.

- Incoming connection rate (new connections per second), broken down by
receiver.

- In-memory queue size (in bytes and in units). Note: measurements in bytes may
  be difficult or expensive to obtain and should be used cautiously.

- Persistent queue size (when supported).

- End-to-end latency (from receiver input to exporter output). Note that with
multiple receivers/exporters we potentially have NxM data paths, each with
different latency (plus different pipelines in the future), so realistically
we should likely expose the average of all data paths (perhaps broken down by
pipeline).

- Latency broken down by pipeline elements (including exporter network roundtrip
latency for request/response protocols).

“Rate” values must reflect the average rate of the last 10 seconds. Rates must
be exposed in bytes/sec and units/sec (e.g. spans/sec).

Note: some of the current values and rates may be calculated as derivatives of
cumulative values in the backend (as sketched below), so it is an open question
whether we want to expose them separately or not.
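
As a rough illustration, such a rate can be derived by sampling a cumulative
counter twice and dividing the delta by the interval. This is a minimal sketch,
assuming the internal metrics are exposed on the default Prometheus endpoint at
`localhost:8888`; the metric name is only an example.

```sh
# Sample a cumulative span counter twice, 10 seconds apart, and compute the
# average rate in spans/sec, summed across receivers. Assumes the default
# internal metrics endpoint on localhost:8888; the metric name is an example.
before=$(curl -s http://localhost:8888/metrics |
  awk '/^otelcol_receiver_accepted_spans/ {sum += $2} END {print sum}')
sleep 10
after=$(curl -s http://localhost:8888/metrics |
  awk '/^otelcol_receiver_accepted_spans/ {sum += $2} END {print sum}')
echo "$before $after" | awk '{printf "spans/sec: %.1f\n", ($2 - $1) / 10}'
```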

### Cumulative values that need observation

- Total received data, broken down by receivers and by data type
(traces/metrics).

- Total exported data, broken down by exporters and by data type
(traces/metrics).

- Total dropped data due to throttling, broken down by data type.

- Total dropped data due to invalid data received, broken down by data type.

- Total incoming connection count, broken down by receiver.

- Uptime since start.
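
Many of these totals are already exposed as cumulative counters on the internal
metrics endpoint. For example, uptime can be checked as follows, assuming the
default endpoint on port 8888; the exact metric name may differ between
Collector versions.

```sh
# Report how long the Collector process has been running.
curl -s http://localhost:8888/metrics | grep '^otelcol_process_uptime'
```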

### Trace or log on events

We want to generate the following events (log and/or send as a trace with
additional data):

- Collector started/stopped.

- Collector reconfigured (if we support on-the-fly reconfiguration).

- Begin dropping due to throttling (include throttling reason, e.g. local
saturation, downstream saturation, downstream unavailable, etc).

- Stop dropping due to throttling.

- Begin dropping due to invalid data (include sample/first invalid data).

- Stop dropping due to invalid data.

- Crash detected (differentiate clean stopping and crash, possibly include crash
data if available).

For begin/stop events we need to define an appropriate hysteresis to avoid
generating too many events. Note that begin/stop events cannot be detected in
the backend simply as derivatives of current rates, because the events include
additional data that is not present in the current value.
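
Some of these events already appear in the Collector's logs. For example, on a
systemd host they can be found with the same kind of log search shown earlier;
the unit name and message text below are illustrative and may differ between
versions and deployments.

```sh
# Look for startup and shutdown events in the Collector's logs on a systemd
# host. Unit name and message text are examples only.
journalctl -u otelcol | grep -Ei 'starting|shutdown'
```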

### Host metrics

The service should collect host resource metrics in addition to its own process
metrics. This can help determine whether a problem observed in the service is
caused by a different process running on the same host.
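
For comparison, the Collector's own process metrics are already available on
the internal metrics endpoint. This is a minimal check, assuming the default
endpoint on port 8888; host-level metrics would typically come from a separate
source, such as the host metrics receiver.

```sh
# Inspect the Collector's own process metrics (CPU time, memory, runtime stats).
# Host-level metrics are not included here; they would come from a separate
# source such as the host metrics receiver.
curl -s http://localhost:8888/metrics | grep '^otelcol_process_'
```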

### Data ingress

The `otelcol_receiver_accepted_spans` and
`otelcol_receiver_accepted_metric_points` metrics provide information about the
data ingested by the Collector.
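
A quick way to check these counters on a running Collector, assuming the
default internal metrics endpoint on port 8888:

```sh
# Show spans and metric points accepted so far, broken down by receiver.
curl -s http://localhost:8888/metrics |
  grep -E '^otelcol_receiver_accepted_(spans|metric_points)'
```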

### Data egress

The `otelcol_exporter_sent_spans` and `otelcol_exporter_sent_metric_points`
metrics provide information about the data exported by the Collector.
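
These can be checked the same way, assuming the default internal metrics
endpoint on port 8888:

```sh
# Show spans and metric points successfully sent so far, broken down by exporter.
curl -s http://localhost:8888/metrics |
  grep -E '^otelcol_exporter_sent_(spans|metric_points)'
```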

<!--- TODO: Breakdown by signal and add definitions. Include extensions here? --->
