diff --git a/docs/sources/tempo/configuration/grafana-agent/service-graphs.md b/docs/sources/tempo/configuration/grafana-agent/service-graphs.md index 457425c900e..a80ddc93fee 100644 --- a/docs/sources/tempo/configuration/grafana-agent/service-graphs.md +++ b/docs/sources/tempo/configuration/grafana-agent/service-graphs.md @@ -1,6 +1,7 @@ --- title: Enable service graphs menuTitle: Enable service graphs +description: Service graphs help to understand the structure of a distributed system, and the connections and dependencies between its components. weight: aliases: - /docs/tempo/grafana-agent/service-graphs @@ -60,4 +61,5 @@ metrics: The same service graph metrics can also be generated by Tempo. This is more efficient and recommended for larger installations. -For additional information about viewing service graph metrics in Grafana and calculating cardinality, check out the [server side documentation](../../metrics-generator/service_graphs#grafana). + +For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs#enable-service-graphs-in-Grafana" >}}). diff --git a/docs/sources/tempo/metrics-generator/_index.md b/docs/sources/tempo/metrics-generator/_index.md index 41084e589de..808cc7a6ac2 100644 --- a/docs/sources/tempo/metrics-generator/_index.md +++ b/docs/sources/tempo/metrics-generator/_index.md @@ -3,6 +3,7 @@ aliases: - /docs/tempo/latest/server_side_metrics/ - /docs/tempo/latest/metrics-generator/ title: Metrics-generator +description: Metrics-generator is an optional Tempo component that derives metrics from ingested traces. weight: 500 --- @@ -12,7 +13,9 @@ Metrics-generator is an optional Tempo component that derives metrics from inges If present, the distributor will write received spans to both the ingester and the metrics-generator. The metrics-generator processes spans and writes metrics to a Prometheus data source using the Prometheus remote write protocol. ->**Note**: Enabling metrics generation and remote writing them to Grafana Cloud Metrics produces extra active series that could impact your billing. For more information on billing, refer to [Billing and usage](https://grafana.com/docs/grafana-cloud/billing-and-usage/). +{{% admonition type="note" %}} +Enabling metrics generation and remote writing them to Grafana Cloud Metrics produces extra active series that could impact your billing. For more information on billing, refer to [Billing and usage](/docs/grafana-cloud/billing-and-usage/). +{{% /admonition %}} ## Overview @@ -20,14 +23,14 @@ Metrics-generator leverages the data available in Tempo's ingest path to provide The metrics-generator internally runs a set of **processors**. Each processor ingests spans and produces metrics. -Every processor derives different metrics. Currently the following processors are available: +Every processor derives different metrics. Currently, the following processors are available: - Service graphs - Span metrics
[Figure: Service metrics architecture]
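As a rough orientation, the sketch below shows one way this could be wired up in the Tempo configuration, assuming a Prometheus-compatible endpoint at a placeholder URL and a placeholder WAL path; the per-tenant `metrics_generator_processors` override is what actually turns the processors on. Treat it as a minimal sketch and refer to the [Metrics-generator]({{< relref "../configuration#metrics-generator" >}}) configuration documentation for the authoritative set of options.

```yaml
# Minimal sketch -- URL, path, and interval values are illustrative placeholders.
metrics_generator:
  registry:
    collection_interval: 15s            # how often generated series are collected and written
  storage:
    path: /var/tempo/generator/wal      # placeholder path for the generator's local WAL
    remote_write:
      - url: http://prometheus:9090/api/v1/write   # placeholder Prometheus-compatible endpoint
        send_exemplars: true

overrides:
  metrics_generator_processors:         # per-tenant override that enables the processors
    - service-graphs
    - span-metrics
```

If the `metrics_generator_processors` override is left empty for a tenant, the component runs but generates no metrics for that tenant.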
-### Service graphs +## Service graphs Service graphs are the representations of the relationships between services within a distributed system. @@ -35,9 +38,9 @@ This service graphs processor builds a map of services by analyzing traces, with Edges are spans with a parent-child relationship, that represent a jump (e.g. a request) between two services. The amount of request and their duration are recorded as metrics, which are used to represent the graph. -To learn more about this processor, read the [documentation]({{< relref "service_graphs" >}}). +To learn more about this processor, read the [documentation]({{< relref "./service_graphs" >}}). -### Span metrics +## Span metrics The span metrics processor derives RED (Request, Error and Duration) metrics from spans. @@ -45,13 +48,13 @@ The span metrics processor will compute the total count and the duration of span Dimensions can be the service name, the operation, the span kind, the status code and any tag or attribute present in the span. The more dimensions are enabled, the higher the cardinality of the generated metrics. -To learn more about this processor, read the [documentation]({{< relref "span_metrics" >}}). +To learn more about this processor, read the [documentation]({{< relref "./span_metrics" >}}). -### Remote writing metrics +## Remote writing metrics The metrics-generator runs a Prometheus Agent that periodically sends metrics to a `remote_write` endpoint. The `remote_write` endpoint is configurable and can be any [Prometheus-compatible endpoint](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write). -To learn more about the endpoint configuration, refer to the [Metrics-generator]({{< relref "../configuration/#metrics-generator" >}}) section of the Tempo Configuration documentation. +To learn more about the endpoint configuration, refer to the [Metrics-generator]({{< relref "../configuration#metrics-generator" >}}) section of the Tempo Configuration documentation. Writing interval can be controlled via `metrics_generator.registry.collection_interval`. -When multi-tenancy is enabled, the metrics-generator forwards the `X-Scope-OrgID` header of the original request to the remote_write endpoint. +When multi-tenancy is enabled, the metrics-generator forwards the `X-Scope-OrgID` header of the original request to the `remote_write` endpoint. \ No newline at end of file diff --git a/docs/sources/tempo/metrics-generator/active-series.md b/docs/sources/tempo/metrics-generator/active-series.md new file mode 100644 index 00000000000..33776a5214b --- /dev/null +++ b/docs/sources/tempo/metrics-generator/active-series.md @@ -0,0 +1,124 @@ +--- +aliases: +- /docs/tempo/latest/metrics-generator/active-series +title: Active series +menuTitle: Active series +description: Learn about active series and how they are calculated. +weight: 100 +--- + +# Active series + +An active series is a time series that receives new data points or samples. When you stop writing new data points to a time series, shortly afterwards it is no longer considered active. + +Metrics generated by Tempo's metrics generator can provide both RED (Rate/Error/Duration) metrics and interdependency graphs between services in a trace (the Service Graph functionality in Grafana). +These capabilities rely on a set of generated span metrics and service metrics. + +Any spans that are ingested by Tempo can create many metric series. However, this doesn't mean that every time a span is ingested that a new active series is created. 
+ +The number of active series generated depends on the label pairs generated from span data that are associated with the metrics, similar to other Prometheus-formated data. + +For additional information, refer to the [Active series and DPM documentation](/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#active-series). + +## Active series calculation + +Active series for a metric increase when a new value for a label key is introduced. For example, the `span_kind` label has a total of five possible values, and the `status_code` label has a total of three possible values. + +At first glance, you might make an assumption that this means that at least 15 (5*3) active series will be generated for each span. But this isn't the case. + +Let's consider a span that's emitted from some piece of code in a service: + +![Single span visualization](/static/img/docs/tempo/SingleSpan.jpeg) + +Here's a single service with a single span. +If the code inside the span never leaves the service, then the `span_kind` label generated by the metrics generator will be `SPAN_KIND_INTERNAL` and never deviate. It'll never be one of the other four possible values. + +Similarly, if the code inside the span never errors, it'll only have the `STATUS_CODE_OK` state for the `span_status` label. +This means that the metrics generator will only generate a single active series, where the service name will be _Service 1_ and the span name will be _span1_. +If we looked at the Prometheus data for the `traces_spanmetrics_call_total` metric, we'd see a single active series: + +| service | span_name | span_kind | status_code | Metric value | +| --------- | --------- | ------------------ | -------------- | ------------ | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 1 | + +It doesn't matter how many times that span occurs in a trace either, for example maybe a span is generated within a loop. +In code run once, 10 times, 100 times, 1000 times, only a single active series will be produced, where a counter might be increased 1, 10, 100, or 1000 times: + +![Single span with loop](/static/img/docs/tempo/SingleSpanLoop.jpeg) + +If you looked at the Prometheus data, you'd see an instant value for `traces_spanmetrics_call_total` similar to the table. Again, one active series for the metric: + +| service | span_name | span_kind | status_code | Metric value | +| --------- | --------- | ------------------ | -------------- | ------------ | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 120 | + + +However, let's now assume that it does loop and there are occasionally errors. + +![Single span with loop and errors](/static/img/docs/tempo/SinglespanLoopError.jpeg) + +There are now two potential outcomes for a span when the code loops: one where everything successfully completes and one where there is an error. +This means that when the span completes `status_code` is now either `STATUS_CODE_OK` or `STATUS_CODE_ERROR`. +Because of that, the label values can be one of two values on a metric, and we now have two active series being generated based on the `status_code`, one for the `OK` status and one for the error. + +Again, we could loop once, 10 times, 100, or more times, but there will only ever be two active series. 
+ +If we now looked at Prometheus instant values for `traces_spanmetrics_call_total`, we'd now see the following table: + +| service | span_name | span_kind | status_code | Metric value | +| --------- | --------- | ------------------ | ----------------- | ------------ | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 96 | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 24 | + +What happens if you call out to another service though? Let's add an option where, based on some arbitrary data, we sometimes make a downstream call to another service, but otherwise continue to runs loops in our own service: + +![Multiple spans with loops and errors](/static/img/docs/tempo/SingleSpanLoopErrorAnotherService.jpeg) + +In this scenario, `span1`'s `span_kind` label would now be one of either `SPAN_KIND_INTERNAL` or `SPAN_KIND_CLIENT` (as it has acted as a client calling a downstream server). +If a call to the downstream service could also potentially fail, then for `SPAN_KIND_CLIENT`, the `status_code` could be either `STATUS_CODE_ERROR` or `STATUS_CODE_OK`. + +At this point, `traces_spanmetrics_call_total` would have four different variations in labels: + +| service | span_name | span_kind | status_code | Metric value | +| --------- | --------- | ------------------ | ----------------- | ------------ | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 34 | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 6 | +| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_OK | 23 | +| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_ERROR | 3 | + +Because of the variation in values, we now have four active series for our metric instead of one. But, as far as Service 1 is concerned, there's still only four active series, because there isn't any other variation of the values for labels. You can run 1 trace, 10 traces, 100 traces (each with however many loops of spans there are) and only four active series will ever be produced. + +We've actually only told half the story in our last diagram. _Service 1_ called a second service, _Service 2_, which continues the trace by adding a new span, `span2`. +If there was a loop inside Service 2 with a single span that was generated from an upstream call from Service 1, and then a number of spans that were driven internally, which could also error, we'd end up with the possible values in the metric for `traces_spanmetrics_call_total` below: + +| service | span_name | span_kind | status_code | Metric value | +| --------- | --------- | ------------------ | ----------------- | ------------ | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 89 | +| Service 1 | span1 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 13 | +| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_OK | 44 | +| Service 1 | span1 | SPAN_KIND_CLIENT | STATUS_CODE_ERROR | 9 | +| Service 2 | span2 | SPAN_KIND_SERVER | STATUS_CODE_OK | 30 | +| Service 2 | span2 | SPAN_KIND_SERVER | STATUS_CODE_ERROR | 14 | +| Service 2 | span2 | SPAN_KIND_INTERNAL | STATUS_CODE_OK | 99 | +| Service 2 | span2 | SPAN_KIND_INTERNAL | STATUS_CODE_ERROR | 23 | + +At this point, all our traces will be composed of two potential span names, each of which produce two separate types of `span_kind` and two separate types of `status_code`. So we have eight active series for a metric. 
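To summarize the counting so far in one place (an approximation that assumes every listed label combination is actually observed at least once), the number of active series for `traces_spanmetrics_call_total` can be written as:

$$
\text{active series} \approx \sum_{\text{span names}} \big(\#\ \texttt{span\_kind}\ \text{values seen}\big) \times \big(\#\ \texttt{status\_code}\ \text{values seen}\big)
$$

In the example above, `span1` and `span2` each contribute 2 × 2 = 4 combinations, which is where the eight series in the table come from.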
+
+The variability of values for each potential span condition determines the number of active series produced by Tempo when ingesting spans for a trace, not the number of traces or spans that are seen.
+
+## Custom span attributes
+
+There's another consideration for active series: extra label key/value pairs that can be added to metrics from a span's attributes.
+The Tempo metrics-generator lets you turn arbitrary span attributes into label pairs for metrics.
+When considering the number of active series generated, you also need to determine how many possible values there are for the span attribute being turned into a label.
+
+For example, if you add an `http.method` span attribute as a metric label pair, there are five possible values (because there are five possible REST methods):
+
+- `HEAD`
+- `GET`
+- `POST`
+- `PUT`
+- `DELETE`
+
+If this label pair is added to every span metric, that's another 5 *potential* active series generated for each metric (in all likelihood this is a worst-case scenario; very few spans will use all five REST methods).
+Instead of 8 active series in the last table above, we'd have 40 (8 * 5).
\ No newline at end of file
diff --git a/docs/sources/tempo/metrics-generator/cardinality.md b/docs/sources/tempo/metrics-generator/cardinality.md
new file mode 100644
index 00000000000..279ca4b41f9
--- /dev/null
+++ b/docs/sources/tempo/metrics-generator/cardinality.md
@@ -0,0 +1,50 @@
+---
+aliases:
+- /docs/tempo/latest/metrics-generator/cardinality
+title: Cardinality
+menuTitle: Cardinality
+description: What cardinality is and how it is impacted by metrics generation.
+weight: 100
+---
+
+# Cardinality
+
+Cardinality refers to the total combination of key/value pairs, such as labels and label values for a given metric series or log stream, and how many unique combinations they generate.
+For more information on cardinality, see the [What are cardinality spikes and why do they matter?](/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/) blog post.
+
+Because writes to a time-series database (TSDB) are in series, high cardinality does not make a big difference to performance at ingest.
+However, cardinality can have a major impact on querying: the higher the cardinality, the more items have to be iterated over.
+
+## Traces collection and metrics
+
+Tempo's server-side metrics generation adds functionality to the collection of traces by creating Prometheus-based metrics such as:
+
+- Total span call counts
+- Span latency histograms
+- Total span size count
+
+The metrics-generator creates metrics that define the relationships between services via edges and nodes.
+Each of these metrics is queryable using a set of Prometheus labels (key/value pairs).
+
+Each new value for a label increases the number of active series associated with a metric. (To learn more about active series, read the [Trace active series]({{< relref "./active-series" >}}) documentation.)
+
+This is also known as an increase in cardinality: the number of active series generated for a metric is directly proportional to the number of labels that exist for that metric and the number of values each label can take.
+
+In an unmodified instance of the metrics-generator, a small number of labels are added automatically.
+Because labels like `span_kind` and `status_code` only have a few valid values, the largest variable for the number of active series produced for each metric depends on the number of service names and span names associated with trace spans.
+
+The metrics-generator can also be configured to add extra labels on metrics, using span attribute key/value pairs that are mapped directly to these labels; see the [custom span attribute documentation]({{< relref "../configuration#metrics-generator" >}}).
+
+Be careful when configuring custom attributes: the greater the number of values seen for a specific attribute, the greater the number of active series produced. For more information about active series, refer to the [active series documentation]({{< relref "./active-series" >}}).
+
+Let's say that you add a custom attribute that includes unique customer IDs as a metrics label. If you have 100 customers, this could potentially multiply the number of active series generated by up to 100 (for example, going from 25,000 active series to 2.5M).
+Always consider which attributes will actually be useful as labels for querying metrics, as well as the cardinality by which they will increase those metrics.
+
+## Dry-running the metrics-generator
+
+Often the most reliable solution is to run the metrics-generator in a dry-run mode.
+Dry-run mode generates metrics but does not collect them, and thus does not write them to metrics storage.
+The override `metrics_generator_disable_collection` is defined for this use case.
+
+To get an estimate, run the metrics-generator normally and set the override to `true`.
+Then, check `tempo_metrics_generator_registry_active_series` to get an estimation of the active series for that setup.
\ No newline at end of file
diff --git a/docs/sources/tempo/metrics-generator/service-graph-view.md b/docs/sources/tempo/metrics-generator/service-graph-view.md
index 462c8947bdb..741ce7a961d 100644
--- a/docs/sources/tempo/metrics-generator/service-graph-view.md
+++ b/docs/sources/tempo/metrics-generator/service-graph-view.md
@@ -3,7 +3,7 @@ title: Service graph view
 menuTitle: Service graph view
 aliases:
 - /docs/tempo/latest/metrics-generator/app-performance-mgmt
-weight: 200
+weight: 400
 ---
 
 # Service graph view
diff --git a/docs/sources/tempo/metrics-generator/service_graphs.md b/docs/sources/tempo/metrics-generator/service_graphs.md
index 60a269f9190..f799107a151 100644
--- a/docs/sources/tempo/metrics-generator/service_graphs.md
+++ b/docs/sources/tempo/metrics-generator/service_graphs.md
@@ -3,7 +3,8 @@ aliases:
 - /docs/tempo/latest/server_side_metrics/service_graphs/
 - /docs/tempo/latest/metrics-generator/service_graphs/
 title: Service graphs
-weight: 500
+description: Service graphs help you understand the structure of a distributed system and the connections and dependencies between its components.
+weight: 300
 ---
 
 # Service graphs
@@ -25,7 +26,7 @@ and the connections and dependencies between its components:
 
 ## How they work
 
-The metrics-generator processes traces and generates service graphs in the form of prometheus metrics.
+The metrics-generator processes traces and generates service graphs in the form of Prometheus metrics.
 Service graphs work by inspecting traces and looking for spans with parent-children relationship that represent a request.
The processor uses the [OpenTelemetry semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/README.md) to detect a myriad of requests. @@ -66,27 +67,28 @@ Since the service graph processor has to process both sides of an edge, it needs to process all spans of a trace to function properly. If spans of a trace are spread out over multiple instances, spans are not paired up reliably. -## Cardinality +## Estimate cardinality from traces Cardinality can pose a problem when you have lots of services. There isn't a direct formula or solution to this issue. -But the following guide should help estimate the cardinality that the feature will generate. +The following guide should help estimate the cardinality that the feature will generate. -### How to estimate the cardinality +For more information on cardinality, refer to the [Cardinality]({{< relref "./cardinality" >}}) documentation. -#### Cardinality from traces +### How to estimate the cardinality The amount of edges depends on the number of nodes in the system and the direction of the requests between them. Let’s call this amount hops. Every hop will be a unique combination of client + server labels. For example: - A system with 3 nodes `(A, B, C)` of which A only calls B and B only calls C will have 2 hops `(A → B, B → C)` -- A system with 3 nodes `(A, B, C)` that call each other (i.e. all bidirectional links somehow) will have 6 hops `(A → B, B → A, B → C, C → B, A → C, C → A)` +- A system with 3 nodes `(A, B, C)` that call each other (i.e., all bidirectional link) will have 6 hops `(A → B, B → A, B → C, C → B, A → C, C → A)` We can’t calculate the amount of hops automatically based upon the nodes, but it should be a value between `#services - 1` and `#services!`. -If we know the amount of hops in a system, we can calculate the cardinality of the generated service graphs: +If we know the amount of hops in a system, we can calculate the cardinality of the generated +[service graphs]({{< relref "./service_graphs" >}}): ``` traces_service_graph_request_total: #hops @@ -103,14 +105,9 @@ Finally, we get the following cardinality estimation: Sum: 8 * #hops + 2 * #services ``` -#### Dry-running the metrics-generator - -An often most reliable solution is by running the metrics-generator in a dry-run mode. -That is generating metrics but not collecting them, thus not writing them to a metrics storage. -The override `metrics_generator_disable_collection` is defined for this use-case. - -To get an estimate, run the metrics-generator normally and set the override to `true`. -Then, check `tempo_metrics_generator_registry_active_series` to get an estimation of the active series for that set-up. +{{% admonition type="note" %}} +To estimate the number of metrics, refer to the [Dry run metrics generator]({{< relref "./cardinality" >}}) documentation. +{{% /admonition %}} ## How to run @@ -118,14 +115,21 @@ Service graphs are generated in Tempo and pushed to a metrics storage. Then, they can be represented in Grafana as a graph. You will need those components to fully use service graphs. -### Tempo +{{% admonition type="note" %}} +Cardinality can pose a problem when you have lots of services. +To learn more about cardinality and how to perform a dry run of the metrics generator, see the [Cardinality documentation]({{< relref "./cardinality" >}}). 
+{{% /admonition %}} + +### Enable service graphs in Tempo/GET -To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `service-graphs` generator. See [here for configuration details]({{< relref "../configuration/#metrics-generator" >}}). +To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `service-graphs` generator. See [here for configuration details]({{< relref "../configuration#metrics-generator" >}}). -### Grafana +### Enable service graphs in Grafana -**Note** Since 9.0.4 service graphs have been enabled by default in Grafana. Prior to Grafana 9.0.4, service graphs were hidden -under the [feature toggle](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles) `tempoServiceGraph`. +{{% admonition type="note" %}} +Since Grafana 9.0.4, service graphs have been enabled by default. Prior to Grafana 9.0.4, service graphs were hidden +under the [feature toggle](/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles) `tempoServiceGraph`. +{{% /admonition %}} Configure a Tempo data source's 'Service Graphs' by linking to the Prometheus backend where metrics are being sent: @@ -149,5 +153,4 @@ datasources: serviceMap: datasourceUid: 'prometheus' version: 1 -``` - +``` \ No newline at end of file diff --git a/docs/sources/tempo/metrics-generator/span_metrics.md b/docs/sources/tempo/metrics-generator/span_metrics.md index bec31b11236..a1d185cafb1 100644 --- a/docs/sources/tempo/metrics-generator/span_metrics.md +++ b/docs/sources/tempo/metrics-generator/span_metrics.md @@ -1,18 +1,20 @@ --- aliases: -- /docs/tempo/latest/server_side_metrics/span_metrics/ -- /docs/tempo/latest/metrics-generator/span_metrics/ -title: Generate metrics from spans -weight: 400 + - /docs/tempo/latest/server_side_metrics/span_metrics/ + - /docs/tempo/latest/metrics-generator/span_metrics/ +title: Span metrics +description: The span metrics processor generates metrics from ingested tracing data, including request, error, and duration (RED) metrics. +weight: 200 --- -# Generate metrics from spans +# Span metrics The span metrics processor generates metrics from ingested tracing data, including request, error, and duration (RED) metrics. Span metrics generate two metrics: -* A counter that computes requests -* A histogram that tracks the distribution of durations of all requests + +- A counter that computes requests +- A histogram that tracks the distribution of durations of all requests Span metrics are of particular interest if your system is not monitored with metrics, but it has distributed tracing implemented. @@ -22,14 +24,14 @@ Even if you already have metrics, span metrics can provide in-depth monitoring o The generated metrics will show application level insight into your monitoring, as far as tracing gets propagated through your applications. -Last but not least, span metrics lower the entry barrier for using [exemplars](https://grafana.com/docs/grafana/latest/basics/exemplars/). +Last but not least, span metrics lower the entry barrier for using [exemplars](/docs/grafana/latest/basics/exemplars/). An exemplar is a specific trace representative of measurement taken in a given time interval. Since traces and metrics co-exist in the metrics-generator, exemplars can be automatically added, providing additional value to these metrics. 
## How to run -To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `span-metrics` generator. See [here for configuration details]({{< relref "../configuration/#metrics-generator" >}}). +To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `span-metrics` generator. See [here for configuration details]({{< relref "../configuration#metrics-generator" >}}). ## How it works @@ -38,12 +40,16 @@ Dimensions can be the service name, the operation, the span kind, the status cod This processor is designed with the goal to mirror the implementation from the OpenTelemetry Collector of the [processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanmetricsprocessor) with the same name. +{{% admonition type="note" %}} +To learn more about cardinality and how to perform a dry run of the metrics generator, see the [Cardinality documentation]({{< relref "./cardinality" >}}). +{{% /admonition %}} + ### Metrics The following metrics are exported: | Metric | Type | Labels | Description | -|--------------------------------|-----------|------------|------------------------------| +| ------------------------------ | --------- | ---------- | ---------------------------- | | traces_spanmetrics_latency | Histogram | Dimensions | Duration of the span | | traces_spanmetrics_calls_total | Counter | Dimensions | Total count of the span | | traces_spanmetrics_size_total | Counter | Dimensions | Total size of spans ingested | @@ -52,12 +58,32 @@ The following metrics are exported: In Tempo 1.4 and 1.4.1, the histogram metric was called `traces_spanmetrics_duration_seconds`. This was changed later to be consistent with the metrics generated by the Grafana Agent and the OpenTelemetry Collector. {{% /admonition %}} -By default, the metrics processor adds the following labels to each metric: `service`, `span_name`, `span_kind`, `status_code`, `status_message`. -Additional user defined labels can be created using the [`dimensions` configuration option]({{< relref "../configuration/#metrics-generator" >}}). +By default, the metrics processor adds the following labels to each metric: `service`, `span_name`, `span_kind`, `status_code`, `status_message`, `job`, and `instance`. 
+ +- `service` - The name of the service that generated the span +- `span_name` - The unique name of the span +- `span_kind` - The type of span, this can be one of five values: + - `SPAN_KIND_SERVER` - The span was generated by a call from another service + - `SPAN_KIND_CLIENT` - The span made a call to another service + - `SPAN_KIND_INTERNAL` - The span does not have interaction outside of the service it was generated in + - `SPAN_KIND_PUBLISHER` - The span created data that was pushed onto a bus or message broker + - `SPAN_KIND_CONSUMER` - The span consumed data that was on a bus or messaging system +- `status_code` - The result of the span, this can be one of three values: + - `STATUS_CODE_UNSET` - Result of the span was unset/unknown + - `STATUS_CODE_OK` - The span operation completed successfully + - `STATUS_CODE_ERROR` - The span operation completed with an error +- `status_message` (optionally enabled) - The message that details the reason for the `status_code` label +- `job` - The name of the job, a combination of namespace and service; only added if `metrics_generator_processor_span_metrics_enable_target_info: true` +- `instance` - The instance ID; only added if `metrics_generator_processor_span_metrics_enable_target_info: true` + +Additional user defined labels can be created using the [`dimensions` configuration option]({{< relref "../configuration#metrics-generator" >}}). When a configured dimension collides with one of the default labels (e.g. `status_code`), the label for the respective dimension is prefixed with double underscore (i.e. `__status_code`). -If you use ratio based sampler you can use custom sampler below to not lose metric information, you also need to set `metrics_generator.processor.span_metrics.span_multiplier_key` to `"X-SampleRatio"` +Custom labeling of dimensions is also supported using the [`dimension_mapping` configuration option]({{< relref "../configuration#metrics-generator" >}}). + +An optional metric called `traces_target_info` using all resource level attributes as dimensions can be enabled in the [`enable_target_info` configuration option]({{< relref "../configuration#metrics-generator" >}}). +If you use a ratio-based sampler, you can use the custom sampler below to not lose metric information. However, you also need to set `metrics_generator.processor.span_metrics.span_multiplier_key` to `"X-SampleRatio"`. ```go package tracer @@ -95,4 +121,4 @@ func (ds RatioBasedSampler) Description() string { ## Example -
[Figure: Span metrics overview]
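Tying the options above together, the following is a hedged sketch of per-tenant overrides that enable the span-metrics processor, add two extra dimensions taken from span attributes, and turn on `traces_target_info`. The attribute names are only examples, and the exact override keys should be checked against the [configuration documentation]({{< relref "../configuration#metrics-generator" >}}) for your Tempo version.

```yaml
overrides:
  metrics_generator_processors:
    - span-metrics
  # Each attribute listed here becomes an extra label; every distinct value adds active series.
  metrics_generator_processor_span_metrics_dimensions:
    - http.method
    - http.target
  # Also emit traces_target_info and the job/instance labels described above (assumed override name).
  metrics_generator_processor_span_metrics_enable_target_info: true
```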
diff --git a/docs/sources/tempo/operations/_index.md b/docs/sources/tempo/operations/_index.md index a5509922fea..a1b3896ebb5 100644 --- a/docs/sources/tempo/operations/_index.md +++ b/docs/sources/tempo/operations/_index.md @@ -1,7 +1,8 @@ --- -title: Manage +title: Manage Tempo menuTitle: Manage -weight: 600 +description: Learn how to manage and tune Tempo. +weight: 450 --- # Manage Tempo @@ -10,4 +11,4 @@ This section provides resources for managing and tuning Tempo. {{< section >}} -In addition, the [Tempo runbooks](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md) can help with remediating operational issues. \ No newline at end of file +In addition, the [Tempo runbooks](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md) can help with remediating operational issues. diff --git a/docs/sources/tempo/operations/best-practices.md b/docs/sources/tempo/operations/best-practices.md new file mode 100644 index 00000000000..5218445210c --- /dev/null +++ b/docs/sources/tempo/operations/best-practices.md @@ -0,0 +1,50 @@ +--- +title: Best practices for traces +menuTitle: Best practices +description: Learn about the best practices for traces +weight: 20 +--- + +# Best practices for traces + +This page provides some general best practices for tracing. + +## Span and resource attributes + +[Traces]({{< relref "../traces" >}}) are built from spans, which denote units of work such as a call to, or from, an upstream service. Spans are constructed primarily of span and resource attributes. +Spans also have a hierarchy, where parent spans can have children or siblings. + +In the screenshot below, the left side of the screen (1) shows the list of results for the query. The right side (2) lists each span that makes up the selected trace. + +
[Figure: Trace example]
+
+A **span attribute** is a key/value pair that provides context for its span. For example, if the span deals with calling another service via HTTP, an attribute could include the HTTP URL (maybe as the span attribute key `http.url`) and the HTTP status code returned (as the span attribute `http.status_code`). Span attributes can consist of varying, non-null types.
+
+Unlike a span attribute, a **resource attribute** is a key/value pair that describes the context of how the span was collected. Generally, these attributes describe the process that created the span.
+For example, this could be a set of resource attributes concerning a Kubernetes cluster, in which case you may see resource attributes such as `k8s.namespace`, `k8s.container_name`, and `k8s.cluster`.
+These can also include information on the libraries that were used to instrument the spans for a trace, or any other infrastructure information.
+
+For more information, read the [Attribute and Resource](https://opentelemetry.io/docs/specs/otel/overview/) sections in the OpenTelemetry specification.
+
+### Naming conventions for span and resource attributes
+
+When naming attributes, use consistent, nested namespaces to ensure that attribute keys are obvious to anyone observing the spans of a trace and that common attributes can be shared by spans.
+Using our example from above, the `http` prefix of the attribute is the namespace, and `url` and `status_code` are keys within that namespace.
+Attributes can also be nested; for example, `http.url.protocol` might be `HTTP` or `HTTPS`, whereas `http.url.path` might be `/api/v1/query`.
+
+For more details around semantic naming conventions, refer to the [Recommendations for OpenTelemetry Authors](https://opentelemetry.io/docs/specs/otel/common/attribute-naming/#recommendations-for-opentelemetry-authors) documentation.
+
+Some third-party libraries provide auto-instrumentation that generates spans and span attributes when included in a code base.
+
+For more information about instrumenting your app for tracing, refer to the [Instrument for distributed tracing](/docs/tempo/latest/getting-started/instrumentation/) documentation.
+
+## Determining where to add spans
+
+When instrumenting, determine the smallest piece of work that you need to observe in a trace for it to be of value, so that you don't over- (or under-) instrument.
+
+Creating a new span for any work that has a relatively significant duration allows the observation of a trace to immediately show where significant amounts of time are spent during the processing of a request into your application or system.
+
+For example, a call to another service (either instrumented or not) may take an unknown amount of time to complete; adding a span for that call makes it easy to see when services are taking longer than expected.
+
+Adding a span for a piece of work that might call many other functions in a loop is a good signal of how long that loop is taking (you might add a span attribute that counts how many times the loop runs, to determine whether the duration is acceptable).
+However, adding a span for each method or function call in that loop might not be, as it might produce hundreds or thousands of worthless spans.
\ No newline at end of file diff --git a/docs/sources/tempo/traceql/_index.md b/docs/sources/tempo/traceql/_index.md index d8e1e744f89..998be663a80 100644 --- a/docs/sources/tempo/traceql/_index.md +++ b/docs/sources/tempo/traceql/_index.md @@ -2,7 +2,7 @@ title: TraceQL menuTitle: TraceQL description: Learn about TraceQL, Tempo's query language for traces -weight: 450 +weight: 600 aliases: - /docs/tempo/latest/traceql/ keywords: @@ -18,18 +18,18 @@ Inspired by PromQL and LogQL, TraceQL is a query language designed for selecting - Span and resource attributes, timing, and duration - Basic aggregates: `count()`, `avg()`, `min()`, `max()`, and `sum()` -Read the blog post, "[Get to know TraceQL](https://grafana.com/blog/2023/02/07/get-to-know-traceql-a-powerful-new-query-language-for-distributed-tracing/)," for an introduction to TraceQL and its capabilities. +Read the blog post, "[Get to know TraceQL](/blog/2023/02/07/get-to-know-traceql-a-powerful-new-query-language-for-distributed-tracing/)," for an introduction to TraceQL and its capabilities. {{< vimeo 796408188 >}} -For information on where the language is headed, see [future work](architecture). -The TraceQL language uses similar syntax and semantics as [PromQL](https://grafana.com/blog/2020/02/04/introduction-to-promql-the-prometheus-query-language/) and [LogQL](https://grafana.com/docs/loki/latest/logql/), where possible. +For information on where the language is headed, see [future work]({{< relref "./architecture" >}}). +The TraceQL language uses similar syntax and semantics as [PromQL](/blog/2020/02/04/introduction-to-promql-the-prometheus-query-language/) and [LogQL](/docs/loki/latest/logql/), where possible. -TraceQL requires Tempo’s Parquet columnar format to be enabled. For information on enabling Parquet, refer to the [Apache Parquet backend](https://grafana.com/docs/tempo/latest/configuration/parquet/) Tempo documentation. +TraceQL requires Tempo’s Parquet columnar format to be enabled. For information on enabling Parquet, refer to the [Apache Parquet backend]({{< relref "..//configuration/parquet" >}}) Tempo documentation. ## TraceQL query editor -With Tempo 2.0, you can use the TraceQL query editor in the Tempo data source to build queries and drill-down into result sets. The editor is available in Grafana’s Explore interface. For more information, refer to [TraceQL query editor]({{< relref "query-editor" >}}). +With Tempo 2.0, you can use the TraceQL query editor in the Tempo data source to build queries and drill-down into result sets. The editor is available in Grafana’s Explore interface. For more information, refer to [TraceQL query editor]({{< relref "./query-editor" >}}).
[Figure: Query editor showing request for http.method]
@@ -100,7 +100,7 @@ Find any database connection string that goes to a Postgres or MySQL database: ### Unscoped attribute fields -Attributes can be unscoped if you are unsure if the requested attribute exists on the span or resource. When possible, use scoped instead of unscoped attributes. Scoped attributes provide faster query results. +Attributes can be unscoped if you are unsure if the requested attribute exists on the span or resource. When possible, use scoped instead of unscoped attributes. Scoped attributes provide faster query results. For example, to find traces with an attribute of `sla` set to `critical`: ``` @@ -189,7 +189,7 @@ So far, all of the example queries expressions have been about individual spans. - `min` - The min value of a given numeric attribute or intrinsic for a spanset. - `sum` - The sum value of a given numeric attribute or intrinsic for a spanset. -Aggregate functions allow you to carry out operations on matching results to further refine the traces returned. For more information on planned future work, refer to [How TraceQL works]({{< relref "architecture" >}}). +Aggregate functions allow you to carry out operations on matching results to further refine the traces returned. For more information on planned future work, refer to [How TraceQL works]({{< relref "./architecture" >}}). For example, to find traces where the total number of spans is greater than `10`: @@ -237,8 +237,8 @@ When using the same Grafana stack for multiple environments (e.g., `production` ``` { resource.service.namespace = "ecommerce" && - resource.service.name = "frontend" && - resource.deployment.environment = "production" && + resource.service.name = "frontend" && + resource.deployment.environment = "production" && name = "POST /api/orders" } ``` @@ -249,8 +249,8 @@ This example finds all traces on the operation `POST /api/orders` that have an e ``` { - resource.service.name="frontend" && - name = "POST /api/orders" && + resource.service.name="frontend" && + name = "POST /api/orders" && status = error } ``` @@ -259,8 +259,8 @@ This example finds all traces on the operation `POST /api/orders` that return wi ``` { - resource.service.name="frontend" && - name = "POST /api/orders" && + resource.service.name="frontend" && + name = "POST /api/orders" && span.http.status_code >= 500 } ``` @@ -276,8 +276,8 @@ This example locates all the traces of the `GET /api/products/{id}` operation th ### Find traces going through `production` and `staging` instances -This example finds traces that go through `production` and `staging` instances. -It's a convenient request to identify misconfigurations and leaks across production and non-production environments. +This example finds traces that go through `production` and `staging` instances. +It's a convenient request to identify misconfigurations and leaks across production and non-production environments. 
``` { resource.deployment.environment = "production" } && { resource.deployment.environment = "staging" } @@ -302,3 +302,9 @@ Find any trace where any span has an `http.method` attribute set to `GET` as wel ``` { span.http.method = "GET" && status = ok } && { span.http.method = "DELETE" && status != ok } ``` + +Find any trace with a `deployment.environment` attribute that matches the regex `prod-.*` and `http.status_code` attribute set to `200`: + +``` +{ resource.deployment.environment =~ "prod-.*" && span.http.status_code = 200 } +``` diff --git a/docs/sources/tempo/traces.md b/docs/sources/tempo/traces.md index ce7207d9ee5..db2e65e4929 100644 --- a/docs/sources/tempo/traces.md +++ b/docs/sources/tempo/traces.md @@ -49,8 +49,37 @@ Traces can help you find bottlenecks. A trace can be visualized to give a graphi Metrics, logs, and traces form the three pillars of observability. Metrics provide the health data about the state of a system. Logs provide an audit trail of activity that create an informational context. Traces tell you what happens at each step or action in a data pathway. +## Tracing versus profiling + +Tracing provides an overview of tasks performed by an operation or set of work. +Profiling provides a code-level view of what was going on. +Generally, tracing is done at a much higher level specific to one transaction, and profiling is sampled over time, aggregated over many transactions. + +The superpower of tracing is seeing how a thing in one program invoked another program. + +The superpower of profiling is seeing function-level or line-level detail. + +For example, let’s say you want to gather trace data on how long it takes to enter and start a car. The trace would contain multiple spans: + +- Walking from the resident to the car +- Unlocking the car +- Adjusting the seat +- Starting the ignition + +This trace data is collected every time the car is entered and started. +You can track variations between each operation that can help pinpoint when issues happen. +If the driver forgot their keys, then that would show up as an outlying longer duration span. +In this same example, profiling gives the code stack, in minute detail: get-to-car invoked step-forward, which invoked lift-foot, which invoked contract-muscle, etc. +This extra detail provides the context that informs the data provided by a trace. + ## Terminology +Active series +: A time series that receives new data points or samples. + +Cardinality +: The total combination of key/value pairs, such as labels and label values for a given metric series or log stream, and how many unique combinations they generate. + Data source : A basic storage for data such as a database, a flat file, or even live references or measurements from a device. A file, database, or service that provides data. For example, traces data is imported into Grafana by configuring and enabling a Tempo data source.