From e00a0b052cb45c5106a97181dd2f6b2a57214884 Mon Sep 17 00:00:00 2001 From: Kim Nylander Date: Thu, 22 Jun 2023 14:21:58 -0400 Subject: [PATCH 1/4] Updates from backport 26563 --- .../tempo/metrics-generator/active-series.md | 4 +- .../tempo/operations/best-practices.md | 60 ++++++------------- docs/sources/tempo/traces.md | 29 +++++++++ 3 files changed, 48 insertions(+), 45 deletions(-) diff --git a/docs/sources/tempo/metrics-generator/active-series.md b/docs/sources/tempo/metrics-generator/active-series.md index 36f75721531..33776a5214b 100644 --- a/docs/sources/tempo/metrics-generator/active-series.md +++ b/docs/sources/tempo/metrics-generator/active-series.md @@ -9,12 +9,12 @@ weight: 100 # Active series -An active series is a time series that receives new data points or samples. When you stop writing new datapoints to a time series, shortly afterwards it is no longer considered active. +An active series is a time series that receives new data points or samples. When you stop writing new data points to a time series, shortly afterwards it is no longer considered active. Metrics generated by Tempo's metrics generator can provide both RED (Rate/Error/Duration) metrics and interdependency graphs between services in a trace (the Service Graph functionality in Grafana). These capabilities rely on a set of generated span metrics and service metrics. -Any spans that are ingested by Tempo could potentially create up to 13 metrics. However, this doesn't mean that every time a span is ingested that a new active series is created. +Any spans that are ingested by Tempo can create many metric series. However, this doesn't mean that a new active series is created every time a span is ingested. The number of active series generated depends on the label pairs generated from span data that are associated with the metrics, similar to other Prometheus-formatted data.
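The relationship described above — active series counting distinct label combinations rather than ingested spans — can be illustrated with a small sketch. This is a hypothetical model, not Tempo's metrics-generator code; the label keys and span values below are invented for illustration:

```python
# Rough sketch (not Tempo's implementation): the number of active series for a
# span-derived metric equals the number of distinct label combinations seen,
# not the number of spans ingested.

def count_active_series(spans, label_keys):
    """Count unique label-value combinations across ingested spans."""
    return len({tuple(span.get(k) for k in label_keys) for span in spans})

spans = [
    {"service": "checkout", "span_name": "HTTP POST", "status_code": "200"},
    {"service": "checkout", "span_name": "HTTP POST", "status_code": "200"},  # same series as above
    {"service": "checkout", "span_name": "HTTP POST", "status_code": "500"},
    {"service": "payments", "span_name": "charge", "status_code": "200"},
]

# Four spans, but only three distinct label combinations -> three active series.
print(count_active_series(spans, ["service", "span_name", "status_code"]))  # 3
```

Repeated spans with identical label values reuse an existing series, which is why ingest volume and active-series count can diverge sharply.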
diff --git a/docs/sources/tempo/operations/best-practices.md b/docs/sources/tempo/operations/best-practices.md index f660e93cbe2..5218445210c 100644 --- a/docs/sources/tempo/operations/best-practices.md +++ b/docs/sources/tempo/operations/best-practices.md @@ -11,66 +11,40 @@ This page provides some general best practices for tracing. ## Span and resource attributes -Traces are built from spans. Spans are constructed primarily of span and resource attributes. +[Traces]({{< relref "../traces" >}}) are built from spans, which denote units of work such as a call to, or from, an upstream service. Spans are constructed primarily of span and resource attributes. +Spans also have a hierarchy, where parent spans can have children or siblings. -A **span attribute** is a key/value pair that exposes context for the span that it exists within. For example, if the span deals with calling another service via HTTP, it could include the HTTP URL (maybe as the span attribute key `http.url`) and the HTTP status code returned (as the span attribute `http.status_code`). Span attributes can consist of varying, non-null types. +In the screenshot below, the left side of the screen (1) shows the list of results for the query. The right side (2) lists each span that makes up the selected trace. -Unlike a span attribute, a **resource attribute** is a key/value pair that is concerned around the context of the manner in which the span was collected. -For example, this could a set of resource attributes concerning a Kubernetes cluster, in which case you may see resource attributes, for example: `k8s.namespace`, `k8s.container_name`, and `k8s.cluster`. -These can also include information on the libraries that were used to instrument the spans for a trace, or any other infrastructure information. +

Trace example

-For more information, read the [Attribute and Resource](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md) sections in the OpenTelemetry specification. +A **span attribute** is a key/value pair that provides context for its span. For example, if the span deals with calling another service via HTTP, an attribute could include the HTTP URL (maybe as the span attribute key `http.url`) and the HTTP status code returned (as the span attribute `http.status_code`). Span attributes can consist of varying, non-null types. + +Unlike a span attribute, a **resource attribute** is a key/value pair that describes the context of how the span was collected. Generally, these attributes describe the process that created the span. +For example, this could be a set of resource attributes concerning a Kubernetes cluster, in which case you may see resource attributes, for example: `k8s.namespace`, `k8s.container_name`, and `k8s.cluster`. +These can also include information on the libraries that were used to instrument the spans for a trace, or any other infrastructure information. +For more information, read the [Attribute and Resource](https://opentelemetry.io/docs/specs/otel/overview/) sections in the OpenTelemetry specification. ### Naming conventions for span and resource attributes -When naming attributes, it is best to use consistent, nested namespaces. -This ensures that attribute keys will be obvious to anyone observing the spans of a trace, and that common attributes can be shared by spans. -Using our example from above, the `http` prefix of the attribute is the namespace, with `url` and `status_code` being keys within that namespace. -These can also be nested, for example `http.url.protocol` might be `HTTP` or `HTTPS`, whereas `http.url.path` might be `/api/v1/query`. 
+When naming attributes, use consistent, nested namespaces to ensure that attribute keys are obvious to anyone observing the spans of a trace and that common attributes can be shared by spans. +Using our example from above, the `http` prefix of the attribute is the namespace, and `url` and `status_code` are keys within that namespace. +Attributes can also be nested, for example `http.url.protocol` might be `HTTP` or `HTTPS`, whereas `http.url.path` might be `/api/v1/query`. -There are more details around semantic naming conventions which should be followed at the following link: https://opentelemetry.io/docs/specs/otel/common/attribute-naming/#recommendations-for-opentelemetry-authors +For more details around semantic naming conventions, refer to the [Recommendations for OpenTelemetry Authors](https://opentelemetry.io/docs/specs/otel/common/attribute-naming/#recommendations-for-opentelemetry-authors) documentation. -Some third-party libraries already provide auto-instrumentation that generate span and span attributes when included in a source base. -This alleviates the need for you to add spans and attributes for calling those libraries. +Some third-party libraries provide auto-instrumentation that generates spans and span attributes when included in a source base. For more information about instrumenting your app for tracing, refer to the [Instrument for distributed tracing](/docs/tempo/latest/getting-started/instrumentation/) documentation. - ## Determining where to add spans -Spans make up a trace, where a trace is essentially just a meta ID that groups spans together. -Spans themselves denote units of work. This could be something that carries out some work within a service, or it could be a call from, or to, another service that is upstream or downstream. -Spans also have a hierarchy, where parent spans can have children or siblings.
- When instrumenting, determine the smallest piece of work that you need to observe in a trace to be of value to ensure that you don’t over (or under) instrument. -In general, when manually instrumenting, create a new span for any work that has a relatively significant duration. This allows the observation of a trace to immediately show where significant amounts of time are spent during the processing of a request into your application or system. +Creating a new span for any work that has a relatively significant duration allows the observation of a trace to immediately show where significant amounts of time are spent during the processing of a request into your application or system. For example, adding a span for a call to another service (either instrumented or not) may take an unknown amount of time to complete, and therefore being able to separate this work shows when services are taking longer than expected. Adding a span for a piece of work that might call many other functions in a loop is a good signal of how long that loop is taking (you might add a span attribute that counts how many times the loop runs to determine if the duration is acceptable). -However, adding a span for each method or function call in that loop might not, as it might produce hundreds or thousands of spans that are essentially of no individual value. - -## Tracing versus profiling - -Tracing provides an overview of tasks performed by an operation or set of work. -Profiling provides a code-level view of what was going on. -Generally, tracing is done at a much higher level specific to one transaction, and profiling is sampled over time, aggregated over many transactions. - -The superpower of tracing is seeing how a thing in one program invoked another program. - -The superpower of profiling is seeing function-level or line-level detail. - -For example, let’s say you want to gather trace data on how long it takes to enter and start a car.
The trace would contain multiple spans: - -- Walking from the resident to the car -- Unlocking the car -- Adjusting the seat -- Starting the ignition - -This trace data is collected every time the car is entered and started. -You can track variations between each operation that can help pinpoint when issues happen. -If the driver forgot their keys, then that would show up as an outlying longer duration span. -In this same example, profiling gives the code stack, in minute detail: get-to-car invoked step-forward, which invoked lift-foot, which invoked contract-muscle, etc. -This extra detail provides the context that informs the data provided by a trace. +However, adding a span for each method or function call in that loop might not, as it might produce hundreds or thousands of worthless spans. \ No newline at end of file diff --git a/docs/sources/tempo/traces.md b/docs/sources/tempo/traces.md index ce7207d9ee5..db2e65e4929 100644 --- a/docs/sources/tempo/traces.md +++ b/docs/sources/tempo/traces.md @@ -49,8 +49,37 @@ Traces can help you find bottlenecks. A trace can be visualized to give a graphi Metrics, logs, and traces form the three pillars of observability. Metrics provide the health data about the state of a system. Logs provide an audit trail of activity that create an informational context. Traces tell you what happens at each step or action in a data pathway. +## Tracing versus profiling + +Tracing provides an overview of tasks performed by an operation or set of work. +Profiling provides a code-level view of what was going on. +Generally, tracing is done at a much higher level specific to one transaction, and profiling is sampled over time, aggregated over many transactions. + +The superpower of tracing is seeing how a thing in one program invoked another program. + +The superpower of profiling is seeing function-level or line-level detail. + +For example, let’s say you want to gather trace data on how long it takes to enter and start a car. 
The trace would contain multiple spans: + +- Walking from the residence to the car +- Unlocking the car +- Adjusting the seat +- Starting the ignition + +This trace data is collected every time the car is entered and started. +You can track variations between each operation that can help pinpoint when issues happen. +If the driver forgot their keys, then that would show up as an outlying longer duration span. +In this same example, profiling gives the code stack, in minute detail: get-to-car invoked step-forward, which invoked lift-foot, which invoked contract-muscle, etc. +This extra detail provides the context that informs the data provided by a trace. + ## Terminology +Active series +: A time series that receives new data points or samples. + +Cardinality +: The total combination of key/value pairs, such as labels and label values for a given metric series or log stream, and how many unique combinations they generate. + Data source : A basic storage for data such as a database, a flat file, or even live references or measurements from a device. A file, database, or service that provides data. For example, traces data is imported into Grafana by configuring and enabling a Tempo data source. From f5ffb3c37c944585c9f735c37e6d4b0a6420627f Mon Sep 17 00:00:00 2001 From: Kim Nylander Date: Fri, 23 Jun 2023 14:33:11 -0400 Subject: [PATCH 2/4] Added metrics_ingestion_time_range_slack --- docs/sources/tempo/configuration/_index.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/docs/sources/tempo/configuration/_index.md b/docs/sources/tempo/configuration/_index.md index 715a8508f80..44878e58158 100644 --- a/docs/sources/tempo/configuration/_index.md +++ b/docs/sources/tempo/configuration/_index.md @@ -243,8 +243,14 @@ ingester: For more information on configuration options, see [here](https://github.com/grafana/tempo/blob/main/modules/generator/config.go).
The metrics-generator processes spans and writes metrics using the Prometheus remote write protocol. +For more information on the metrics-generator, refer to the [Metrics-generator documentation]({{<> relref "../metrics-generator" >}}). + +Metrics-generator processors are disabled by default. To enable them for a specific tenant, set `metrics_generator_processors` in the [overrides](#overrides) section. + +You can limit spans with start times that occur within a configured duration to be considered in metrics generation using `metrics_ingestion_time_range_slack`. +In Grafana Cloud, this value defaults to 30 seconds so all spans sent to the metrics-generator more than 30 seconds in the past are discarded or rejected. + + -Metrics-generator processors are disabled by default. To enable it for a specific tenant set `metrics_generator_processors` in the [overrides](#overrides) section. ```yaml # Metrics-generator configuration block @@ -361,8 +367,8 @@ metrics_generator: [- ] # This option only allows spans with start time that occur within the configured duration to be - # considered in metrics generation - # This is to filter out spans that are outdated + # considered in metrics generation. + # This is to filter out spans that are outdated.
[metrics_ingestion_time_range_slack: | default = 30s] ``` From e583f92e2fc818d72a94e7402257f17105667f37 Mon Sep 17 00:00:00 2001 From: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com> Date: Fri, 23 Jun 2023 17:42:23 -0400 Subject: [PATCH 3/4] Apply suggestions from code review --- docs/sources/tempo/configuration/_index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/sources/tempo/configuration/_index.md b/docs/sources/tempo/configuration/_index.md index 44878e58158..93fe990de7b 100644 --- a/docs/sources/tempo/configuration/_index.md +++ b/docs/sources/tempo/configuration/_index.md @@ -243,7 +243,7 @@ ingester: For more information on configuration options, see [here](https://github.com/grafana/tempo/blob/main/modules/generator/config.go). The metrics-generator processes spans and writes metrics using the Prometheus remote write protocol. -For more information on the metrics-generator, refer to the [Metrics-generator documentation]({{<> relref "../metrics-generator" >}}). +For more information on the metrics-generator, refer to the [Metrics-generator documentation]({{< relref "../metrics-generator" >}}). Metrics-generator processors are disabled by default. To enable them for a specific tenant, set `metrics_generator_processors` in the [overrides](#overrides) section. @@ -366,7 +366,7 @@ metrics_generator: remote_write: [- ] - # This option only allows spans with start time that occur within the configured duration to be + # This option only allows spans with end times that occur within the configured duration to be # considered in metrics generation. # This is to filter out spans that are outdated.
[metrics_ingestion_time_range_slack: | default = 30s] From 0b1a552e722d02d2f37a6f4334a0947c090a589b Mon Sep 17 00:00:00 2001 From: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com> Date: Fri, 23 Jun 2023 17:42:49 -0400 Subject: [PATCH 4/4] Update docs/sources/tempo/configuration/_index.md --- docs/sources/tempo/configuration/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/sources/tempo/configuration/_index.md b/docs/sources/tempo/configuration/_index.md index 93fe990de7b..57472d6da12 100644 --- a/docs/sources/tempo/configuration/_index.md +++ b/docs/sources/tempo/configuration/_index.md @@ -247,7 +247,7 @@ For more information on the metrics-generator, refer to the [Metrics-generator d Metrics-generator processors are disabled by default. To enable them for a specific tenant, set `metrics_generator_processors` in the [overrides](#overrides) section. -You can limit spans with start times that occur within a configured duration to be considered in metrics generation using `metrics_ingestion_time_range_slack`. +You can limit spans with end times that occur within a configured duration to be considered in metrics generation using `metrics_ingestion_time_range_slack`. In Grafana Cloud, this value defaults to 30 seconds so all spans sent to the metrics-generator more than 30 seconds in the past are discarded or rejected.
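The `metrics_ingestion_time_range_slack` behavior described in these patches can be sketched as follows. This is a simplified illustration, not Tempo's actual filtering code, and the function name is invented:

```python
# Simplified sketch (not Tempo's implementation) of a time-range slack filter:
# spans whose end times fall more than `slack_seconds` in the past are
# rejected before they are considered for metrics generation.

def within_slack(span_end_seconds, now_seconds, slack_seconds=30.0):
    """Return True if the span's end time is recent enough to be counted."""
    return (now_seconds - span_end_seconds) <= slack_seconds

now = 1_700_000_000.0
print(within_slack(now - 10, now))  # True: 10s old, inside the 30s slack
print(within_slack(now - 45, now))  # False: 45s old, rejected as outdated
```

Spans that arrive late (for example, after retries or batching delays upstream) would be silently excluded from generated metrics under this check, which is why the slack value matters when metrics appear to undercount traffic.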