diff --git a/docs/sources/tempo/troubleshooting/_index.md b/docs/sources/tempo/troubleshooting/_index.md index 7d22978b758..752db1d458b 100644 --- a/docs/sources/tempo/troubleshooting/_index.md +++ b/docs/sources/tempo/troubleshooting/_index.md @@ -26,7 +26,8 @@ In addition, the [Tempo runbook](https://github.com/grafana/tempo/blob/main/oper - [Queries fail with 500 and "error using pageFinder"]({{< relref "./bad-blocks" >}}) - [I can search traces, but there are no service name or span name values available]({{< relref "./search-tag" >}}) - [Error message `response larger than the max ( vs )`]({{< relref "./response-too-large" >}}) +- [Search results don't match trace lookup results with long-running traces]({{< relref "./long-running-traces" >}}) ## Metrics Generator -- [Metrics or service graphs seem incomplete]({{< relref "./metrics-generator" >}}) \ No newline at end of file +- [Metrics or service graphs seem incomplete]({{< relref "./metrics-generator" >}}) diff --git a/docs/sources/tempo/troubleshooting/long-running-traces.md b/docs/sources/tempo/troubleshooting/long-running-traces.md new file mode 100644 index 00000000000..2c58f694cfc --- /dev/null +++ b/docs/sources/tempo/troubleshooting/long-running-traces.md @@ -0,0 +1,62 @@ +--- +title: Long-running traces +description: Troubleshoot search results when using long-running traces +weight: 479 +aliases: + - ../operations/troubleshooting/long-running-traces/ +--- + +# Long-running traces + +Long-running traces are created when Tempo receives spans for a trace, +followed by a delay, and then Tempo receives additional spans for the same +trace. If the delay between spans is great enough, the spans end up in +different blocks, which can lead to inconsistency in a few ways: + +1. When using TraceQL search, the duration information only pertains to a + subset of the blocks that contain a trace. This happens because Tempo + consults only enough blocks to know the TraceID of the matching spans. When + performing a TraceID lookup, Tempo searches for all parts of a trace in all + matching blocks, which yields greater accuracy when combined. + +1. When using [`spanset` + operators](https://grafana.com/docs/tempo/latest/traceql/#combining-spansets), + Tempo only evaluates the contiguous trace of the current block. This means + that for a single block the conditions may evaluate to false, but to + consider all parts of the trace from all blocks would evaluate true. + +You can tune the `ingester.trace_idle_period` configuration to allow for +greater control about when traces are written to a block. Extending this beyond +the default `10s` can allow for long running trace to be co-located in the same +block, but take into account other considerations around memory consumption on +the ingesters. Currently this setting isn't per-tenant, and so adjusting +affects all ingester instances. + +Tempo publishes a `tempo_warnings_total` metric from several components, which +can aid in understanding when this situation arises. In particular, the following query can be used to know what percentage of traces which are flushed to the wall are connected. + +``` +1 - sum(rate(tempo_warnings_total{reason="disconnected_trace_flushed_to_wal"}[5m])) / sum(rate(tempo_ingester_traces_created_total{}[5m])) +``` + +If you have long-running traces, you may also be interested in the +`rootless_trace_flushed_to_wal` reason to know when a trace is flushed to the +wall without a root trace. + +You can use `reason` fields for discovery with this query: + +``` +sum(rate(tempo_warnings_total{}[5m])) by (reason) +``` + +In general, Tempo functions at its peak when all parts of a trace are stored +within as few blocks as possible. There is a wide variety of tracing patterns +in the wild, which makes it impossible to optimize for all of them. + +While the preceding information can help determine what Tempo is doing, it may +be worth modifying the usage pattern slightly. For example, you may want to use +[span +links](https://opentelemetry.io/docs/concepts/signals/traces/#span-links), so +that traces are split up, allowing one trace to complete, while pointing to the +next trace in the causal chain . This allows both traces to finish in a +shorter duration, and increase the chances of ending up in the same block.