POC: export profile metrics at compaction time #3718

Open · wants to merge 1 commit into main
Conversation

@alsoba13 (Contributor) commented on Nov 26, 2024:

Prerequisites

Exporting profile metrics at compaction time

This PoC shows how metrics could be exported from profiles at compaction time (in fact, we do this right after compaction, not during compaction).

Compaction eventually happens to every block in our object storage. This approach offers some benefits over exporting at ingestion time, as described by Tempo's maintainers:

  • At ingestion time, two different write operations in the write path (profile writes + metric writes) make error handling difficult, i.e. having to revert the first write after the second one fails. Doing this at compaction time lets us write profiles first and handle metrics later.
  • At ingestion time, we would need a new stateful component, a WAL or state holder for exporting. When we do it at compaction time, the data is already in object storage (our de facto WAL) by the time we export metrics from it.
  • At ingestion time, out-of-order (OOO) samples need to be handled differently (discarding them or using the current timestamp instead of the real one), while at compaction time we accept some range of OOO samples: we export data over a wider time range.

In theory, the first level of compaction (L0 blocks to L1 blocks) happens shortly after data ingestion (~10s). In practice, I've observed that L0 compaction happens every 30-120s. I don't know the reason for such a delay (maybe data ingestion is low and compaction happens less often? I only ingest data for 1 tenant with 2 services, roughly every 15s).

Generated metrics

Now that we have a prototype running, we can get a picture of what the generated metrics look like.

Dimensions

Every profile type or dimension is exported as a metric with this format:

pyroscope_exported_metrics_<profile_type>{...}

So, for example, if a service writes profile data with 3 different __profile_type__ values, we will export 3 different metrics (see the name-derivation sketch after the list):

  • process_cpu:cpu:nanoseconds:cpu:nanoseconds
    • exported as pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds
  • memory:alloc_objects:count:space:bytes
    • exported as pyroscope_exported_metrics_memory_alloc_objects_count_space_bytes
  • memory:alloc_space:bytes:space:bytes
    • exported as pyroscope_exported_metrics_memory_alloc_space_bytes_space_bytes
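
To illustrate the naming scheme (a sketch only, not the code in this PR; the exportedMetricName helper is hypothetical), the metric name can be derived by prefixing pyroscope_exported_metrics_ and replacing every character that is not allowed with an underscore:

package main

import (
    "fmt"
    "regexp"
)

// Matches any character we don't want to keep in the exported metric name.
var invalidMetricChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// exportedMetricName derives the exported metric name from a __profile_type__
// value. Hypothetical helper, shown only to illustrate the naming scheme.
func exportedMetricName(profileType string) string {
    return "pyroscope_exported_metrics_" + invalidMetricChars.ReplaceAllString(profileType, "_")
}

func main() {
    fmt.Println(exportedMetricName("memory:alloc_objects:count:space:bytes"))
    // Output: pyroscope_exported_metrics_memory_alloc_objects_count_space_bytes
}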

Labels are preserved, unrolling a new series for each label set. So we can query the CPU of a specific pod of a service, together with some other pprof label, like this:

pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds{service_name="my-service", 
pod="my-pod", my_pprof_label="some-value"}

Dimension metrics are exported for every tenant and every service_name, but this should be configurable by the user.

Functions

This prototype also explores the ability to export metrics for specific functions. We can choose an interesting function to export.

Right now, it exports data for every dimension of the given function in this format:

pyroscope_exported_metrics_functions_<profile_type>{function="exported-function", ...}

In this prototype I've hardcoded garbage collector and HTTP functions to export, for every service_name. I haven't made any distinction by tenant yet. The functions to export should come from config (a UI is a must here).

In the future, we could specify a filter of label sets instead of exporting by service_name. So, for example, "foo": "{}" would export every profile containing the function foo, and "foo": "{service_name=\"my-service\", vehicle=\"bike\"}" would export it only for that service_name and vehicle.
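
A minimal sketch of how such a rule could be evaluated, using the Prometheus selector parser; the functionExportRule type and the matches helper are assumptions for illustration, not part of this PR:

package rules

import (
    "github.com/prometheus/prometheus/model/labels"
    "github.com/prometheus/prometheus/promql/parser"
)

// functionExportRule says: export the function named Function for every
// profile whose series labels match Selector ("{}" matches everything).
type functionExportRule struct {
    Function string
    Selector string
}

// matches reports whether the series labels satisfy the rule's selector.
func matches(rule functionExportRule, seriesLabels labels.Labels) (bool, error) {
    // Treat an empty selector as "match everything".
    if rule.Selector == "" || rule.Selector == "{}" {
        return true, nil
    }
    matchers, err := parser.ParseMetricSelector(rule.Selector)
    if err != nil {
        return false, err
    }
    for _, m := range matchers {
        if !m.Matches(seriesLabels.Get(m.Name)) {
            return false, nil
        }
    }
    return true, nil
}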

Detected challenges

This naive solution is full of trade-offs and assumptions and is far from being a final solution. I've detected some challenges:

  • Exporting: we need to export data somewhere, and this should be configurable. I'm not sure how it works for Grafana Cloud, but we may need to grab credentials from grafana-com.
  • When to do it: I don't think exporting in the compaction worker at the end of every L0 compaction is the correct place for this. We could do it as a background task triggered by the metadata manager every time an L1 block has been created.
    • Edit: @kolesnikovae proposes doing it while compacting, not with a query after compaction has happened (my approach). Yet to explore how to work that out.
  • Error handling: I haven't done any error handling here. This naive solution is a best effort at compaction time, but error handling can be tricky: what happens if some L1 blocks aren't successfully processed? Can we stop L1->L2 compaction (done after 100s) until we retry metrics exporting? Can we do it later on L2 blocks? If so, how do we identify non-exported data inside the compacted L2 blocks?
  • Handling function metrics is inefficient: we don't have an index for functions to find them quickly.

DEMO

I have a Pyroscope instance with these changes running on my machine, exporting metrics to my Grafana Cloud instance.

Go grant yourself privileges in the admin page:

@alsoba13 requested a review from a team as a code owner on November 26, 2024 at 08:08
Comment on lines +306 to +324
functions := map[string]*queryv1.FunctionList{
    // TODO:
    // This must be richer. First, it should be split by tenant.
    // Also, we could have functions associated to service_name
    // while others are just collected generally no matter what
    // service_name we handle
    "pyroscope": {
        Functions: []string{
            "net/http.HandlerFunc.ServeHTTP",
            "runtime.gcBgMarkWorker",
        },
    },
    "ride-sharing-app": {
        Functions: []string{
            "net/http.HandlerFunc.ServeHTTP",
            "runtime.gcBgMarkWorker",
        },
    },
}
@alsoba13 (Author):

Metrics to export need to come from config.

This PoC exports every dimension/profile type for every tenant and service name, plus the functions configured in this map. This should all be configurable.
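
For illustration only, a hypothetical config shape (names and fields are assumptions, not the PR's actual configuration) that would replace the hardcoded map:

package config

// MetricsExportConfig is a hypothetical per-tenant configuration sketch for
// what to export: which dimensions (profile types) and which functions.
type MetricsExportConfig struct {
    // Tenants maps a tenant ID to its export rules.
    Tenants map[string]TenantExportRules `yaml:"tenants"`
}

type TenantExportRules struct {
    // ProfileTypes to export as dimension metrics; empty means "all".
    ProfileTypes []string `yaml:"profile_types"`
    // Functions to export, keyed by service_name; an empty key could mean
    // "for every service".
    Functions map[string][]string `yaml:"functions"`
}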

Comment on lines +326 to +345
reader := query_backend.NewBlockReader(w.logger, w.storage)
var res, _ = reader.Invoke(ctx,
    &queryv1.InvokeRequest{
        Tenant:    []string{c.TenantId},
        StartTime: c.MinTime,
        EndTime:   c.MaxTime,
        Query: []*queryv1.Query{{
            QueryType: queryv1.QueryType_QUERY_METRICS,
            Metrics: &queryv1.MetricsQuery{
                FunctionsByServiceName: functions,
            },
        }},
        QueryPlan: &queryv1.QueryPlan{
            Root: &queryv1.QueryNode{
                Blocks: []*metastorev1.BlockMeta{c},
            },
        },
        LabelSelector: "{}",
    },
)
@alsoba13 (Author):

A new query type, "Metrics", helps with this task.

@alsoba13 (Author):

Basically the same as Anton's exporter: #2899

Comment on lines +46 to +47
Username: "1741027",
Password: "omitted",
@alsoba13 (Author) commented on Nov 26, 2024:

This was exporting to my Grafana Cloud instance. But this should come from config: either a single target datasource for OSS, or integrated with Grafana Cloud stacks.
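
For reference, a minimal sketch of what pushing a sample to a configurable remote-write endpoint with basic auth could look like; the endpoint, credentials, labels, and helper name are placeholders, not code from this PR:

package export

import (
    "bytes"
    "context"
    "fmt"
    "net/http"

    "github.com/gogo/protobuf/proto"
    "github.com/golang/snappy"
    "github.com/prometheus/prometheus/prompb"
)

// pushSample sends a single sample to a Prometheus remote-write endpoint.
func pushSample(ctx context.Context, endpoint, user, pass string, ts int64, value float64) error {
    req := &prompb.WriteRequest{
        Timeseries: []prompb.TimeSeries{{
            Labels: []prompb.Label{
                {Name: "__name__", Value: "pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds"},
                {Name: "service_name", Value: "my-service"},
            },
            Samples: []prompb.Sample{{Value: value, Timestamp: ts}},
        }},
    }
    raw, err := proto.Marshal(req)
    if err != nil {
        return err
    }
    // Remote write expects a snappy-compressed protobuf payload.
    httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(snappy.Encode(nil, raw)))
    if err != nil {
        return err
    }
    httpReq.SetBasicAuth(user, pass)
    httpReq.Header.Set("Content-Encoding", "snappy")
    httpReq.Header.Set("Content-Type", "application/x-protobuf")
    httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
    resp, err := http.DefaultClient.Do(httpReq)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode/100 != 2 {
        return fmt.Errorf("remote write failed: %s", resp.Status)
    }
    return nil
}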

if len(by) == 0 {
    fp, err = reader.Series(postings.At(), &l, &chunks)
} else {
    fp, err = reader.SeriesBy(postings.At(), &l, &chunks, by...)
@alsoba13 (Author) commented on Nov 26, 2024:

This function was deleting my labels and was keeping me from fetching them, so I used Series instead when by is empty.

Comment on lines +325 to +327
for _, c := range compacted {
    reader := query_backend.NewBlockReader(w.logger, w.storage)
    var res, _ = reader.Invoke(ctx,
@kolesnikovae (Collaborator) commented on Nov 26, 2024:

This is quite an original solution. However, to my mind, the only reason we're implementing this in the compactor is that we don't want to query data. I might be missing something, but this is what we're actually doing here: querying our compacted blocks.

I propose an alternative approach:

  1. We inject the component that extracts data from samples into the block merge process here.
  2. For every label set we encounter while handling the stream (a unique series ID after the indexRewriter.rewriteRow call), we check whether it matches any of the rules (we still want to support label filters).
  3. If the series labels (ProfileEntry.Labels) match any rule, we handle grouping to get the labels we want to preserve in the output time series (metrics). The conjunction of the group-by label values and the rule name identifies the aggregator for the time series.
  4. We add function/call-chain filtering later. For now we focus on the core functionality.

Basically, something very similar to our downsampler, but simpler, as we don't need to handle writes to aux tables.
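
A rough sketch of that idea under assumed types (none of these names come from the actual compactor API): a component fed from the merge stream matches each series against rules and aggregates values per rule and group-by labels:

package observer

import (
    "github.com/prometheus/prometheus/model/labels"
)

// recordingRule is a hypothetical rule: which series to pick up and which
// labels to keep in the output time series.
type recordingRule struct {
    Name     string
    Matchers []*labels.Matcher // label filters the series must satisfy
    GroupBy  []string          // labels preserved in the exported series
}

// sampleObserver accumulates values per (rule, group-by labels, timestamp).
type sampleObserver struct {
    rules []recordingRule
    acc   map[string]map[int64]float64
}

// observe would be called for every row emitted by the block merge process.
func (o *sampleObserver) observe(seriesLabels labels.Labels, ts int64, value float64) {
    if o.acc == nil {
        o.acc = make(map[string]map[int64]float64)
    }
    for _, r := range o.rules {
        if !matchesAll(r.Matchers, seriesLabels) {
            continue
        }
        // The rule name plus the group-by label values identify the aggregator.
        key := r.Name
        for _, g := range r.GroupBy {
            key += ";" + g + "=" + seriesLabels.Get(g)
        }
        if o.acc[key] == nil {
            o.acc[key] = make(map[int64]float64)
        }
        o.acc[key][ts] += value
    }
}

func matchesAll(ms []*labels.Matcher, ls labels.Labels) bool {
    for _, m := range ms {
        if !m.Matches(ls.Get(m.Name)) {
            return false
        }
    }
    return true
}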


If we want to generate metrics with "weighted" values, where the values are fractions of the profile total (and we do, I believe), we need to do some more work. We need the profile total, not the row total: a row total covers a single series sample set, while a profile may include multiple series. To get a profile total, we need to aggregate row totals by the ProfileID column: every profile has a unique ID, and as of v2 it's guaranteed that all of a profile's samples are present in the same table, regardless of their labels.

I suggest that we add it later, together with function/call-chain filtering.
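
As a sketch of that aggregation under assumed types (rowTotal and weightedFractions are hypothetical): row totals are grouped by the ProfileID column to obtain the profile total, and each row's value becomes a fraction of it:

package observer

// rowTotal is a hypothetical view of a merged row: the total value of one
// series within one profile.
type rowTotal struct {
    ProfileID string
    SeriesID  uint64
    Total     float64
}

// weightedFractions returns each row's share of its profile's total.
func weightedFractions(rows []rowTotal) map[rowTotal]float64 {
    // First pass: sum row totals per ProfileID to get profile totals.
    profileTotals := make(map[string]float64, len(rows))
    for _, r := range rows {
        profileTotals[r.ProfileID] += r.Total
    }
    // Second pass: divide each row total by its profile total.
    out := make(map[rowTotal]float64, len(rows))
    for _, r := range rows {
        if t := profileTotals[r.ProfileID]; t > 0 {
            out[r] = r.Total / t
        }
    }
    return out
}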

@@ -83,6 +83,7 @@ message Query {
TimeSeriesQuery time_series = 5;
TreeQuery tree = 6;
PprofQuery pprof = 7;
MetricsQuery metrics = 8;
@kolesnikovae (Collaborator):

I think we can implement the feature without extending the query API.


labelsFromFp := make(map[uint64]phlaremodel.Labels)
builders := make(map[uint64]map[string]*phlaremodel.TimeSeriesBuilder)
for rows.Next() {
@kolesnikovae (Collaborator) commented on Nov 26, 2024:

I'm afraid this loop will suffer from very bad performance. Let's skip function filtering for now and add it later – we already have a stack trace filter that can handle that; we only need to integrate it.

@kolesnikovae (Collaborator) commented:

Regarding the compaction lag: it's entirely possible that L0 compaction is delayed because we only compact data once we've accumulated enough blocks. However, in the PR you mentioned, we added a parameter that controls how long a block may be staged.

We also introduced an indicator – time to compaction. In our dev env, L0 compaction lag does not exceed 1m and p99 is around 15-20 seconds.

[screenshot: time to compaction in the dev environment]

I think that relying on the "current" time might be dangerous – we could explore an option where we get the timestamp of the blocks (the time they were created). Also, I think that OOO ingestion is almost inevitable: jobs might run concurrently, and their order is not guaranteed (we don't need it for compaction); usually this is not an issue, but if a job fails and we retry (which we do), we will likely violate the order. Fortunately, both Mimir and Prometheus handle OOO samples (with some caveats).
