POC: export profile metrics at compaction time #3718

Open · wants to merge 1 commit into main
Conversation

@alsoba13 (Contributor) commented on Nov 26, 2024:

Prerequisites

Exporting profile metrics at compaction time

This PoC shows how metrics could be exported from profiles at compaction time (in fact, we do this right after compaction, not during compaction).

Compaction eventually happens to every block in our object storage. This approach offers some benefits over exporting at ingestion time, as described by Tempo's maintainers:

  • At ingestion time, two different write operations in the write path (profile writes + metric writes) make error handling difficult, i.e. having to revert the first write after the second one fails. Doing this at compaction time lets us write profiles first and handle metrics later.
  • At ingestion time, we would need a new stateful component, a WAL or state holder for exporting. When we do it at compaction time, the data is already in object storage (our de facto WAL) by the time we export metrics from it.
  • At ingestion time, out-of-order (OOO) samples need to be handled differently (discarding them or using the current timestamp instead of the real one), while at compaction time we accept some range of OOO samples: we export data over a wider time range.

In theory, the first level of compaction (L0 blocks to L1 blocks) happens shortly after data ingestion (~10s). In practice, I've observed that L0 compaction happens every 30-120s. I don't know the reason for such a delay (maybe data ingestion is low and compaction happens less often? I only ingest data for 1 tenant with 2 services, roughly every 15s).

Generated metrics

Now that we have a prototype running, we can get a picture of what the generated metrics look like.

Dimensions

Every profile type or dimension is exported as a metric with this format:

pyroscope_exported_metrics_<profile_type>{...}

So, for example, if a service writes profile data with 3 different __profile_type__ values, we will export 3 different metrics (see the name-derivation sketch after the list):

  • process_cpu:cpu:nanoseconds:cpu:nanoseconds
    • exported as pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds
  • memory:alloc_objects:count:space:bytes
    • exported as pyroscope_exported_metrics_memory_alloc_objects_count_space_bytes
  • memory:alloc_space:bytes:space:bytes
    • exported as pyroscope_exported_metrics_memory_alloc_space_bytes_space_bytes
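
To illustrate the naming scheme (a sketch only, not the code in this PR; the exportedMetricName helper is hypothetical), the metric name can be derived by prefixing pyroscope_exported_metrics_ and replacing every character that is not allowed with an underscore:

package main

import (
    "fmt"
    "regexp"
)

// Matches any character we don't want to keep in the exported metric name.
var invalidMetricChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// exportedMetricName derives the exported metric name from a __profile_type__
// value. Hypothetical helper, shown only to illustrate the naming scheme.
func exportedMetricName(profileType string) string {
    return "pyroscope_exported_metrics_" + invalidMetricChars.ReplaceAllString(profileType, "_")
}

func main() {
    fmt.Println(exportedMetricName("memory:alloc_objects:count:space:bytes"))
    // Output: pyroscope_exported_metrics_memory_alloc_objects_count_space_bytes
}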

Labels are preserved, unrolling a new series for each label set. So we can query the CPU of a specific pod of a service, together with some other pprof label, like this:

pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds{service_name="my-service", 
pod="my-pod", my_pprof_label="some-value"}

Dimension metrics are exported for every tenant and every service_name, but this should be configurable by the user.

Functions

This prototype also explores the ability to export metrics for specific functions. We can choose an interesting function to export.

Right now, it exports data for every dimension of the given function in this format:

pyroscope_exported_metrics_functions_<profile_type>{function="exported-function", ...}

In this prototype I've hardcoded garbage collector and HTTP functions to export, for every service_name. I haven't made any distinction by tenant yet. The functions to export should come from config (a UI is a must here).

In the future, we could specify a filter of label sets instead of exporting by service_name. So, for example, "foo": "{}" would export every profile containing the function foo, and "foo": "{service_name=\"my-service\", vehicle=\"bike\"}" would export it only for that service_name and vehicle.
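
A minimal sketch of how such a rule could be evaluated, using the Prometheus selector parser; the functionExportRule type and the matches helper are assumptions for illustration, not part of this PR:

package rules

import (
    "github.com/prometheus/prometheus/model/labels"
    "github.com/prometheus/prometheus/promql/parser"
)

// functionExportRule says: export the function named Function for every
// profile whose series labels match Selector ("{}" matches everything).
type functionExportRule struct {
    Function string
    Selector string
}

// matches reports whether the series labels satisfy the rule's selector.
func matches(rule functionExportRule, seriesLabels labels.Labels) (bool, error) {
    // Treat an empty selector as "match everything".
    if rule.Selector == "" || rule.Selector == "{}" {
        return true, nil
    }
    matchers, err := parser.ParseMetricSelector(rule.Selector)
    if err != nil {
        return false, err
    }
    for _, m := range matchers {
        if !m.Matches(seriesLabels.Get(m.Name)) {
            return false, nil
        }
    }
    return true, nil
}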

Detected challenges

This naive solution is full of trade-offs and assumptions and is far from being a final solution. I've detected some challenges:

  • Exporting: we need to export data somewhere, and this should be configurable. I'm not sure how it works for Grafana Cloud, but we may need to grab credentials from grafana-com.
  • When to do it: I don't think exporting in the compaction worker at the end of every L0 compaction is the correct place for this. We could do it as a background task triggered by the metadata manager every time an L1 block has been created.
    • Edit: @kolesnikovae proposes doing it while compacting, not with a query after compaction has happened (my approach). Yet to explore how to work that out.
  • Error handling: I haven't done any error handling here. This naive solution is a best effort at compaction time, but error handling can be tricky: what happens if some L1 blocks aren't successfully processed? Can we stop L1->L2 compaction (done after 100s) until we retry metrics exporting? Can we do it later on L2 blocks? If so, how do we identify non-exported data inside the compacted L2 blocks?
  • Handling function metrics is inefficient: we don't have an index for functions to find them quickly.

DEMO

I have a Pyroscope instance with these changes running on my machine, exporting metrics to my Grafana Cloud instance.

Go grant yourself privileges in the admin page:

@alsoba13 requested a review from a team as a code owner on November 26, 2024 at 08:08
Comment on lines +306 to +324
functions := map[string]*queryv1.FunctionList{
    // TODO:
    // This must be richer. First, it should be split by tenant.
    // Also, we could have functions associated to service_name
    // while others are just collected generally no matter what
    // service_name we handle
    "pyroscope": {
        Functions: []string{
            "net/http.HandlerFunc.ServeHTTP",
            "runtime.gcBgMarkWorker",
        },
    },
    "ride-sharing-app": {
        Functions: []string{
            "net/http.HandlerFunc.ServeHTTP",
            "runtime.gcBgMarkWorker",
        },
    },
}
@alsoba13 (Author):

Metrics to export need to come from config.

This PoC exports every dimension/profile type for every tenant and service name, plus the functions configured in this map. This should all be configurable.
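
For illustration only, a hypothetical config shape (names and fields are assumptions, not the PR's actual configuration) that would replace the hardcoded map:

package config

// MetricsExportConfig is a hypothetical per-tenant configuration sketch for
// what to export: which dimensions (profile types) and which functions.
type MetricsExportConfig struct {
    // Tenants maps a tenant ID to its export rules.
    Tenants map[string]TenantExportRules `yaml:"tenants"`
}

type TenantExportRules struct {
    // ProfileTypes to export as dimension metrics; empty means "all".
    ProfileTypes []string `yaml:"profile_types"`
    // Functions to export, keyed by service_name; an empty key could mean
    // "for every service".
    Functions map[string][]string `yaml:"functions"`
}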

Comment on lines +326 to +345
reader := query_backend.NewBlockReader(w.logger, w.storage)
var res, _ = reader.Invoke(ctx,
    &queryv1.InvokeRequest{
        Tenant:    []string{c.TenantId},
        StartTime: c.MinTime,
        EndTime:   c.MaxTime,
        Query: []*queryv1.Query{{
            QueryType: queryv1.QueryType_QUERY_METRICS,
            Metrics: &queryv1.MetricsQuery{
                FunctionsByServiceName: functions,
            },
        }},
        QueryPlan: &queryv1.QueryPlan{
            Root: &queryv1.QueryNode{
                Blocks: []*metastorev1.BlockMeta{c},
            },
        },
        LabelSelector: "{}",
    },
)
@alsoba13 (Author):

A new query type, "Metrics", helps with this task.

@alsoba13 (Author):

Basically the same as Anton's exporter: #2899

Comment on lines +46 to +47
Username: "1741027",
Password: "omitted",
@alsoba13 (Author) commented on Nov 26, 2024:

This was exporting to my Grafana Cloud instance. But this should come from config: either a single target datasource for OSS, or integrated with Grafana Cloud stacks.
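
For reference, a minimal sketch of what pushing a sample to a configurable remote-write endpoint with basic auth could look like; the endpoint, credentials, labels, and helper name are placeholders, not code from this PR:

package export

import (
    "bytes"
    "context"
    "fmt"
    "net/http"

    "github.com/gogo/protobuf/proto"
    "github.com/golang/snappy"
    "github.com/prometheus/prometheus/prompb"
)

// pushSample sends a single sample to a Prometheus remote-write endpoint.
func pushSample(ctx context.Context, endpoint, user, pass string, ts int64, value float64) error {
    req := &prompb.WriteRequest{
        Timeseries: []prompb.TimeSeries{{
            Labels: []prompb.Label{
                {Name: "__name__", Value: "pyroscope_exported_metrics_process_cpu_cpu_nanoseconds_cpu_nanoseconds"},
                {Name: "service_name", Value: "my-service"},
            },
            Samples: []prompb.Sample{{Value: value, Timestamp: ts}},
        }},
    }
    raw, err := proto.Marshal(req)
    if err != nil {
        return err
    }
    // Remote write expects a snappy-compressed protobuf payload.
    httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(snappy.Encode(nil, raw)))
    if err != nil {
        return err
    }
    httpReq.SetBasicAuth(user, pass)
    httpReq.Header.Set("Content-Encoding", "snappy")
    httpReq.Header.Set("Content-Type", "application/x-protobuf")
    httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
    resp, err := http.DefaultClient.Do(httpReq)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode/100 != 2 {
        return fmt.Errorf("remote write failed: %s", resp.Status)
    }
    return nil
}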

if len(by) == 0 {
    fp, err = reader.Series(postings.At(), &l, &chunks)
} else {
    fp, err = reader.SeriesBy(postings.At(), &l, &chunks, by...)
@alsoba13 (Author) commented on Nov 26, 2024:

This function was deleting my labels and was keeping me from fetching them, so I used Series instead when by is empty.

Comment on lines +325 to +327
for _, c := range compacted {
    reader := query_backend.NewBlockReader(w.logger, w.storage)
    var res, _ = reader.Invoke(ctx,
@kolesnikovae (Collaborator) commented on Nov 26, 2024:

This is quite an original solution. However, to my mind, the only reason we're implementing this in the compactor is that we don't want to query data. I might be missing something, but this is what we're actually doing here: querying our compacted blocks.

I propose an alternative approach:

  1. We inject the component that extracts data from samples into the block merge process here.
  2. For every label set we encounter while handling the stream (a unique series ID after the indexRewriter.rewriteRow call), we check whether it matches any of the rules (we still want to support label filters).
  3. If the series labels (ProfileEntry.Labels) match any rule, we handle grouping to get the labels we want to preserve in the output time series (metrics). The conjunction of the group-by label values and the rule name identifies the aggregator for the time series.
  4. We add function/call-chain filtering later. For now we focus on the core functionality.

Basically, something very similar to our downsampler, but simpler, as we don't need to handle writes to aux tables.
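
A rough sketch of that idea under assumed types (none of these names come from the actual compactor API): a component fed from the merge stream matches each series against rules and aggregates values per rule and group-by labels:

package observer

import (
    "github.com/prometheus/prometheus/model/labels"
)

// recordingRule is a hypothetical rule: which series to pick up and which
// labels to keep in the output time series.
type recordingRule struct {
    Name     string
    Matchers []*labels.Matcher // label filters the series must satisfy
    GroupBy  []string          // labels preserved in the exported series
}

// sampleObserver accumulates values per (rule, group-by labels, timestamp).
type sampleObserver struct {
    rules []recordingRule
    acc   map[string]map[int64]float64
}

// observe would be called for every row emitted by the block merge process.
func (o *sampleObserver) observe(seriesLabels labels.Labels, ts int64, value float64) {
    if o.acc == nil {
        o.acc = make(map[string]map[int64]float64)
    }
    for _, r := range o.rules {
        if !matchesAll(r.Matchers, seriesLabels) {
            continue
        }
        // The rule name plus the group-by label values identify the aggregator.
        key := r.Name
        for _, g := range r.GroupBy {
            key += ";" + g + "=" + seriesLabels.Get(g)
        }
        if o.acc[key] == nil {
            o.acc[key] = make(map[int64]float64)
        }
        o.acc[key][ts] += value
    }
}

func matchesAll(ms []*labels.Matcher, ls labels.Labels) bool {
    for _, m := range ms {
        if !m.Matches(ls.Get(m.Name)) {
            return false
        }
    }
    return true
}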


If we want to generate metrics with "weighted" values, where the values are fractions of the profile total (and we do, I believe), we need to do some more work. We need the profile total, not the row total: a row total covers a single series sample set, while a profile may include multiple series. To get a profile total, we need to aggregate row totals by the ProfileID column: every profile has a unique ID, and as of v2 it's guaranteed that all of a profile's samples are present in the same table, regardless of their labels.

I suggest that we add it later, together with function/call-chain filtering.
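
As a sketch of that aggregation under assumed types (rowTotal and weightedFractions are hypothetical): row totals are grouped by the ProfileID column to obtain the profile total, and each row's value becomes a fraction of it:

package observer

// rowTotal is a hypothetical view of a merged row: the total value of one
// series within one profile.
type rowTotal struct {
    ProfileID string
    SeriesID  uint64
    Total     float64
}

// weightedFractions returns each row's share of its profile's total.
func weightedFractions(rows []rowTotal) map[rowTotal]float64 {
    // First pass: sum row totals per ProfileID to get profile totals.
    profileTotals := make(map[string]float64, len(rows))
    for _, r := range rows {
        profileTotals[r.ProfileID] += r.Total
    }
    // Second pass: divide each row total by its profile total.
    out := make(map[rowTotal]float64, len(rows))
    for _, r := range rows {
        if t := profileTotals[r.ProfileID]; t > 0 {
            out[r] = r.Total / t
        }
    }
    return out
}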

@@ -83,6 +83,7 @@ message Query {
TimeSeriesQuery time_series = 5;
TreeQuery tree = 6;
PprofQuery pprof = 7;
MetricsQuery metrics = 8;
@kolesnikovae (Collaborator):

I think we can implement the feature without extending the query API.


labelsFromFp := make(map[uint64]phlaremodel.Labels)
builders := make(map[uint64]map[string]*phlaremodel.TimeSeriesBuilder)
for rows.Next() {
@kolesnikovae (Collaborator) commented on Nov 26, 2024:

I'm afraid this loop will suffer from very bad performance. Let's skip function filtering for now and add it later – we already have a stack trace filter that can handle that; we only need to integrate it.

@kolesnikovae (Collaborator) commented:

Regarding the compaction lag: it's entirely possible that L0 compaction is delayed because we only compact data once we've accumulated enough blocks. However, in the PR you mentioned, we added a parameter that controls how long a block may be staged.

We also introduced an indicator – time to compaction. In our dev env, L0 compaction lag does not exceed 1m and p99 is around 15-20 seconds.

[screenshot: time to compaction in the dev environment]

I think that relying on the "current" time might be dangerous – we could explore an option where we get the timestamp of the blocks (the time they were created). Also, I think that OOO ingestion is almost inevitable: jobs might run concurrently, and their order is not guaranteed (we don't need it for compaction); usually this is not an issue, but if a job fails and we retry (which we do), we will likely violate the order. Fortunately, both Mimir and Prometheus handle OOO samples (with some caveats).
