
Different metrics lifetime and garbage collection (forgetting outdated telemetry) #495

Open
3g0r opened this issue Jun 25, 2024 · 4 comments
Labels
C-exporter Component: exporters such as Prometheus, TCP, etc. E-intermediate Effort: intermediate. T-ergonomics Type: ergonomics.

Comments


3g0r commented Jun 25, 2024

Hi. In my case I have many spawned tokio tasks that need to be measured.
The measurements for these tasks are unique by their labels, and once a task completes I have to remove its measurements from the metrics registry to prevent a memory leak.
At the same time, I need to keep the metric COUNT_OF_ACTIVE_TASKS available for as long as my program runs.

Right now I can't find any way to solve this problem with the current API.

builder.idle_timeout looks promising, but I have no guarantees about the interval between newly spawned tasks, so COUNT_OF_ACTIVE_TASKS could be deleted at any moment and its state forgotten.

Can anyone tell me how to solve this problem without writing the absolute value to COUNT_OF_ACTIVE_TASKS on a timeout in an infinite loop? 😂
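To make the race concrete, here is a minimal std-only sketch of an idle-timeout registry. This is a toy model, NOT the real `metrics` crate API: any metric not updated within `idle_timeout` gets evicted on the next sweep, which is exactly why a long-lived but rarely-updated gauge like COUNT_OF_ACTIVE_TASKS is unsafe under such a policy.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy model of an idle-timeout registry (hypothetical, not the real
/// `metrics` crate): metrics are keyed by name+labels and carry a
/// last-touched timestamp.
struct ToyRegistry {
    idle_timeout: Duration,
    metrics: HashMap<String, (f64, Instant)>,
}

impl ToyRegistry {
    fn new(idle_timeout: Duration) -> Self {
        Self { idle_timeout, metrics: HashMap::new() }
    }

    /// Record a value and refresh the metric's last-touched timestamp.
    fn record(&mut self, key: &str, value: f64, now: Instant) {
        self.metrics.insert(key.to_string(), (value, now));
    }

    /// Evict every metric whose last update is older than `idle_timeout`.
    fn evict_idle(&mut self, now: Instant) {
        let timeout = self.idle_timeout;
        self.metrics
            .retain(|_, (_, touched)| now.duration_since(*touched) < timeout);
    }

    fn get(&self, key: &str) -> Option<f64> {
        self.metrics.get(key).map(|(value, _)| *value)
    }
}
```

After a quiet period longer than the timeout, a sweep drops the per-task metric (desired) and the persistent gauge (not desired) alike, because the policy cannot tell them apart.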


3g0r commented Jun 27, 2024

I was also thinking about collecting metrics from TCP connections: in general, we have no guarantees about the intervals between packets.
For example, if we count the number of bits sent, but the protocol between client and server has no ping messages, we risk forgetting the metric's state whenever the client and server stay silent for a long time.

So I think we really do need to extend the API. For example, add a ::mark_as_outdated() method to counter/histogram/gauge, or extend the recorder API with ::remove_<metric kind>() methods, or give users direct access to the registry.
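The suggestions above are easier to compare with a concrete shape in front of us. Below is a minimal sketch of the second option, explicit removal on the recorder, using a toy recorder; the trait and the `remove_counter`/`remove_gauge` names are taken from this comment's proposal and do not exist in the real `metrics` crate.

```rust
use std::collections::HashMap;

/// Hypothetical extension sketched from this comment's proposal:
/// explicit, caller-driven removal. Not part of the real `metrics` crate.
trait RemoveMetrics {
    fn remove_counter(&mut self, key: &str) -> bool;
    fn remove_gauge(&mut self, key: &str) -> bool;
}

#[derive(Default)]
struct ToyRecorder {
    counters: HashMap<String, u64>,
    gauges: HashMap<String, f64>,
}

impl ToyRecorder {
    fn increment_counter(&mut self, key: &str, by: u64) {
        *self.counters.entry(key.to_string()).or_insert(0) += by;
    }

    fn set_gauge(&mut self, key: &str, value: f64) {
        self.gauges.insert(key.to_string(), value);
    }
}

impl RemoveMetrics for ToyRecorder {
    /// Returns true if the metric existed and was removed.
    fn remove_counter(&mut self, key: &str) -> bool {
        self.counters.remove(key).is_some()
    }

    fn remove_gauge(&mut self, key: &str) -> bool {
        self.gauges.remove(key).is_some()
    }
}
```

With this shape, a finished task removes exactly its own labeled metrics, and COUNT_OF_ACTIVE_TASKS is never touched, because no timeout-based policy is involved.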


tobz commented Jun 28, 2024

Yeah, in general, there's no good ergonomic way to let callers (the parts of the code actually emitting the metrics) control when those metrics go away.

This will likely need to be solved through whatever we do to fix #314, since fixing that allows for a better separation between "this metric is no longer live at all" and "this metric hasn't been updated in a while and I want to stop showing it".

@tobz tobz added C-exporter Component: exporters such as Prometheus, TCP, etc. E-intermediate Effort: intermediate. T-ergonomics Type: ergonomics. labels Jun 28, 2024

3g0r commented Jun 28, 2024

"this metric hasn't been updated in a while and I want to stop showing it".

Do we really need this feature?

I think that if we suppress some measurements, Prometheus can't collect them, and Grafana will simply render gaps for the time slots during which those measurements are suppressed in our app.

Maybe we can make this simpler by just delegating the responsibility for deleting metrics to users?

At least in other programming languages I have been happy with such an API so far.


tobz commented Jun 28, 2024

We're not going to change the core Recorder API to allow arbitrarily marking a metric as done/outdated/expired.

As for wanting to stop showing idle metrics: it's absolutely something people want and request. It's very useful to avoid removing a metric the moment it stops being used, and instead remove it only after a long enough period of inactivity, in order to avoid sparse reporting.
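The distinction being drawn here ("no longer live at all" vs. "idle, stop showing it") can be made concrete with a toy exporter: an idle metric is merely omitted from the export snapshot while its state survives, whereas eviction destroys the state. A std-only sketch under that assumption, not the real `metrics` crate or any real exporter API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy exporter illustrating "stop showing" vs. "remove": counters idle
/// for longer than `idle_hide` are omitted from the snapshot, but their
/// state is kept, so a later update resumes from the old value.
struct ToyExporter {
    idle_hide: Duration,
    counters: HashMap<String, (u64, Instant)>,
}

impl ToyExporter {
    fn new(idle_hide: Duration) -> Self {
        Self { idle_hide, counters: HashMap::new() }
    }

    fn increment(&mut self, key: &str, by: u64, now: Instant) {
        let entry = self.counters.entry(key.to_string()).or_insert((0, now));
        entry.0 += by;
        entry.1 = now;
    }

    /// Export only metrics updated within `idle_hide`; idle metrics are
    /// hidden from this scrape but NOT forgotten.
    fn snapshot(&self, now: Instant) -> HashMap<String, u64> {
        self.counters
            .iter()
            .filter(|(_, (_, touched))| now.duration_since(*touched) < self.idle_hide)
            .map(|(k, (v, _))| (k.clone(), *v))
            .collect()
    }
}
```

Under this model a long-silent TCP connection's counter disappears from scrapes (Grafana shows a gap) yet keeps its running total, which is the separation the fix for #314 would need to support.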
