WIP Add batching for gauge metrics #57
base: main
Conversation
Signed-off-by: Matt Plachter <[email protected]>
Force-pushed from bbe7325 to bdbab36 (Compare)
@@ -24,6 +24,7 @@ func Execute() {
 	rootCmd.PersistentFlags().StringSliceVar(&cfg.AuthUsers, "AuthUsers", []string{}, "List of allowed auth users and their passwords comma separated\n Example: \"user1=pass1,user2=pass2\"")
 	rootCmd.PersistentFlags().StringVar(&cfg.ApiListen, "apiListen", ":80", "Listen for API requests on this host/port.")
+	rootCmd.PersistentFlags().IntVar(&cfg.MetricBatchInterval, "metricBatchInterval", 5, "The amount of seconds to batch the metrics collected")
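For context, here is a rough sketch of how a flag like this might drive a batch/flush loop. The `startBatchLoop` helper and the `flushGauges` callback are assumptions for illustration only, not code from this PR; in the exporter the interval would come from `cfg.MetricBatchInterval`.

```go
package main

import "time"

// startBatchLoop flushes accumulated gauge samples every batchIntervalSeconds.
// flushGauges is a hypothetical callback that averages and resets the batch.
func startBatchLoop(batchIntervalSeconds int, flushGauges func()) *time.Ticker {
	ticker := time.NewTicker(time.Duration(batchIntervalSeconds) * time.Second)
	go func() {
		for range ticker.C {
			flushGauges()
		}
	}()
	return ticker
}

func main() {
	// In the exporter this would use cfg.MetricBatchInterval from the flag above.
	t := startBatchLoop(5, func() { /* average samples, update gauges, reset */ })
	defer t.Stop()
	select {}
}
```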
Resetting on a timer requires us to sync this up with Prometheus' scraping schedule, right? The default seems to be 60 seconds, which means we would (by default) be averaging the last 5 seconds' worth of gauge stats and ignoring the 55 seconds before that.
Wouldn't resetting the stats whenever the metrics get scraped avoid this issue?
Technically, yes, but then if someone/something just hits the endpoint for testing purposes, all the metrics get jacked up.
What we could do is check whether the endpoint has been scraped during the interval: if it hasn't been scraped yet, we merge in the gauges, and if it has, we reset them.
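Roughly, that hybrid could look something like the sketch below. All names (`gaugeBatcher`, `MarkScraped`, `Flush`) are assumed for illustration; the scraped flag would be set by the /metrics handler.

```go
package main

import (
	"fmt"
	"sync"
)

// gaugeBatcher merges or resets batched gauge values depending on whether
// Prometheus has scraped the endpoint since the last flush.
type gaugeBatcher struct {
	mu                    sync.Mutex
	values                map[string]float64
	scrapedSinceLastFlush bool
}

// MarkScraped is called by the /metrics handler after serving a scrape.
func (b *gaugeBatcher) MarkScraped() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.scrapedSinceLastFlush = true
}

// Flush runs on the batch interval: reset if a scrape already consumed the
// values, otherwise keep merging so nothing is lost between scrapes.
func (b *gaugeBatcher) Flush(current map[string]float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.values == nil || b.scrapedSinceLastFlush {
		b.values = map[string]float64{}
		b.scrapedSinceLastFlush = false
	}
	for k, v := range current {
		b.values[k] = v // merge the latest batch into what's pending
	}
}

func main() {
	b := &gaugeBatcher{}
	b.Flush(map[string]float64{"cpu": 0.4}) // not scraped yet: merged
	b.MarkScraped()
	b.Flush(map[string]float64{"cpu": 0.6}) // scraped since last flush: reset, then merged
	fmt.Println(b.values)
}
```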
Signed-off-by: Matt Plachter <[email protected]>
Force-pushed from 1c4f10a to 2be4980 (Compare)
Signed-off-by: Matt Plachter <[email protected]>
Hey, I see people are trying to solve the same problem I wandered over here from. The solution I had was also "callback on GET /metrics that wipes after a scrape." If you're aggregating and squashing labels, that's the only way to prevent double counting. As far as "what if a user touches the endpoint" goes, that's easily solved with HTTP Basic Auth. I'm sure everybody's in the same boat with "they don't build computers with the amount of memory we need, so let's aggregate some stuff."
Wipe-on-scrape has the added benefit of being homeostatic. Specifically, as the fleet being aggregated horizontally scales, memory usage increases at the aggregator level. You could tune memory usage by scraping more often, causing the aggregators to wipe more often, but that would increase memory requirements at the central server. Then you just aggregate again through another layer of aggregators. From there, it's aggregators all the way down and the SREs are happy!
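For what it's worth, a wipe-on-scrape handler with Basic Auth could be sketched roughly like this using promhttp. The `resetGauges` callback, port, and credentials are placeholders, not this project's code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// scrapeAndReset serves /metrics and wipes the aggregated gauges afterwards,
// so each scrape consumes exactly one batch. resetGauges is a placeholder.
func scrapeAndReset(reg *prometheus.Registry, resetGauges func()) http.Handler {
	inner := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Basic Auth keeps casual curls from consuming the batch.
		user, pass, ok := r.BasicAuth()
		if !ok || user != "prometheus" || pass != "secret" {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		inner.ServeHTTP(w, r)
		resetGauges() // wipe after a successful, authenticated scrape
	})
}

func main() {
	reg := prometheus.NewRegistry()
	http.Handle("/metrics", scrapeAndReset(reg, func() { /* reset aggregated gauges */ }))
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```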
@faangbait, thanks for your insight here. I agree, I don't think duplicating all the metrics is valuable. We do have a few issues with gauges, though: we don't want to add them up, we need to calculate a floating average for a given probe period. We also can't just keep averaging them out between probe intervals, as that's dependent on the endpoint being scraped, which could lead to an average over different time durations.
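A minimal sketch of the kind of floating average meant here, with a reset tied to the batch interval rather than to scrapes so the window stays a fixed duration. The `probeAverage` type and its methods are assumptions for illustration, not code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// probeAverage keeps a floating (sample) average of a gauge over the current
// probe period; Reset starts the next period.
type probeAverage struct {
	mu    sync.Mutex
	sum   float64
	count int
}

func (a *probeAverage) Add(v float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.sum += v
	a.count++
}

// Value returns the average of the samples seen this probe period.
func (a *probeAverage) Value() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.count == 0 {
		return 0
	}
	return a.sum / float64(a.count)
}

// Reset runs on the batch interval, not on scrape, so the averaging window
// has a fixed duration regardless of when Prometheus hits the endpoint.
func (a *probeAverage) Reset() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.sum, a.count = 0, 0
}

func main() {
	a := &probeAverage{}
	for _, v := range []float64{1, 2, 3} {
		a.Add(v)
	}
	fmt.Println(a.Value()) // 2
	a.Reset()
}
```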
TODO: