Too many time series warning #2901
Conversation
Force-pushed from 43a0ba2 to 8509d21
Codecov Report
@@ Coverage Diff @@
## master #2901 +/- ##
==========================================
+ Coverage 76.84% 76.94% +0.09%
==========================================
Files 228 228
Lines 16988 17018 +30
==========================================
+ Hits 13055 13094 +39
+ Misses 3087 3081 -6
+ Partials 846 843 -3
Flags with carried forward coverage won't be shown.
... and 6 files with indirect coverage changes
Nice work 👍
I have some suggestions, particularly around testing.
metrics/engine/ingester_test.go (Outdated)
ingester.seenTimeSeries[metrics.TimeSeries{
    Metric: testMetric,
    Tags:   piState.Registry.RootTagSet().With("a", "2"),
}] = struct{}{}
ingester.warnTimeSeriesLimit()
I think this test is OK as a warnTimeSeriesLimit() unit test, but you're not testing that flushMetrics() will actually run this check and emit the warning. So I'd rather see a higher-level test at the output level, but feel free to keep this one as well.
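For illustration, a higher-level test along those lines might look roughly like the sketch below. It assumes the warning goes through a logrus logger that can be captured with a test hook; newTestOutputIngester and the limit of 2 are hypothetical helpers, not code from this PR:

// Sketch only; assumes the usual test imports (testing, testify's assert/require,
// logrus and logrus/hooks/test, go.k6.io/k6/metrics) and a hypothetical
// newTestOutputIngester helper that wires up the logger and a low limit.
func TestFlushMetricsWarnsOnTooManyTimeSeries(t *testing.T) {
    t.Parallel()

    logger, hook := test.NewNullLogger() // captures emitted log entries
    ingester := newTestOutputIngester(t, logger, 2 /* assumed time series limit */)

    // Add samples for three distinct time series, one more than the assumed limit.
    for _, v := range []string{"a", "b", "c"} {
        ingester.AddMetricSamples([]metrics.SampleContainer{metrics.Sample{
            TimeSeries: metrics.TimeSeries{
                Metric: testMetric,
                Tags:   piState.Registry.RootTagSet().With("tag", v),
            },
            Value: 1,
        }})
    }

    ingester.flushMetrics()

    // Expect a single warning mentioning the time series count.
    require.Len(t, hook.Entries, 1)
    assert.Equal(t, logrus.WarnLevel, hook.Entries[0].Level)
    assert.Contains(t, hook.Entries[0].Message, "time series")
}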
This is still not covered. Can you add another test that calls flushMetrics() and confirms the warning is logged?
Added
Force-pushed from b6856fc to d81ad67
@imiric @na-- I pushed a HyperLogLog implementation; the last commit contains a benchmark comparing a map with a potential HyperLogLog. The map uses ~4x more B/op. That saving is good, but it comes at a cost: the obvious added complexity of the implementation, and also a different CPU effort.
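For readers without access to that commit, a benchmark of that shape could look roughly like the sketch below. It assumes the github.com/axiomhq/hyperloglog package and a simple string key; the actual benchmark in the commit may be structured differently:

// Rough map-vs-HLL cardinality benchmark sketch (imports: testing, strconv,
// github.com/axiomhq/hyperloglog). Key shape and sizes are assumptions.
func BenchmarkSeenTimeSeriesMap(b *testing.B) {
    b.ReportAllocs()
    seen := make(map[string]struct{})
    for i := 0; i < b.N; i++ {
        seen[strconv.Itoa(i%100000)] = struct{}{} // exact count via len(seen)
    }
}

func BenchmarkSeenTimeSeriesHLL(b *testing.B) {
    b.ReportAllocs()
    sketch := hyperloglog.New14() // fixed-size sketch, approximate count via sketch.Estimate()
    for i := 0; i < b.N; i++ {
        sketch.Insert([]byte(strconv.Itoa(i % 100000)))
    }
}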
Force-pushed from d81ad67 to 5c7058f
Nicely done! Almost LGTM, but I think we're missing a flushMetrics() test.
The HLL lib is a large dependency, but I agree that it's worth it in this case.
metrics/engine/ingester_test.go (Outdated)
ingester.seenTimeSeries[metrics.TimeSeries{
    Metric: testMetric,
    Tags:   piState.Registry.RootTagSet().With("a", "2"),
}] = struct{}{}
ingester.warnTimeSeriesLimit()
This is still not covered. Can you add another test that calls flushMetrics() and confirms the warning is logged?
Sorry for the late review 😞

Honestly, I think I am leaning towards the map solution, despite being the one to propose HLL... 🤔 For 100k elements, we get 4x the memory usage, but that is still only 3MB, and it's faster. For 1 million elements, we get a slightly better result, 50MB for a map vs 8MB for HLL (and HLL is a bit faster). But 50MB is still basically nothing, especially compared to the actual memory usage for the strings for these 1M time series, which will probably be several GB 😅 I don't see a situation where the memory usage of this counter would even remotely matter, compared to everything else.

If we were already using a HLL library for something else in the k6 codebase, it might be worth it to use it here as well. But I don't think it's worth the huge dependency just for this. Besides, the map solution provides us with an exact count of the time series, while the HLL variant will only have an approximate count, so its UX would be a bit worse.

TLDR: I vote for the map
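As a rough sanity check of the 50MB figure, a back-of-the-envelope calculation (the constants below are assumptions for illustration, not measurements from this PR):

package main

import "fmt"

// Assumes a metrics.TimeSeries key is essentially two pointers (16 bytes on a
// 64-bit platform) and that Go maps add roughly 3x per-entry overhead for
// buckets and hashing; both figures are assumptions.
func main() {
    const entries, keyBytes, overhead = 1_000_000, 16, 3
    fmt.Println(entries*keyBytes*overhead/(1<<20), "MiB") // ~45 MiB, the same ballpark as the ~50MB above
}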
Force-pushed from 7328582 to dc024b6
Force-pushed from dc024b6 to 47e3d08
Force-pushed from c6a1857 to ea235c0
Force-pushed from ea235c0 to 1902a96
// We don't care about overflow;
// the process should already be OOM
// if the number of generated time series goes higher than N-hundred-million(s).
cc.timeSeriesLimit *= 2
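For context in this thread, the surrounding check works conceptually like the sketch below; the field and method names are taken from the snippets quoted here, but the body and warning message are assumed rather than copied from the PR:

// Hypothetical container combining names seen in this thread; the real PR
// may organize these fields differently.
type cardinalityControl struct {
    logger          logrus.FieldLogger
    seenTimeSeries  map[metrics.TimeSeries]struct{}
    timeSeriesLimit int
}

func (cc *cardinalityControl) warnTimeSeriesLimit() {
    if len(cc.seenTimeSeries) <= cc.timeSeriesLimit {
        return
    }
    cc.logger.Warnf("generated %d unique time series, which is above the configured limit of %d",
        len(cc.seenTimeSeries), cc.timeSeriesLimit)
    // Doubling means the warning repeats at 100k, 200k, 400k, ... rather than
    // on every flush; overflow is irrelevant because the process would be OOM
    // long before the counter wraps.
    cc.timeSeriesLimit *= 2
}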
Is the idea that we want them to get the message multiple times as they continue to create more time series?
I am not really against that; I'm just noting that if we forgo this, we can actually stop counting once the limit is hit and even throw away the seen map.
I'm not really certain that users will be better off seeing this at 100k, 200k, 400k, 800k, 1.6m and so on time series.
🤔 yeah, maybe we can show that warning 5 times and then stop tracking the time series count 🤔 not that such an optimization will help very much at 3.2 million time series, but still 😅
I am mostly arguing for doing it only once.
I doubt that, if this warning is ignored or not seen the first time, having it appear 3-4 more times will be much more beneficial, if at all.
But this can definitely be done later.
I'd argue that it would be useful to see it a few times, since the time between warnings would give information about how quickly the number of time series grows, which is impossible to see or guess otherwise with local k6 runs, and not always easy even with an external output.
I think it is also helpful for us, in the case of reports, to see what the common order of magnitude is. We should act differently if users often hit around millions instead of thousands.
Closes #2765
At the moment the implementation doesn't support turning the limit off; we may consider using 0 for that in the future.
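If such an option is added later, the check could treat a zero value as "disabled", along the lines of this sketch (reusing the hypothetical cardinalityControl names from the earlier sketch; this is not code from the PR):

func (cc *cardinalityControl) maybeWarnTimeSeriesLimit() {
    // A timeSeriesLimit of 0 would mean "never warn" (hypothetical future option).
    if cc.timeSeriesLimit > 0 && len(cc.seenTimeSeries) > cc.timeSeriesLimit {
        cc.warnTimeSeriesLimit()
    }
}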