
Too many time series warning #2901

Merged (2 commits, Mar 29, 2023)

Conversation

@codebien (Contributor) commented on Feb 6, 2023:

Closes #2765

At the moment the implementation doesn't support turning the limit off; we may consider using 0 for that in the future.

@codebien codebien added this to the v0.44.0 milestone Feb 6, 2023
@codebien codebien self-assigned this Feb 6, 2023
@github-actions github-actions bot requested review from imiric and na-- February 6, 2023 12:09
@codecov-commenter commented on Feb 6, 2023:

Codecov Report

Merging #2901 (21befe9) into master (4f5c142) will increase coverage by 0.09%.
The diff coverage is 100.00%.

❗ Current head 21befe9 differs from pull request most recent head 1902a96. Consider uploading reports for the commit 1902a96 to get more accurate results.

@@            Coverage Diff             @@
##           master    #2901      +/-   ##
==========================================
+ Coverage   76.84%   76.94%   +0.09%     
==========================================
  Files         228      228              
  Lines       16988    17018      +30     
==========================================
+ Hits        13055    13094      +39     
+ Misses       3087     3081       -6     
+ Partials      846      843       -3     
Flag Coverage Δ
ubuntu 76.94% <100.00%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
lib/testutils/logrus_hook.go 100.00% <100.00%> (ø)
metrics/engine/engine.go 88.98% <100.00%> (+0.09%) ⬆️
metrics/engine/ingester.go 91.48% <100.00%> (+6.30%) ⬆️

... and 6 files with indirect coverage changes


@imiric (Contributor) left a comment:
Nice work 👍

I have some suggestions, particularly around testing.

Review comments on metrics/engine/ingester.go (4 threads, outdated and resolved).
Comment on lines 115 to 119
ingester.seenTimeSeries[metrics.TimeSeries{
	Metric: testMetric,
	Tags:   piState.Registry.RootTagSet().With("a", "2"),
}] = struct{}{}
ingester.warnTimeSeriesLimit()
Contributor:
I think this test is OK as a warnTimeSeriesLimit() unit test, but you're not testing that flushMetrics() will actually run this check and emit the warning.

So I'd rather see a higher-level test at the output level, but feel free to keep this one as well.

Contributor:
This is still not covered.

Can you add another test that calls flushMetrics() and confirms the warning is logged?
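As a rough illustration of the kind of higher-level check being asked for here, the sketch below drives a flushMetrics()-style method past a limit and asserts the warning through logrus's test hook. The stand-in type, field names, and warning text are hypothetical; only the logrus test-hook API is real, and the actual k6 test would construct the real ingester instead.

```go
// Sketch only: a minimal stand-in for the ingester that warns when the number
// of distinct time series it has seen crosses a limit, plus a test that
// asserts the warning through logrus's test hook. All names are hypothetical.
package enginesketch

import (
	"testing"

	"github.com/sirupsen/logrus"
	"github.com/sirupsen/logrus/hooks/test"
)

type fakeIngester struct {
	logger          logrus.FieldLogger
	seenTimeSeries  map[string]struct{}
	timeSeriesLimit int
}

// flushMetrics registers the series seen in a flush and warns once the
// configured limit is exceeded.
func (i *fakeIngester) flushMetrics(seriesKeys []string) {
	for _, k := range seriesKeys {
		i.seenTimeSeries[k] = struct{}{}
	}
	if len(i.seenTimeSeries) > i.timeSeriesLimit {
		i.logger.Warnf(
			"generated metrics with %d unique time series, which is more than the suggested limit of %d",
			len(i.seenTimeSeries), i.timeSeriesLimit,
		)
	}
}

func TestFlushMetricsWarnsOnTooManyTimeSeries(t *testing.T) {
	logger, hook := test.NewNullLogger() // captures log entries instead of printing them

	ing := &fakeIngester{
		logger:          logger,
		seenTimeSeries:  map[string]struct{}{},
		timeSeriesLimit: 2,
	}

	ing.flushMetrics([]string{"a", "b", "c"}) // 3 unique series > limit of 2

	if len(hook.Entries) != 1 || hook.LastEntry().Level != logrus.WarnLevel {
		t.Fatalf("expected exactly one warning, got %d log entries", len(hook.Entries))
	}
}
```

The point is that the assertion goes through flushMetrics() rather than calling the warning helper directly, so the wiring between the flush path and the warning is what gets covered.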

@codebien (author):
Added

@codebien codebien force-pushed the time-series-limit branch 2 times, most recently from b6856fc to d81ad67 Compare March 14, 2023 17:52
@codebien (author) commented:

@imiric @na-- I pushed a HyperLogLog implementation; the last commit contains a comparison between a map and a potential HyperLogLog. The map uses ~4x the B/op.

The memory saving is nice, but it comes at a cost: the first is the obvious complexity of the implementation, but the CPU effort is also different.

  1. Complexity: the current hashing mechanism for the TimeSeries struct uses a hacky way to reuse Go's internal hash function for the map type, which is very efficient. On the other hand, this is fragile because it depends on Go's internals. If we adopted this solution, we should probably reproduce directly in k6 the hack that the external library is doing, so we can drop the new dependency; I haven't found another way to do this without relying on something slow.
     As for the HLL implementation itself, I think we could have our own internal one, but for a first release it should be fine to rely on the external library.

  2. CPU time: the map solution is mostly invisible, while HLL.Add adds ~150ns/op.
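For context, this is roughly the shape of benchmark that produces such ns/op and B/op figures for the map side; the key type and names below are illustrative stand-ins, not the PR's actual benchmark or the real metrics.TimeSeries.

```go
// Illustrative benchmark shape only: measures ns/op and B/op for tracking
// unique series in a plain map. The timeSeries key is a stand-in with two
// pointer fields, loosely mirroring a Metric/Tags pair.
package sketch

import (
	"strconv"
	"testing"
)

type timeSeries struct {
	metric *string
	tags   *string
}

func BenchmarkMapUniqueSeries(b *testing.B) {
	b.ReportAllocs() // print B/op and allocs/op alongside ns/op

	metric := "http_req_duration"
	// Pre-build a fixed pool of "tag set" pointers so that repeated inserts of
	// the same series hit the same map key, roughly like interned tag sets do.
	const cardinality = 100_000
	tags := make([]*string, cardinality)
	for i := range tags {
		s := "url=" + strconv.Itoa(i)
		tags[i] = &s
	}

	seen := make(map[timeSeries]struct{}, cardinality)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		seen[timeSeries{metric: &metric, tags: tags[i%cardinality]}] = struct{}{}
	}
}
```

Running it with `go test -bench MapUniqueSeries` prints the ns/op, B/op, and allocs/op columns that the comparison above refers to.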

@imiric (Contributor) left a comment:
Nicely done! Almost LGTM, but I think we're missing a flushMetrics() test.

The HLL lib is a large dependency, but I agree that it's worth it in this case.

Review comment on metrics/engine/ingester.go (outdated, resolved).

@na-- (Member) commented Mar 27, 2023:

Sorry for the late review 😞

Honestly, I think I am leaning towards the map solution, despite being the one to propose HLL... 🤔 For 100k elements the map uses 4x the memory, but that is still only 3MB, and the map is faster. For 1 million elements HLL does somewhat better, 50MB for a map vs 8MB for HLL (and HLL is a bit faster there). But 50MB is still basically nothing, especially compared to the memory used by the strings of those 1M time series, which will probably be several GB 😅

I don't see a situation where the memory usage of this counter would even remotely matter, compared to everything else. If we were already using an HLL library for something else in the k6 codebase, it might be worth it to use it here as well. But I don't think it's worth the huge dependency just for this. Besides, the map solution provides us with an exact count of the time series, while the HLL variant will only have an approximate count, so its UX would be a bit worse.

TLDR: I vote for the map
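A rough, independent way to sanity-check numbers of this order (not taken from the PR's benchmarks) is to measure the live heap before and after filling a map with a million synthetic two-pointer keys; the snippet below is such a back-of-envelope sketch with invented names.

```go
// Rough heap measurement sketch: how much memory does a map with one million
// two-pointer keys hold? Purely illustrative; not the PR's benchmark code.
package main

import (
	"fmt"
	"runtime"
)

type timeSeries struct {
	metric *string
	tags   *string
}

func liveHeap() uint64 {
	runtime.GC() // settle the heap so HeapAlloc reflects live objects
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	const n = 1_000_000
	metric := "http_req_duration"
	tags := make([]*string, n) // allocated before the baseline, so excluded from it
	for i := range tags {
		s := fmt.Sprintf("url=%d", i)
		tags[i] = &s
	}

	before := liveHeap()
	seen := make(map[timeSeries]struct{})
	for i := 0; i < n; i++ {
		seen[timeSeries{metric: &metric, tags: tags[i]}] = struct{}{}
	}
	after := liveHeap()

	fmt.Printf("~%.1f MB for %d map entries (len=%d)\n",
		(float64(after)-float64(before))/(1<<20), n, len(seen))
}
```

With 16-byte keys plus Go's map bucket overhead, a few tens of MB for a million entries is the expected ballpark, consistent with the ~50MB figure quoted above.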

@codebien codebien force-pushed the time-series-limit branch 2 times, most recently from 7328582 to dc024b6 Compare March 28, 2023 13:39
Review comments on metrics/engine/ingester.go (4 threads resolved, 3 of them outdated).
Comment on lines +146 to +149
// we don't care about overflow
// the process should be already OOM
// if the number of generated time series goes higher than N-hundred-million(s).
cc.timeSeriesLimit *= 2
Contributor:
Is the idea that we want users to get the message multiple times as they continue to create more time series?

I am not really against that; it's just that, if we forgo this, we could actually stop counting once the limit is hit and even throw away the seen map.

I'm not really certain users will be better off seeing this at 100k, 200k, 400k, 800k, 1.6M and so on time series.

@na-- (Member), Mar 29, 2023:
🤔 Yeah, maybe we can show that warning 5 times and then stop tracking the time series count 🤔 Not that such an optimization will help very much at 3.2 million time series, but still 😅

Contributor:
I am mostly arguing for doing it only once.

I doubt that, if this warning is ignored or not seen the first time, having it appear 3-4 more times will be much more beneficial, if at all.

But this can definitely be done later.

Member:
I'd argue that it would be useful to see it a few times, since the time between warnings would give information about how quickly the number of time series grows, which is impossible to see or guess otherwise with local k6 runs, and not always easy even with an external output.

@codebien (author):
I think it is also helpful for us, when we get reports, to see what the common order of magnitude is. We should act differently if users often hit around millions rather than thousands.
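To make the two behaviours discussed in this thread concrete, here is a small sketch (not the merged k6 code, all names hypothetical): one variant keeps doubling the limit after each warning, as in the quoted diff, while the other warns a bounded number of times and then drops the tracking map, as suggested above.

```go
// Sketch of the two warning strategies discussed above; not the merged k6 code.
package sketch

import "github.com/sirupsen/logrus"

type seriesTracker struct {
	logger       logrus.FieldLogger
	seen         map[string]struct{}
	limit        int
	warningsLeft int // used only by the "warn a few times, then stop" variant
}

// checkDoubling warns each time the current limit is crossed and then doubles
// the limit, so warnings arrive at 100k, 200k, 400k, ... series.
func (t *seriesTracker) checkDoubling() {
	if len(t.seen) <= t.limit {
		return
	}
	t.logger.Warnf("%d unique time series seen, above the suggested limit of %d",
		len(t.seen), t.limit)
	// Overflow is not a practical concern: the process would run out of memory
	// long before the doubled limit could wrap around.
	t.limit *= 2
}

// checkBounded warns at most warningsLeft times and then releases the tracking
// map, so no further counting (or memory) is spent on it; the caller would
// also stop inserting into the map at that point.
func (t *seriesTracker) checkBounded() {
	if t.seen == nil || len(t.seen) <= t.limit {
		return
	}
	t.logger.Warnf("%d unique time series seen, above the suggested limit of %d",
		len(t.seen), t.limit)
	t.limit *= 2
	t.warningsLeft--
	if t.warningsLeft == 0 {
		t.seen = nil // stop tracking entirely after the last warning
	}
}
```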

@codebien codebien merged commit dccdf1b into master Mar 29, 2023
@codebien codebien deleted the time-series-limit branch March 29, 2023 15:24
Linked issue: Add warning if there are too many time series
5 participants