
Guidance for remote write time > flush period #10

Closed
jdheyburn opened this issue Dec 21, 2021 · 8 comments

@jdheyburn

Hey again

During our load testing we are hitting the `Remote write took Ys while flush period is Xs` log message, so samples are likely being dropped. In our setup we are writing directly to Prometheus with the remote-write-receiver feature enabled.

I noticed that this sentence in the README refers to remote_write.queue_config for tuning:

> Depending on exact setup, it may be necessary to configure Prometheus and / or remote-write agent to handle the load. For example, see queue_config parameter of Prometheus.

However, this configuration can only be applied when Prometheus (or the target agent) is itself configured to publish to a remote write endpoint, since queue_config is a sub-block of remote_write, and remote_write.url is a required field.
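To illustrate my reading of it, here is a rough sketch of where queue_config lives in a Prometheus config: it only exists nested under a remote_write entry, which in turn requires a url (the endpoint and values below are just placeholders, not our actual setup):

```yaml
remote_write:
  - url: "http://downstream-storage:9090/api/v1/write"  # placeholder endpoint; url is mandatory
    queue_config:
      capacity: 2500             # example value: samples buffered per shard
      max_shards: 200            # example value: upper bound on parallel senders
      max_samples_per_send: 500  # example value: batch size per request
```

As far as I can tell, there is no equivalent knob on the receiving side.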

Is my understanding of this correct?

For our use case, we don't necessarily need the metrics in real time. Would it be possible to have the k6 metrics inserted sequentially so that if remote write receiver latency > flush period, the extension would keep hold of samples until all are published?

Thanks!

@yorugac
Copy link
Collaborator

yorugac commented Dec 21, 2021

Hi @jdheyburn

I believe you're right: queue_config is meant for another remote_write, not for the Prometheus endpoint itself 😞 I haven't seen a separate configuration for remote-write-receiver, but looking at its official design doc, it explicitly says that the Prometheus receiver is not meant to solve issues like scaling, and that one should look to Cortex for that.

Other than experimenting with configuration options and the flush period, another thing that should speed up the push of metrics is dropping tags completely, though that does mean losing information.
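For example, on the k6 side something like this in the script options cuts down how many unique label combinations have to be pushed (just a sketch; which tags you can afford to drop depends on what you actually need to see):

```javascript
// k6 script options: keep only a minimal set of system tags so that far
// fewer unique time series reach the remote write output.
// The tags listed here are only an example - adjust to your own needs.
export const options = {
  systemTags: ['status', 'method', 'name'],
};

export default function () {
  // ... your test logic as usual
}
```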

> For our use case, we don't necessarily need the metrics in real time. Would it be possible to have the k6 metrics inserted sequentially so that if remote write receiver latency > flush period, the extension would keep hold of samples until all are published?

At the moment, the extension starts to drop samples in this situation. Holding samples in memory is not feasible: they grow very quickly. Storing them in any kind of intermediate storage contradicts the purpose here. Still, other approaches can be tried, of course, but I personally would recommend against spending too much time on that at this point, because the current implementation is not at its most efficient. It is a known problem that depends on ongoing work in k6 itself and is described in open issues #2 and #3. Hopefully, we'll be able to start resolving this next year, once k6 has some new implementations merged in.

Hope that helps 🙂

@jdheyburn
Author

Sorry for the late reply, and thanks for the insight. I wasn't able to rely accurately on the Prometheus metrics in the end. I am currently using k6 to load test Redis against approximately 20-30 commands, each with its own latency metrics being captured, with a 1s flush period. I think this is just too many metrics to store, and it defeats the original purpose of the remote_write endpoint.

A possibility could be to allow Prometheus to scrape k6 for the metrics itself; that way we could mitigate the ingestion overload. Not that I'm requesting it, since it would require a redesign, but I'm interested in hearing why the status quo implementation was picked :)

@yorugac
Collaborator

yorugac commented Jan 13, 2022

Hi @jdheyburn,
I'm sorry to hear that 😞 Have you considered dropping tags for your use case? Also, I'm curious: do you use the xk6-redis extension?

As for the history of Prometheus vs k6, AFAIK it is a long one, and I suggest looking at the following:

Some comments there could help with an answer, but in short: k6 itself cannot be viewed as a server; it is rather an instrumentation tool with its own complexities and limitations, and a scraping endpoint is one of the things that fall outside of those limitations, at least for the foreseeable future. Outputs are simply less restrictive and far easier to add to k6 than a scraping endpoint.
Additionally, k6 had separate requests for remote_write specifically:

And this extension is essentially a response to the above issue: a way to have native k6 support for Prometheus right now. Granted, it can definitely be improved further with more metrics refactoring in k6, e.g. solving grafana/k6#1831.

Hope that helps with understanding the status quo 🙂

@jdheyburn
Author

jdheyburn commented Jan 13, 2022

Olha, I admire the detail of your replies - not just on this issue but on others I've seen as well. You're an asset to the community and I thank you! The rationale makes complete sense now. 👍🏻

I am not using the xk6-redis extension, since it provides limited commands. I am using a custom Go extension; my entire load-testing process is inspired by similar work from GitLab. Based on a tcpdump from production, there are several commands being captured - 41, in fact - so I imagine there are a lot of metrics to capture. I recently ran --out csv=results.csv for one 5-minute load test, and it produced a 2.2GB CSV file. While not a 1:1 mapping, that is a lot of data to be captured.

Edit: Sorry, I lied - I am using xk6-redis 🙂

@jdheyburn
Author

I just stumbled across https://github.com/szkiba/xk6-prometheus, which looks like it exposes k6 metrics for Prometheus to scrape.

@yorugac
Collaborator

yorugac commented Jan 14, 2022

Thank you for your kind words, Joseph ☺️

Yes, xk6-prometheus is an alternative from a community contributor; I think it was mentioned in the links from yesterday. We don't support it directly, but if it can help with your use case, all the better! Still, please watch for updates here, for when we do get to improving the efficiency of the PRW extension 🙂

@jdheyburn
Author

Just came to say I managed to get metrics out successfully with xk6-prometheus, but I'll keep an eye on this extension too so that I can make a true comparison. I have a 1s scrape interval set up, and given that I don't run tests all that often, I would prefer the push model that xk6-output-prometheus-remote provides, so that Prometheus isn't scraping unnecessarily 😄
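For reference, the scrape side of my setup looks roughly like this (the target address below is a placeholder for wherever the k6 host exposes the xk6-prometheus listener; check that extension's docs for its actual default port):

```yaml
scrape_configs:
  - job_name: "k6"
    scrape_interval: 1s            # the 1s scrape mentioned above
    static_configs:
      # placeholder target: the host/port where xk6-prometheus listens
      # during the test run
      - targets: ["k6-runner:5656"]
```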

@codebien
Contributor

All the major dependencies of this issue have been resolved. This should now only happen if the server is experiencing really heavy load or in case of network faults. So, if it still happens under low load, a bug should be reported.
