
Guidance for remote write time > flush period #10

Closed
jdheyburn opened this issue Dec 21, 2021 · 8 comments

@jdheyburn

Hey again

During our load testing we are hitting the `Remote write took Ys while flush period is Xs` log message, so samples are likely being dropped. In our setup we are writing directly to Prometheus with the remote-write-receiver feature enabled.

I noticed that this sentence in the README refers to remote_write.queue_config for tuning:

> Depending on exact setup, it may be necessary to configure Prometheus and / or remote-write agent to handle the load. For example, see queue_config parameter of Prometheus.

However, this configuration can only be applied when Prometheus (or the target agent) is itself configured to publish to a remote write endpoint, since queue_config is a sub-block of remote_write, and remote_write.url is a required field.
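To illustrate my reading of it, here is a rough sketch of where queue_config lives in a Prometheus config: it only exists nested under a remote_write entry, which in turn requires a url (the endpoint and values below are just placeholders, not our actual setup):

```yaml
remote_write:
  - url: "http://downstream-storage:9090/api/v1/write"  # placeholder endpoint; url is mandatory
    queue_config:
      capacity: 2500             # example value: samples buffered per shard
      max_shards: 200            # example value: upper bound on parallel senders
      max_samples_per_send: 500  # example value: batch size per request
```

As far as I can tell, there is no equivalent knob on the receiving side.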

Is my understanding of this correct?

For our use case, we don't necessarily need the metrics in real time. Would it be possible to have the k6 metrics inserted sequentially so that if remote write receiver latency > flush period, the extension would keep hold of samples until all are published?

Thanks!

@yorugac
Copy link
Collaborator

yorugac commented Dec 21, 2021

Hi @jdheyburn

I believe you're right: queue_config is meant for another remote_write, not for the Prometheus endpoint itself 😞 I haven't seen a separate configuration for remote-write-receiver, but looking at its official design doc, it explicitly says that the Prometheus receiver is not meant to solve issues like scaling, and that one should look to Cortex for that.

Other than experimenting with configuration options and the flush period, another thing that should speed up the push of metrics is dropping tags completely, though that does mean losing information.
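For example, on the k6 side something like this in the script options cuts down how many unique label combinations have to be pushed (just a sketch; which tags you can afford to drop depends on what you actually need to see):

```javascript
// k6 script options: keep only a minimal set of system tags so that far
// fewer unique time series reach the remote write output.
// The tags listed here are only an example - adjust to your own needs.
export const options = {
  systemTags: ['status', 'method', 'name'],
};

export default function () {
  // ... your test logic as usual
}
```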

> For our use case, we don't necessarily need the metrics in real time. Would it be possible to have the k6 metrics inserted sequentially so that if remote write receiver latency > flush period, the extension would keep hold of samples until all are published?

At the moment, the extension starts to drop samples in this situation. Holding samples in memory is not feasible: they grow very quickly. Storing them in any kind of intermediate storage contradicts the purpose here. Still, other approaches can be tried, of course, but I personally would recommend against spending too much time on that at this point, because the current implementation is not at its most efficient. It is a known problem that depends on ongoing work in k6 itself and is described in open issues #2 and #3. Hopefully, we'll be able to start resolving this next year, once k6 has some new implementations merged in.

Hope that helps 🙂

@jdheyburn
Author

Sorry for the late reply, and thanks for the insight. I wasn't able to rely accurately on the Prometheus metrics in the end. I am currently using k6 to load test Redis against approximately 20-30 commands, each with its own latency metrics being captured, with a 1s flush period. I think this is just too many metrics to store, and it defeats the original purpose of the remote_write endpoint.

A possibility could be to allow Prometheus to scrape k6 for the metrics itself; that way we could mitigate the ingestion overload. Not that I'm requesting it, since it would require a redesign, but I'm interested in hearing why the status quo implementation was picked :)

@yorugac
Collaborator

yorugac commented Jan 13, 2022

Hi @jdheyburn,
I'm sorry to hear that 😞 Have you considered dropping tags for your use case? Also, I'm curious: do you use the xk6-redis extension?

As for the history of Prometheus vs k6, AFAIK it is a long one, and I suggest looking at the following:

Some comments there could help with an answer, but in short: k6 itself cannot be viewed as a server; it is rather an instrumentation tool with its own complexities and limitations, and a scraping endpoint is one of the things that fall outside of those limitations, at least for the foreseeable future. Outputs are simply less restrictive and far easier to add to k6 than a scraping endpoint.
Additionally, k6 had separate requests for remote_write specifically:

And this extension is essentially a response to the above issue: a way to have native k6 support for Prometheus right now. Granted, it can definitely be improved further with more metrics refactoring in k6, e.g. solving grafana/k6#1831.

Hope that helps with understanding the status quo 🙂

@jdheyburn
Author

jdheyburn commented Jan 13, 2022

Olha, I admire the detail of your replies - not just on this issue but on others I've seen as well. You're an asset to the community and I thank you! The rationale makes complete sense now. 👍🏻

I am not using the xk6-redis extension, since it provides limited commands. I am using a custom Go extension; my entire load-testing process is inspired by similar work from GitLab. Based on a tcpdump from production, there are several commands being captured - 41, in fact - so I imagine there are a lot of metrics to capture. I recently ran --out csv=results.csv for one 5-minute load test, and it produced a 2.2GB CSV file. While not a 1:1 mapping, that is a lot of data to be captured.

Edit: Sorry, I lied - I am using xk6-redis 🙂

@jdheyburn
Author

I just stumbled across https://github.com/szkiba/xk6-prometheus, which looks like it exposes k6 metrics for Prometheus to scrape.

@yorugac
Collaborator

yorugac commented Jan 14, 2022

Thank you for your kind words, Joseph ☺️

Yes, xk6-prometheus is an alternative from a community contributor; I think it was mentioned in the links from yesterday. We don't support it directly, but if it can help with your use case, all the better! Still, please watch for updates here, for when we do get to improving the efficiency of the PRW extension 🙂

@jdheyburn
Author

Just came to say I managed to get metrics out successfully with xk6-prometheus, but I'll keep an eye on this extension too so that I can make a true comparison. I have a 1s scrape interval set up, and given that I don't run tests all that often, I would prefer the push model that xk6-output-prometheus-remote provides, so that Prometheus isn't scraping unnecessarily 😄
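For reference, the scrape side of my setup looks roughly like this (the target address below is a placeholder for wherever the k6 host exposes the xk6-prometheus listener; check that extension's docs for its actual default port):

```yaml
scrape_configs:
  - job_name: "k6"
    scrape_interval: 1s            # the 1s scrape mentioned above
    static_configs:
      # placeholder target: the host/port where xk6-prometheus listens
      # during the test run
      - targets: ["k6-runner:5656"]
```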

@codebien
Contributor

All the major dependencies of this issue have been resolved. This should now only happen if the server is experiencing really heavy load or in case of network faults. So, if it still happens under low load, a bug should be reported.
