Gaps in reports #1100

Closed
tkdeshpande opened this issue Aug 3, 2019 · 5 comments
Comments
tkdeshpande commented Aug 3, 2019

Running a test with InfluxDB and Grafana shows gaps in the reports.
[screenshot: gaps in the Grafana report]
This happens for any test that lasts an hour. I've tried at least 5 times and there are gaps in each one.
The reports become unusable because of the gaps, and my organisation is seriously looking at other options if this doesn't work out.
Please help!

mstoykov commented Aug 5, 2019

Can you provide us with some logs from the execution of k6? If something went wrong, k6 would have logged it ;)

Maybe InfluxDB is being overloaded, as in #1060? Or maybe you are hitting InfluxDB's body size limit? In either case, more info is needed to debug your problem.
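For reference, one way to capture those logs is to redirect k6's stderr to a file while running with verbose logging; the script path and InfluxDB URL below are placeholders for your own setup:

```sh
# Placeholder script path and InfluxDB URL - adjust to your environment.
k6 run --verbose --out influxdb=http://localhost:8086/myk6db script.js 2> k6.log

# Then look for anything the InfluxDB output complained about:
grep -iE "warn|error" k6.log
```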

mstoykov commented Aug 9, 2019

After some more investigating, I am pretty sure this happens because k6 fails to send the metrics to InfluxDB and doesn't retry.
So if you are hitting the body size limit, or any other kind of limit, you will get a warning in the k6 terminal, but you won't get your data. Can you check the k6 execution logs and see if any error/warning messages are emitted?

Looking at the code, I am going to make a lot of changes anyway, so maybe I will try to add some retry logic akin to the cloud output.

mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to InfluxDB every second, but if that write took more than 1 second it would not start a second write and would instead wait for it. This generally led to the write times getting bigger and bigger as more and more data was written, until the max body size that InfluxDB will accept was reached, at which point it returned an error and k6 dropped that data.

With this commit there is a configurable number of parallel writes (10 by default) that trigger again every 1 second (also now configurable); if those get exhausted, k6 starts queueing the samples each second instead of combining them and then writing one big chunk, which has a chance of hitting the max body size.

I tested with a simple script doing batch requests for an empty local file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB of memory usage.

Previous to this commit the RAM usage was ~5.7GB for 5736 RPS, and practically all the data gets lost if you don't raise the max body size; even then a lot of the data is lost while the memory usage goes up.

After this commit the RAM usage was ~2.4GB (or less in some of the tests) with 6273 RPS and there was no loss of data.

Even with this commit, running 2 hours of that simple script dies after 1 hour and 35 minutes, using around 15GB (the test system has 16). I can't be sure about data loss, as InfluxDB ate 32GB of memory trying to visualize it.

Some minor problems with this solution are that:
1. We use a lot of goroutines if things start slowing down - probably not a big problem.
2. We could probably batch better if we add/keep all the unsent samples together.
3. By far the biggest: because the writes are slow, if the test is stopped (with Ctrl+C) or finishes naturally, waiting for those writes can take a considerable amount of time - in the above example the 4-minute tests generally took around 5 minutes :(

All of those can be better handled with some more sophisticated queueing code at a later time.

closes #1081, fixes #1100, fixes #182
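For illustration only, here is a minimal Go sketch of the write scheduling described above. This is not the actual k6 code; `collectSamples` and `writeBatch` are made-up stand-ins for the real sample buffer and the InfluxDB HTTP write:

```go
package main

import (
	"fmt"
	"time"
)

// A ticker fires every pushInterval; at most concurrentWrites writes run in
// parallel, tracked by a semaphore channel. If all slots are busy, the freshly
// collected samples are queued instead of being merged into one growing body.
func main() {
	const (
		pushInterval     = 1 * time.Second
		concurrentWrites = 10
	)

	semaphore := make(chan struct{}, concurrentWrites)
	var queued [][]string // batches waiting for a free write slot

	ticker := time.NewTicker(pushInterval)
	defer ticker.Stop()

	for i := 0; i < 5; i++ { // a few ticks, just to demonstrate
		<-ticker.C
		batch := collectSamples()

		select {
		case semaphore <- struct{}{}:
			// A write slot is free: flush anything queued earlier too.
			toSend := append(queued, batch)
			queued = nil
			go func(batches [][]string) {
				defer func() { <-semaphore }()
				writeBatch(batches)
			}(toSend)
		default:
			// All slots busy: queue the samples instead of blocking or
			// concatenating them into one huge request body.
			queued = append(queued, batch)
		}
	}
	time.Sleep(2 * time.Second) // let in-flight writes finish (toy example only)
}

// collectSamples stands in for draining the metric samples buffered since the
// previous tick.
func collectSamples() []string {
	return []string{fmt.Sprintf("sample@%s", time.Now().Format(time.RFC3339Nano))}
}

// writeBatch stands in for the HTTP POST to InfluxDB's /write endpoint.
func writeBatch(batches [][]string) {
	total := 0
	for _, b := range batches {
		total += len(b)
	}
	fmt.Printf("writing %d sample(s) in %d batch(es)\n", total, len(batches))
	time.Sleep(500 * time.Millisecond) // simulate a slow write
}
```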
mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to InfluxDB every second, but if that write took more than 1 second it would not start a second write and would instead wait for it. This generally led to the write times getting bigger and bigger as more and more data was written, until the max body size that InfluxDB will accept was reached, at which point it returned an error and k6 dropped that data.

With this commit there is a configurable number of parallel writes (10 by default), starting again every 1 second (also now configurable). Additionally, if we reach the 10 concurrent writes, instead of sending all the data that accumulates we just queue the samples that were generated. This should considerably help with not hitting InfluxDB's max body size.

I tested with a simple script doing batch requests for an empty local file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB of memory usage.

Previous to this commit the RAM usage was ~5.7GB for 5736 RPS, and practically all the data gets lost if you don't raise the max body size; even then a lot of the data is lost while the memory usage goes up.

After this commit the RAM usage was ~2.4GB (or less in some of the tests) with 6273 RPS and there was no loss of data.

Even with this commit, running 2 hours of that simple script dies after 1 hour and 35 minutes, using around 15GB (the test system has 16). I can't be sure about data loss, as InfluxDB ate 32GB of memory trying to visualize it and I had to kill it ;(.

Some problems with this solution are that:
1. We use a lot of goroutines if things start slowing down - probably not a big problem, but still a good idea to fix.
2. We could probably batch better if we add/keep all the unsent samples together and cut them into, let's say, 50k-sample chunks.
3. By far the biggest: because the writes are slow, if the test is stopped (with Ctrl+C) or finishes naturally, waiting for those writes can take a considerable amount of time - in the above example the 4-minute tests generally took around 5 minutes :(

All of those can be better handled with some more sophisticated queueing code at a later time.

closes #1081, fixes #1100, fixes #182
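Once a build containing this change is available, the new behaviour should be tunable from the environment. The option names below (K6_INFLUXDB_CONCURRENT_WRITES and K6_INFLUXDB_PUSH_INTERVAL) are my assumption of how they are exposed; verify them against the PR before relying on them:

```sh
# Assumed option names - check #1113 for the actual ones.
export K6_INFLUXDB_CONCURRENT_WRITES=10   # parallel writes (commit default: 10)
export K6_INFLUXDB_PUSH_INTERVAL=1s       # how often a write is triggered
k6 run --out influxdb=http://localhost:8086/myk6db script.js
```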
na-- added this to the v0.26.0 milestone Aug 27, 2019
tkdeshpande (Author) commented:

@mstoykov: Thanks a lot for looking into it. I have tried for a while to find logs related to this, but couldn't find anything.
For now it very much looks like a data-loss issue, which makes the InfluxDB and Grafana integration unusable.

Thanks again for the support!

mstoykov (Contributor) commented:

Can you try to build from #1113 and test with it?
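In case it helps, a rough way to build a binary from that PR (this assumes the loadimpact/k6 repository and a Go toolchain with module support; the local branch name is arbitrary):

```sh
# Fetch the PR head into a local branch and build it.
git clone https://github.com/loadimpact/k6.git && cd k6
git fetch origin pull/1113/head:influxdb-output-fix
git checkout influxdb-output-fix
go build          # produces a ./k6 binary in the current directory
./k6 version
```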

mstoykov added a commit that referenced this issue Aug 29, 2019
tkdeshpande (Author) commented:

> Can you try to build from #1113 and test with it?

Sure @mstoykov! Thanks a lot for the help! I will update you soon.
