
Address high IOPs usage of the Gnocchi Ceph pool #1381

Conversation

@rafaelweingartner (Contributor) commented Mar 27, 2024

Before I describe the situation, let me put us all on the same page regarding the concepts we are dealing with. This patch is about the Ceph backend that can be used to store processed and raw measurements for Gnocchi. Ceph is a software-defined storage system which, when deployed, implements a concept called the Reliable Autonomic Distributed Object Store (RADOS). Do not confuse this RADOS with RadosGW, which is the sub-system that implements the S3 API on top of a Ceph backend. In Ceph we have RADOS objects, which are different from RadosGW objects (S3 objects). RADOS objects are the building blocks of any Ceph cluster, and how they are used depends on the system that consumes them. For instance, when using the RADOS Block Device (RBD), librbd and krbd use a 4 MiB RADOS object size by default. Each IOP reported by Ceph is one read or write operation on a RADOS object, and RADOS objects can be sized/used differently depending on the system that consumes Ceph.

Unlike systems that consume Ceph via a standard protocol such as RBD or CephFS (which mounts a Ceph pool as a POSIX file system), Gnocchi consumes Ceph natively; that is, Gnocchi interacts directly with the low-level RADOS objects. Every metric (either processed or raw) is stored in a single RADOS object; processed metrics are stored in different objects according to their time frames (time splits). Unlike other systems where there is a standard size for RADOS objects, Gnocchi handles each object in an isolated fashion. Therefore, some RADOS objects are bigger or smaller depending on the volume of data we have for the given metric and time frame.

Gnocchi uses LIBRADOS [1] to interact with a Ceph backend. When writing a raw metric, Gnocchi uses the method [2], which writes the whole dataset into a RADOS object. That write represents (is counted by Ceph as) one (1) IOP; it does not matter if the dataset is 1 KB, 1 MB, or 10 MB, it will be a single write operation. On the other hand, when reading, Gnocchi uses the method [3]; as one can see, the read operation does not read the complete object in a single operation. It reads the data in pieces, and the default chunk size is 8 KiB (8192 bytes). This can cause high read IOPs in certain cases, such as when we have raw metrics for a one-year back-window.
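To make the asymmetry concrete, here is a minimal sketch using the librados Python bindings (the pool name, object name, and helper function are illustrative; this is not Gnocchi's actual code). A 10 MiB object is written in one operation by write_full(), but reading it back with the default 8 KiB read length costs roughly 1280 operations:

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("gnocchi")  # illustrative pool name

    payload = b"x" * (10 * 1024 * 1024)  # a 10 MiB blob of raw measures

    # Writing: write_full() stores the whole payload in a single
    # operation, no matter how large it is.
    ioctx.write_full("measure_example", payload)

    # Reading: read() returns at most `length` bytes per call (8192 by
    # default), so fetching the whole object back takes ~1280 calls,
    # each counted as one read IOP by Ceph.
    def read_whole_object(ioctx, name, chunk_size=8192):
        data, offset = b"", 0
        while True:
            chunk = ioctx.read(name, length=chunk_size, offset=offset)
            data += chunk
            offset += len(chunk)
            if len(chunk) < chunk_size:
                return data

    data = read_whole_object(ioctx, "measure_example")
    ioctx.close()
    cluster.shutdown()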

The proposal to address this situation is to add an adaptive read process for Gnocchi when it uses Ceph as a backend. That is, we store the size of the RADOS object for each metric, and then we use that size to configure the read buffer. This allows Gnocchi to reduce the number of read operations against the Ceph cluster.
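A minimal sketch of the adaptive idea, assuming a per-worker dict that remembers the size of the last write for each object name (the names below are illustrative, not necessarily the ones used in the patch):

    DEFAULT_RADOS_BUFFER_SIZE = 8192

    # Illustrative per-worker cache: object name -> size of the last write.
    OBJECT_SIZE_BY_NAME = {}

    def store_object(ioctx, name, data):
        # Still one write operation, whatever the payload size.
        ioctx.write_full(name, data)
        # Remember how big the object is so later reads can be sized to it.
        OBJECT_SIZE_BY_NAME[name] = len(data)

    def fetch_object(ioctx, name):
        # Use the remembered size as the read chunk length; a missing or
        # stale entry only means more (smaller) reads, never lost data,
        # because we keep reading until a short chunk is returned.
        chunk_size = OBJECT_SIZE_BY_NAME.get(name, DEFAULT_RADOS_BUFFER_SIZE)
        data, offset = b"", 0
        while True:
            chunk = ioctx.read(name, length=chunk_size, offset=offset)
            data += chunk
            offset += len(chunk)
            if len(chunk) < chunk_size:
                return data

With a warm cache, a metric whose raw-measures object is, say, 10 MiB is read back in one or two operations instead of ~1280.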

The following picture demonstrates the difference between the standard Gnocchi Ceph code and the proposed solution. Furthermore, in beige, there is an example of a further improvement, achieved by combining this code with some tuning, such as disabling the "greedy" option in Gnocchi and increasing the interval between MetricD processing runs from 60s to 300s.

[Screenshot from 2024-03-27 14-13-41]

The spikes shown in the picture, highlighted with a star, are a consequence of the code. That is, in the worst-case scenario, on the first run the system will not yet have "learned" the RADOS object size, so the read is not optimal. After the first round of processing, the system learns the pattern, and the reads are improved.

[1] https://docs.ceph.com/en/latest/rados/api/python/
[2] https://docs.ceph.com/en/latest/rados/api/python#rados.Ioctx.write_full
[3] https://docs.ceph.com/en/latest/rados/api/python/#rados.Ioctx.read

Commit: "Add Guto's suggestion."
Co-authored-by: Daniel Augusto Veronezi Salvador <[email protected]>
@rafaelweingartner (Contributor, Author) commented:

Hello @jd and @chungg, we have interesting new patches that might be worth taking a look at. This one, for instance, provides great benefits for people using Gnocchi with a Ceph backend.

@chungg (Member) left a comment:

thanks for the detailed context! i haven't touched ceph in years so i won't comment on whether this makes sense.

please correct me if i'm wrong but can you confirm that no matter what value we set read buffer size, it will read everything it needs? asking because i can see MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE getting out of sync across workers.

a potential concern may be that depending on the number of metrics assigned to a worker, the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE lookup may get large and consume (too much) memory? maybe it makes sense to only store metrics we know have objects larger than 8192?

one more thought, what happens if we increase read buffer globally? does hardcoding 16384+ buffer size make performance worse if the object is smaller?

@@ -88,6 +94,11 @@ def _store_metric_splits(self, metrics_keys_aggregations_data_offset,
    for key, agg, data, offset in keys_aggregations_data_offset:
        name = self._get_object_name(
            metric, key, agg.method, version)
        metric_size = len(data)

        MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE[name] = metric_size
@chungg (Member) commented on this diff:

probably not worth adding a check here but i think this is storing more than just unaggregated/raw measures object size.

@rafaelweingartner (Contributor, Author) replied:

Yes, @chungg, the mapping will not be synced across all the MetricD agents. However, that is not an issue: in the worst-case scenario we would just execute one extra read, as the mappings in the agents will not get out of sync by a huge factor.

The following picture shows the IOPs usage of a Gnocchi setup that has been running with this patch applied for months now. As you can see, there are more writes than reads, which is the opposite of what happened before this patch.
[Screenshot: IOPs usage with the patch applied]

Before this patch, this is the behavior we had:
[Screenshot: IOPs usage before the patch]

> a potential concern may be that depending on the number of metrics assigned to a worker, the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE lookup may get large and consume (too much) memory? maybe it makes sense to only store metrics we know have objects larger than 8192?

I agree; I will make this change.

> one more thought, what happens if we increase read buffer globally? does hardcoding 16384+ buffer size make performance worse if the object is smaller?

That is a good question; on the Ceph side, we could not reach a conclusion. Moreover, by putting this burden on the operator, they would just bump numbers without understanding what is going on in the system. That is why we decided to use a smarter approach and record the latest size that was written for a given RADOS object.

BTW, we did some analysis, and we found RADOS objects with sizes of 10M, 20M, and 40M. That is why a single global configuration would probably not help operators much.

> probably not worth adding a check here but i think this is storing more than just unaggregated/raw measures object size.

Yes, it is. This is on purpose. Depending on how you use Gnocchi (for instance, with CloudKitty and so on), you are constantly affecting the same split. That is why we also added this here.

@chungg (Member) left a comment:

thanks! this makes sense to me. will let more active members merge (or will merge if no one else does).

@rafaelweingartner (Contributor, Author) replied:

> thanks! this makes sense to me. will let more active members merge (or will merge if no one else does).

Awesome! Thanks for your review!

@@ -88,6 +94,13 @@ def _store_metric_splits(self, metrics_keys_aggregations_data_offset,
    for key, agg, data, offset in keys_aggregations_data_offset:
        name = self._get_object_name(
            metric, key, agg.method, version)
        metric_size = len(data)

        if metric_size > DEFAULT_RADOS_BUFFER_SIZE:
A Contributor commented on this diff:

Shouldn't we keep the old metric_size if it is greater than the new one? It could reduce some problems related to volatile object sizes (which increase and decrease constantly).

If the objects are constantly growing and never get smaller, maybe an approach like "if the new size is greater than the current buffer size, set the new buffer size to twice the new size" would reduce some unnecessary reads when the RADOS object always gets bigger.

It is just a suggestion; the overall code looks pretty good to me, good work.
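For illustration only, a minimal sketch of the suggested policy (never shrink the remembered size; when the object outgrows the current buffer, jump to twice the new size). The names are hypothetical and not part of the patch:

    DEFAULT_RADOS_BUFFER_SIZE = 8192
    OBJECT_SIZE_BY_NAME = {}  # hypothetical cache: object name -> buffer size

    def remember_size(name, new_size):
        current = OBJECT_SIZE_BY_NAME.get(name, DEFAULT_RADOS_BUFFER_SIZE)
        if new_size > current:
            # The object outgrew the buffer: double past the new size so
            # the cache does not need updating on every small increase.
            OBJECT_SIZE_BY_NAME[name] = 2 * new_size
        # Never shrink: a temporarily smaller write keeps the old buffer,
        # which avoids thrashing when object sizes are volatile.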

@rafaelweingartner (Contributor, Author) replied:

That is a good point. We have not seen the case of objects shrinking and never growing back. Normally, they grow up to a certain size, when the back-window is saturated, and they never go beyond that. That is why we are using the exact value of the length and not some other technique, such as using bigger numbers and so on.

I mean, once we reach the maximum RADOS object size according to the limit of the back-window, the object maintains that size, as the truncate is only executed when new datapoints are received. Therefore, a new datapoint comes in, and an old one is deleted.

@tobias-urdin (Contributor) left a comment:

LGTM

@tobias-urdin tobias-urdin merged commit 88ee87d into gnocchixyz:master May 31, 2024
23 checks passed
@rafaelweingartner (Contributor, Author) commented:

@tobias-urdin, thanks for the support here!
