
Address high IOPs usage of the Gnocchi Ceph pool #1381

Conversation

@rafaelweingartner (Contributor) commented Mar 27, 2024

Before I describe the situation, let me put us all on the same page regarding the concepts we are dealing with. This patch is about the Ceph backend that can be used to store processed and raw measurements for Gnocchi. Ceph is a software-defined storage system which, when deployed, implements a concept called the Reliable Autonomic Distributed Object Store (RADOS). Do not confuse this RADOS with RadosGW, which is the sub-system that implements the S3 API on top of a Ceph backend. In Ceph we have RADOS objects, which are different from RadosGW objects (S3 objects). RADOS objects are the building blocks of any Ceph cluster, and how they are used depends on the system that consumes them. For instance, when using the RADOS Block Device (RBD), librbd and krbd use a 4 MiB RADOS object size by default. Each IOP reported by Ceph is one read or write operation on a RADOS object, and RADOS objects can be sized/used differently depending on the system that consumes Ceph.

Unlike systems that consume Ceph via a standard protocol such as RBD or CephFS (which mounts a Ceph pool as a POSIX file system), Gnocchi consumes Ceph natively; that is, Gnocchi interacts directly with the low-level RADOS objects. Every metric (either processed or raw) is stored in a single RADOS object; processed metrics are stored in different objects according to their time frames (time splits). Unlike other systems where there is a standard size for RADOS objects, Gnocchi handles each object in an isolated fashion. Therefore, some RADOS objects are bigger or smaller depending on the volume of data we have for the given metric and time frame.

Gnocchi uses LIBRADOS [1] to interact with a Ceph backend. When writing a raw metric, Gnocchi uses the method [2], which writes the whole dataset into a RADOS object. That write represents (is counted by Ceph as) one (1) IOP; it does not matter if the dataset is 1 KB, 1 MB, or 10 MB, it will be a single write operation. On the other hand, when reading, Gnocchi uses the method [3]; as one can see, the read operation does not read the complete object in a single operation. It reads the data in pieces, and the default chunk size is 8 KiB (8192 bytes). This can cause high read IOPs in certain cases, such as when we have raw metrics for a one-year back-window.
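To make the asymmetry concrete, here is a minimal sketch using the librados Python bindings (the pool name, object name, and helper function are illustrative; this is not Gnocchi's actual code). A 10 MiB object is written in one operation by write_full(), but reading it back with the default 8 KiB read length costs roughly 1280 operations:

    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("gnocchi")  # illustrative pool name

    payload = b"x" * (10 * 1024 * 1024)  # a 10 MiB blob of raw measures

    # Writing: write_full() stores the whole payload in a single
    # operation, no matter how large it is.
    ioctx.write_full("measure_example", payload)

    # Reading: read() returns at most `length` bytes per call (8192 by
    # default), so fetching the whole object back takes ~1280 calls,
    # each counted as one read IOP by Ceph.
    def read_whole_object(ioctx, name, chunk_size=8192):
        data, offset = b"", 0
        while True:
            chunk = ioctx.read(name, length=chunk_size, offset=offset)
            data += chunk
            offset += len(chunk)
            if len(chunk) < chunk_size:
                return data

    data = read_whole_object(ioctx, "measure_example")
    ioctx.close()
    cluster.shutdown()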

The proposal to address this situation is to add an adaptive read process for Gnocchi when it uses Ceph as a backend. That is, we store the size of the RADOS object for each metric, and then we use that size to configure the read buffer. This allows Gnocchi to reduce the number of read operations against the Ceph cluster.
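A minimal sketch of the adaptive idea, assuming a per-worker dict that remembers the size of the last write for each object name (the names below are illustrative, not necessarily the ones used in the patch):

    DEFAULT_RADOS_BUFFER_SIZE = 8192

    # Illustrative per-worker cache: object name -> size of the last write.
    OBJECT_SIZE_BY_NAME = {}

    def store_object(ioctx, name, data):
        # Still one write operation, whatever the payload size.
        ioctx.write_full(name, data)
        # Remember how big the object is so later reads can be sized to it.
        OBJECT_SIZE_BY_NAME[name] = len(data)

    def fetch_object(ioctx, name):
        # Use the remembered size as the read chunk length; a missing or
        # stale entry only means more (smaller) reads, never lost data,
        # because we keep reading until a short chunk is returned.
        chunk_size = OBJECT_SIZE_BY_NAME.get(name, DEFAULT_RADOS_BUFFER_SIZE)
        data, offset = b"", 0
        while True:
            chunk = ioctx.read(name, length=chunk_size, offset=offset)
            data += chunk
            offset += len(chunk)
            if len(chunk) < chunk_size:
                return data

With a warm cache, a metric whose raw-measures object is, say, 10 MiB is read back in one or two operations instead of ~1280.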

The following picture demonstrates the difference between the standard Gnocchi Ceph code and the proposed solution. Furthermore, in beige, there is an example of a further improvement, achieved by combining this code with some tuning, such as disabling the "greedy" option in Gnocchi and increasing the interval between MetricD processing runs from 60s to 300s.

[Screenshot from 2024-03-27 14-13-41]

The spikes shown in the picture, highlighted with a star, are a consequence of the code. That is, in the worst-case scenario, on the first run the system will not yet have "learned" the RADOS object size, so the read is not optimal. After the first round of processing, the system learns the pattern, and the reads are improved.

[1] https://docs.ceph.com/en/latest/rados/api/python/
[2] https://docs.ceph.com/en/latest/rados/api/python#rados.Ioctx.write_full
[3] https://docs.ceph.com/en/latest/rados/api/python/#rados.Ioctx.read

Commit: "Add Guto's suggestion."
Co-authored-by: Daniel Augusto Veronezi Salvador <[email protected]>
@rafaelweingartner (Contributor, Author) commented:

Hello @jd and @chungg, we have interesting new patches that might be worth taking a look at. This one, for instance, provides great benefits for people using Gnocchi with a Ceph backend.

@chungg (Member) left a comment:

thanks for the detailed context! i haven't touched ceph in years so i won't comment on whether this makes sense.

please correct me if i'm wrong but can you confirm that no matter what value we set read buffer size, it will read everything it needs? asking because i can see MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE getting out of sync across workers.

a potential concern may be that depending on the number of metrics assigned to a worker, the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE lookup may get large and consume (too much) memory? maybe it makes sense to only store metrics we know have objects larger than 8192?

one more thought, what happens if we increase read buffer globally? does hardcoding 16384+ buffer size make performance worse if the object is smaller?

@@ -88,6 +94,11 @@ def _store_metric_splits(self, metrics_keys_aggregations_data_offset,
    for key, agg, data, offset in keys_aggregations_data_offset:
        name = self._get_object_name(
            metric, key, agg.method, version)
        metric_size = len(data)

        MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE[name] = metric_size
@chungg (Member) commented on this diff:

probably not worth adding a check here but i think this is storing more than just unaggregated/raw measures object size.

@rafaelweingartner (Contributor, Author) replied:

Yes, @chungg, the mapping will not be synced across all the MetricD agents. However, that is not an issue: in the worst-case scenario we would just execute one extra read, as the mappings in the agents will not get out of sync by a huge factor.

The following picture shows the IOPs usage of a Gnocchi setup that has been running with this patch applied for months now. As you can see, there are more writes than reads, which is the opposite of what happened before this patch.
[Screenshot: IOPs usage with the patch applied]

Before this patch, this is the behavior we had:
[Screenshot: IOPs usage before the patch]

> a potential concern may be that depending on the number of metrics assigned to a worker, the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE lookup may get large and consume (too much) memory? maybe it makes sense to only store metrics we know have objects larger than 8192?

I agree; I will make this change.

> one more thought, what happens if we increase read buffer globally? does hardcoding 16384+ buffer size make performance worse if the object is smaller?

That is a good question; on the Ceph side, we could not reach a conclusion. Moreover, by putting this burden on the operator, they would just bump numbers without understanding what is going on in the system. That is why we decided to use a smarter approach and record the latest size that was written for a given RADOS object.

BTW, we did some analysis, and we found RADOS objects with sizes of 10M, 20M, and 40M. That is why a single global configuration would probably not help operators much.

> probably not worth adding a check here but i think this is storing more than just unaggregated/raw measures object size.

Yes, it is. This is on purpose. Depending on how you use Gnocchi (for instance, with CloudKitty and so on), you are constantly affecting the same split. That is why we also added this here.

@chungg (Member) left a comment:

thanks! this makes sense to me. will let more active members merge (or will merge if no one else does).

@rafaelweingartner (Contributor, Author) replied:

> thanks! this makes sense to me. will let more active members merge (or will merge if no one else does).

Awesome! Thanks for your review!

@@ -88,6 +94,13 @@ def _store_metric_splits(self, metrics_keys_aggregations_data_offset,
    for key, agg, data, offset in keys_aggregations_data_offset:
        name = self._get_object_name(
            metric, key, agg.method, version)
        metric_size = len(data)

        if metric_size > DEFAULT_RADOS_BUFFER_SIZE:
A Contributor commented on this diff:

Shouldn't we keep the old metric_size if it is greater than the new one? It could reduce some problems related to volatile object sizes (which increase and decrease constantly).

If the objects are constantly growing and never get smaller, maybe an approach like "if the new size is greater than the current buffer size, set the new buffer size to twice the new size" would reduce some unnecessary reads when the RADOS object always gets bigger.

It is just a suggestion; the overall code looks pretty good to me, good work.
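For illustration only, a minimal sketch of the suggested policy (never shrink the remembered size; when the object outgrows the current buffer, jump to twice the new size). The names are hypothetical and not part of the patch:

    DEFAULT_RADOS_BUFFER_SIZE = 8192
    OBJECT_SIZE_BY_NAME = {}  # hypothetical cache: object name -> buffer size

    def remember_size(name, new_size):
        current = OBJECT_SIZE_BY_NAME.get(name, DEFAULT_RADOS_BUFFER_SIZE)
        if new_size > current:
            # The object outgrew the buffer: double past the new size so
            # the cache does not need updating on every small increase.
            OBJECT_SIZE_BY_NAME[name] = 2 * new_size
        # Never shrink: a temporarily smaller write keeps the old buffer,
        # which avoids thrashing when object sizes are volatile.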

@rafaelweingartner (Contributor, Author) replied:

That is a good point. We have not seen the case of objects shrinking and never growing back. Normally, they grow up to a certain size, when the back-window is saturated, and they never go beyond that. That is why we are using the exact value of the length and not some other technique, such as using bigger numbers and so on.

I mean, once we reach the maximum RADOS object size according to the limit of the back-window, the object maintains that size, as the truncate is only executed when new datapoints are received. Therefore, a new datapoint comes in, and an old one is deleted.

@tobias-urdin (Contributor) left a comment:

LGTM

@tobias-urdin tobias-urdin merged commit 88ee87d into gnocchixyz:master May 31, 2024
23 checks passed
@rafaelweingartner (Contributor, Author) commented:

@tobias-urdin, thanks for the support here!
