Address high IOPs usage of the Gnocchi Ceph pool #1381
Conversation
Before I describe the situation, let me put us all on the same page regarding the concepts we are dealing with. This patch is about the Ceph backend that can be used to store processed and raw measurements for Gnocchi. Ceph is a software-defined storage system which, when deployed, implements a concept called Reliable Autonomic Distributed Object Store (RADOS). Do not confuse this RADOS with RadosGW, which is the sub-system that implements the S3 API on top of a Ceph backend. In Ceph we have RADOS objects, which are different from RadosGW (S3) objects. RADOS objects are the building blocks of any Ceph cluster, and how they are used depends on the system that consumes them. For instance, when using RADOS Block Device (RBD), librbd and krbd use 4 MiB RADOS objects by default. Each IOP reported by Ceph is a read or write operation on a RADOS object.

Unlike systems that consume Ceph via a standard protocol such as RBD or CephFS (which mounts a Ceph pool as a POSIX file system), Gnocchi consumes Ceph natively; that is, Gnocchi interacts directly with the low-level RADOS objects. Every metric (either processed or raw) is stored in a single RADOS object; processed metrics are stored in different objects according to their time frames (time splits). Unlike systems where there is a standard size for the RADOS objects, Gnocchi handles each object in an isolated fashion. Therefore, the RADOS objects are bigger or smaller depending on the volume of data we have for the given metric and time frame.

Gnocchi uses librados [1] to interact with a Ceph backend. When writing a raw metric, Gnocchi uses the write_full method [2], which writes the whole dataset into a RADOS object. That write represents (is counted by Ceph as) one (1) IOP; it does not matter if the dataset is 1 KiB, 1 MiB, or 10 MiB, it will be a single write operation. On the other hand, when reading, Gnocchi uses the read method [3]; as one can see, the read operation does not fetch the complete object in a single call. It reads the data in pieces, and the default chunk size is 8 KiB (8192 bytes). This can cause high read IOPs in certain cases, such as when we have raw metrics for a one-year back-window.

The proposal to address this situation is to add an adaptive read process to Gnocchi when it uses Ceph as a backend. That is, we store the size of the RADOS object for each metric, and then we use that size to configure the read buffer. This makes Gnocchi reduce the number of read operations in the Ceph cluster.

The following picture demonstrates the difference between the standard Gnocchi Ceph code and the proposed solution. Furthermore, in beige, there is an example of a further improvement, achieved with this code plus some tuning, such as disabling the "greedy" option in Gnocchi and increasing the interval between MetricD processing runs from 60s to 300s.

The spikes shown in the picture, highlighted with a star, are a consequence of the code. In the worst-case scenario, on the first run, the system has not yet "learned" the RADOS object size, so the read is not optimal. After the first round of processing, the system learns the pattern and the reads improve.

[1] https://docs.ceph.com/en/latest/rados/api/python/
[2] https://docs.ceph.com/en/latest/rados/api/python#rados.Ioctx.write_full
[3] https://docs.ceph.com/en/latest/rados/api/python/#rados.Ioctx.read
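To make the read path concrete, here is a minimal sketch of the idea, assuming a connected rados.Ioctx; the helper name and loop are illustrative, not the exact driver code, but they reuse the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE and DEFAULT_RADOS_BUFFER_SIZE names introduced by this patch:

```python
# Minimal sketch, not the exact Gnocchi driver code.
DEFAULT_RADOS_BUFFER_SIZE = 8192
MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE = {}


def _read_object_content(ioctx, name):
    # Use the last size we wrote for this RADOS object, falling back to the
    # librados default read length of 8192 bytes when we have not seen it yet.
    buffer_size = MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE.get(
        name, DEFAULT_RADOS_BUFFER_SIZE)
    content = b""
    offset = 0
    while True:
        # Each rados.Ioctx.read() call is one read IOP on the cluster, so a
        # buffer sized to the whole object fetches it in one pass instead of
        # object_size / 8192 passes.
        chunk = ioctx.read(name, length=buffer_size, offset=offset)
        if not chunk:
            break
        content += chunk
        offset += len(chunk)
    return content
```

On the write side, write_full() already stores the whole dataset in one operation, so the map only needs to be updated with len(data) at write time, which is what the diff below does.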
Add Guto's suggestion. Co-authored-by: Daniel Augusto Veronezi Salvador <[email protected]>
thanks for the detailed context! i haven't touched ceph in years so i won't comment on whether this makes sense.
please correct me if i'm wrong, but can you confirm that no matter what value we set the read buffer size to, it will still read everything it needs? asking because i can see MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE getting out of sync across workers.
a potential concern may be that depending on the number of metrics assigned to a worker, the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE lookup may get large and consume (too much) memory? maybe it makes sense to only store metrics we know have objects larger than 8192?
one more thought, what happens if we increase read buffer globally? does hardcoding 16384+ buffer size make performance worse if the object is smaller?
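As a rough, hedged answer to the memory question above (the object-name layout below is a guess, not the real Gnocchi naming scheme):

```python
import sys

# Back-of-the-envelope estimate of MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE growth,
# assuming ~65-character object names (hypothetical layout) and integer sizes.
sample_name = "gnocchi_" + "0" * 36 + "_1451520000.0_mean_v3"
entry_bytes = sys.getsizeof(sample_name) + sys.getsizeof(10 * 1024 ** 2)
for n_objects in (10_000, 100_000, 1_000_000):
    print("%8d entries ~= %.0f MiB, plus dict overhead"
          % (n_objects, n_objects * entry_bytes / 1024 ** 2))
```

Even a million entries per worker stays in the low hundreds of MiB under these assumptions, and restricting the map to objects larger than 8192 bytes, as suggested, would shrink it further.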
gnocchi/storage/ceph.py (Outdated)
@@ -88,6 +94,11 @@ def _store_metric_splits(self, metrics_keys_aggregations_data_offset,
        for key, agg, data, offset in keys_aggregations_data_offset:
            name = self._get_object_name(
                metric, key, agg.method, version)
            metric_size = len(data)

            MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE[name] = metric_size
probably not worth adding a check here but i think this is storing more than just unaggregated/raw measures object size.
Yes, @chungg, the mapping will not be synced across all the MetricD agents. However, that is not an issue. In the worst-case scenario we would just execute one extra read, as the mappings in the agents will not get out of sync by a huge factor.
The following picture shows the IOPS usage of a Gnocchi setup that has been running for months now with this patch applied. As you can see, there are more writes than reads, which is the opposite of what happened before this patch.
Before this patch, this was the behavior we had:
a potential concern may be that depending on the number of metrics assigned to a worker, the MAP_UNAGGREGATED_METRIC_NAME_BY_SIZE lookup may get large and consume (too much) memory? maybe it makes sense to only store metrics we know have objects larger than 8192?
I agree, I will make this change.
one more thought, what happens if we increase read buffer globally? does hardcoding 16384+ buffer size make performance worse if the object is smaller?
That is a good question. On the Ceph side, we could not reach a conclusion. Moreover, by putting this burden on the operator, they would just bump numbers without understanding what is going on in the system. That is why we decided to use a smarter approach and record the latest size that was written to a given RADOS object.
BTW, we did some analysis and found RADOS objects with sizes equal to 10M, 20M, and 40M. That is why a single global configuration would probably not help operators much.
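For illustration (reading 10M, 20M, and 40M as MiB), with the default 8192-byte read length those object sizes translate into the following number of read operations per fetch:

```python
DEFAULT_RADOS_BUFFER_SIZE = 8192

for mib in (10, 20, 40):
    object_size = mib * 1024 ** 2
    chunked_reads = -(-object_size // DEFAULT_RADOS_BUFFER_SIZE)  # ceiling division
    print("%2d MiB object: %4d reads at 8 KiB vs ~1 read with an adaptive buffer"
          % (mib, chunked_reads))
# 10 MiB -> 1280 reads, 20 MiB -> 2560 reads, 40 MiB -> 5120 reads
```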
probably not worth adding a check here but i think this is storing more than just unaggregated/raw measures object size.
Yes, it is. This is on purpose. Depending on how you consume Gnocchi (for instance, with CloudKitty and so on), you are constantly touching the same split. That is why we also added this here.
thanks! this makes sense to me. will let more active members merge (or will merge if no one else does).
Awesome! Thanks for your review!
@@ -88,6 +94,13 @@ def _store_metric_splits(self, metrics_keys_aggregations_data_offset,
        for key, agg, data, offset in keys_aggregations_data_offset:
            name = self._get_object_name(
                metric, key, agg.method, version)
            metric_size = len(data)

            if metric_size > DEFAULT_RADOS_BUFFER_SIZE:
shouldn't we keep the old metric_size if it is greater than the new one? it could reduce some problems related to volatile object sizes (which increase and decrease constantly).
If the objects are constantly growing and never get smaller, maybe using an approach like "if the new size is greater than the current buffer size, set the new buffer size to two times the new size" would reduce some unnecessary reads when the RADOS object always gets bigger.
It is just a suggestion, the overall code seems pretty good to me, good work.
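To illustrate the suggestion above, a hypothetical helper (not part of this PR) could apply that policy like this:

```python
def update_remembered_size(current_size, new_object_size, growth_factor=1):
    # Never shrink the remembered size; this smooths out volatile objects
    # whose size increases and decreases constantly.
    if new_object_size <= current_size:
        return current_size
    # growth_factor=2 would implement the "two times the new size" idea for
    # objects that only ever grow, so the buffer stays ahead of the growth.
    return new_object_size * growth_factor
```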
That is a good point. We have not seen the case of objects reducing in size and never growing back. Normally, they grow up to a certain size, once the back-window is saturated, and they never go beyond that. That is why we are using the exact value of the length and not some other technique such as using bigger numbers and so on.
I mean, once we reach the maximum RADOS object size allowed by the back-window limit, the object maintains that size, as the truncate is only executed when new datapoints are received. Therefore, one new datapoint comes in and one old datapoint is deleted.
LGTM
@tobias-urdin, thanks for the support here!