New deployment and day-2-ops tooling software defined storage (ceph) - ADR #515
For the task "Gather information about reference setups from cloud providers (criteria, migration path)", these are questions to cloud providers and/or customers. We want to get a better understanding of which Ceph setups, deployed by OSISM, you are currently running. We hope that this input helps us decide how to move forward on a possible replacement of ceph-ansible in OSISM.
We are using Rook for all our new Ceph deployments; previously, we used Ceph-Chef. Deployment method: Rook is the natural choice for us because we are already running Kubernetes on bare metal for the YAOOK Operator. It integrates well with YAOOK because both of them are built on Kubernetes. In addition, we have had excellent experiences with the performance, maintainability and reliability of Rook.io clusters, in particular compared to our previous static deployment method (Ceph-Chef). All methods have their downsides, and so does the Rook method. In particular:
Version: With Rook, we are running 16.x with the plan to upgrade to 17.x soon-ish, though we are blocked there for non-Ceph and non-Rook reasons. Hardware: Varying and historically grown; I'd have to look that up. Hit me up via email if you need that information: mailto:[email protected]. Features: We use RBD exclusively with Rook so far (see above); we intend to enable S3 and Swift frontends once we have implemented support for that (currently, these needs are served by our old Ceph-Chef cluster). We use CephFS in non-bare-metal cases, too.
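For context, a Rook-based deployment like the one described above essentially comes down to applying a `CephCluster` custom resource once the Rook operator is installed. A minimal sketch (image tag, namespace, and sizing are illustrative, not taken from this deployment):

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph        # the namespace the Rook operator watches
spec:
  cephVersion:
    # Illustrative Pacific (16.x) image, matching the release mentioned above
    image: quay.io/ceph/ceph:v16.2.15
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                  # three monitors for quorum
  storage:
    useAllNodes: true         # let Rook discover OSD devices on every node
    useAllDevices: true
```

Day-2 operations (version upgrades, OSD replacement) are then driven by editing this resource and letting the operator reconcile, which is where much of the maintainability benefit mentioned above comes from.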
We use the Quincy release (17.2.6) provided by OSISM 6.0.2 everywhere.
We have a single small hyperconverged cluster for a specific customer workload. Otherwise we only use dedicated Ceph clusters.
We currently have a single cluster that provides HDD and NVMe SSD as RBDs for Cinder/Nova. In addition, we have a cluster that is used exclusively for RGW and is offered as a Swift and S3 endpoint (integrated with Keystone and Glance, in future also with Cinder for backups).
At the moment we deploy the control plane on the Ceph OSD nodes and do not have any dedicated nodes for the control plane. We also do not split the data plane and control plane on the network side. We currently have 2x 100G in the Ceph nodes there. The compute nodes have 2x 25G (will be 2x 100G in the future as well). Latencies between the nodes are approx. 0.05 ms (ICMP).
We have a separate pool for each OpenStack service (images, vms, volumes). We have several pools for Cinder so that we can partially separate customers.
We use the following services: osd, mon, mgr, rgw, crash. We would also like to take a look at mds in the future in order to be able to offer CephFS via Manila if necessary. We can share details about hardware and the configuration in full if required.
We do not optimize the systems directly with the Ceph-Ansible part of OSISM, but use the tuned, sysctl and network roles from OSISM for this.
We are satisfied with what we can currently do with OSISM. We would only need more functionality in day-2 operations in the future. We have also recently added the option of deploying Kubernetes directly on all nodes in OSISM. We are open to both Rook and cephadm. We are currently tending towards Rook, as we believe it is the more consistent step.
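The pool-per-service layout mentioned above (images, vms, volumes) reflects the usual OpenStack-on-Ceph integration. As a hedged sketch of how such pools are typically created by hand (pool names follow the comment; PG counts are illustrative, not this operator's actual values):

```shell
# Create one RBD pool per OpenStack service (PG counts are illustrative)
ceph osd pool create images 64
ceph osd pool create vms 128
ceph osd pool create volumes 256

# Initialize each pool for RBD use before Glance/Nova/Cinder touch it
rbd pool init images
rbd pool init vms
rbd pool init volumes
```

Separate Cinder pools for individual customers, as described above, would be created the same way, typically combined with per-pool cephx keys to enforce the separation.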
@flyersa Can you give feedback as well? I think it would be helpful.
Maybe we will switch to a hyperconverged setup (compute/storage/maybe network) and 25G interfaces in the future.
Which ceph release are you running?
What is the size of your ceph cluster?
Are Ceph workloads sharing the hardware with other workloads (hyperconverged)? If yes, why?
Are you running multiple pools or even multiple clusters? If yes, why?
Which ceph features/daemons are you using and how are they integrated into OpenStack and/or other services?
Wishlist:
Which hardware are you using (either sizing or specs)?
HDDs/SSDs/NVMEs(/Controllers)
Are you splitting "OSD setup" and "BlueStore WAL+DB"?
NICs/speed/latency
Are you splitting data plane and control plane?
Which Ceph config is deployed by OSISM?
Which Ceph config is deployed "unknown to" or "on top of" OSISM? E.g. special CRUSH maps, special configs.
Would it be nice to have more Ceph features deployable via OSISM?
What is your justified opinion on a new deployment method for Ceph (instead of ceph-ansible)?
Which ceph release are you running?
What is the size of your ceph cluster?
Are Ceph workloads sharing the hardware with other workloads (hyperconverged)? If yes, why?
Are you running multiple pools or even multiple clusters? If yes, why?
Which ceph features/daemons are you using and how are they integrated into OpenStack and/or other services?
Which hardware are you using (either sizing or specs)? Mainly HPE, such as Apollo 4200 or similar.
Are you splitting "OSD setup" and "BlueStore WAL+DB"? Of course.
NICs/speed/latency? 2x 40G or 4x 10G, depending on the scenario and the potential throughput.
Are you splitting data plane and control plane? No, monitors and mgr usually go on the storage nodes.
Which Ceph config is deployed by OSISM? None, we never deploy with OSISM and use cephadm. We have had customer faults based on user error damaging Ceph clusters in the past, so we focus on a strong separation of storage and OpenStack.
Would it be nice to have more Ceph features deployable via OSISM? For others, maybe. As I said, that does not belong in the same system with which I manage my compute resources, for various operational reasons.
What is your justified opinion on a new deployment method for Ceph (instead of ceph-ansible)? We should use what is used upstream; for Ceph that tool is now cephadm, so of course we should use it.
What about Rook? While Rook adds a lot with regard to fault tolerance and so on, it adds complexity too. I am not a huge fan of Rook. In a CSP environment you usually have dedicated servers for storage (if not HCI), so there is no need to add a k8s cluster on top of it...
Our current decision tracking is done here: https://input.scs.community/3aZ-xdnRS-y11lZkrtAvxw
Btw, another point for getting rid of ceph-ansible... Ever did an upgrade? In the time this crap takes just to upgrade a single monitor, I upgrade complete datacenters to a new Ceph version with cephadm...
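For comparison, the cephadm upgrade flow referred to here is a handful of orchestrator commands run against the cluster; cephadm then rolls through mons, mgrs and OSDs on its own (the image tag below is illustrative):

```shell
# Kick off a rolling upgrade of the whole cluster to a given release
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6

# Watch progress; cephadm upgrades the daemons in a safe order
ceph orch upgrade status

# Pause or abort the upgrade if something looks wrong
ceph orch upgrade pause
ceph orch upgrade stop
```

The upgrade proceeds daemon by daemon while the cluster stays online, which is the operational contrast with ceph-ansible being drawn in the comment above.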
As an SCS operator, I want a well-considered and justified decision on a reliable method to deploy and operate Ceph, replacing ceph-ansible.
Criteria:
Tasks (see decision tracking document for detailed status):
Definition of Done:
Decision tracking document