Taxonomy of backup / redundancy / failsafe levels #527
**Use Cases for Redundancy and their probability**

**Grouping those use cases**
**Redundancy in OpenStack Resources**

**Replication of Volumes**

Replication of volumes can be achieved either with a backend that already provides replication, or by defining it in a volume type which then uses two different (instances of) backends. The replication provided here can vary from the data merely being replicated on a different disk/SSD in the same rack, up to mirroring to another rack and, depending on the physical infrastructure, even to being mirrored into a different fire zone. Nevertheless, volume replication should be seen as the simplest kind of redundancy (Level 1): the one used against simple hardware failures such as a disk or node failure.

**Replication of Objects (Object Storage)**

Nowadays most installations do not use Swift directly, which means dealing with Swift's internal replication (https://docs.openstack.org/swift/latest/overview_replication.html) is not needed. Instead many deployments use a different backend. This results in the replication mainly being used as the simplest kind of redundancy (Level 1). Due to the nature of objects, they can easily be replicated and stored by the users themselves, so a user can store their object data in different locations, which would result in a redundancy of Level 4.

**Replication of server data (VMs)**

Due to the nature of data in use and its constant change, it is not easy to provide replication on the IaaS layer. There are two different kinds of VMs that need to be considered:
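As an illustrative sketch of the volume-type-based replication mentioned above (not an excerpt from any standard): with the `openstacksdk` Python client, a CSP-defined volume type can carry the `replication_enabled` extra spec so that the Cinder scheduler only places such volumes on replication-capable backends. The cloud name and type name below are placeholders.

```python
# Minimal sketch, assuming an openstacksdk install and a clouds.yaml entry
# named "my-cloud". The type name "replicated" is a placeholder.
import openstack

conn = openstack.connect(cloud="my-cloud")

# Create a volume type whose extra spec restricts scheduling to backends
# that report replication support.
replicated_type = conn.block_storage.create_type(
    name="replicated",
    extra_specs={"replication_enabled": "<is> True"},
)

# Volumes of this type are replicated by the backend, which covers simple
# hardware failures (disk or node) -- Level 1 in the taxonomy above.
volume = conn.block_storage.create_volume(
    size=10,  # GiB
    name="app-data",
    volume_type=replicated_type.name,
)
```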
**Replication of Secrets**

The keys used for volume encryption and stored in Barbican are necessary to decrypt user data. The keys are stored encrypted in a simple database: a key is encrypted with a project key-encryption-key (project KEK), and that project KEK is in turn stored encrypted with a master KEK. The project KEKs are either stored in a database or in an HSM, and the master KEK is stored according to the Barbican plugin in use. The database is always deployed redundantly on different nodes and would survive a Level 1 and maybe a Level 2 failure. Barbican should always be deployed redundantly too, which makes the master KEK the only possible single point of failure. It depends on where the master KEK is stored (within each Barbican instance, or within a network HSM cluster that would be failure-safe for Level 1 and maybe Level 2): if it is stored in a single HSM, a failure of that HSM would render all encrypted data (that is not in use) impossible to access. The CSP could back up the master KEK either through the life-cycle tool or through a dedicated backup, which would help here.
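To make the KEK chain above concrete, here is an illustrative envelope-encryption sketch using Fernet from the `cryptography` package as a stand-in for Barbican's internal crypto; all names are ours and none of this is actual Barbican code.

```python
# Hypothetical sketch of the two-level KEK hierarchy described above.
from cryptography.fernet import Fernet

# The master KEK, stored according to the Barbican plugin (e.g. in an HSM).
master_kek = Fernet.generate_key()

# A per-project KEK, persisted only in wrapped (encrypted) form.
project_kek = Fernet.generate_key()
wrapped_project_kek = Fernet(master_kek).encrypt(project_kek)

# A user secret (e.g. a volume encryption key), encrypted with the
# project KEK before it is written to the database.
volume_key = b"the volume encryption key"
stored_secret = Fernet(project_kek).encrypt(volume_key)

# Decryption walks the chain back up: lose the master KEK and the wrapped
# project KEKs -- and with them all secrets -- become unrecoverable.
unwrapped_kek = Fernet(master_kek).decrypt(wrapped_project_kek)
recovered = Fernet(unwrapped_kek).decrypt(stored_secret)
assert recovered == volume_key
```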
@mbuechse I can go on and cover the other OpenStack resources, or you can.
@josephineSei Feel free to cover more. I mainly created the issue so we don't forget about the topic. I wanted to bring it up in the IaaS call to collect opinions, but probably not before next week. If you want, you can bring it up even this week.
IMO: This taxonomy influences several standards in IaaS and KaaS, such as availability zones and regions in OpenStack or base security features in Kubernetes. We should create a task force consisting of all stakeholders.
@josephineSei Is there any state-of-the-art taxonomy?
Outcome from brainstorming meeting 25.04.24:
Challenges regarding network configuration:
@josephineSei @markus-hentsch: Write the DR (decision record) for the taxonomy.
I don't think that is completely accurate. Volume backends are expected not to lie to applications about data persistence, which means that data written to a volume is at least as consistent as what a (cacheless) disk would hold after a power outage. That is a scenario all resilient applications (such as databases and filesystems) are developed against (as a "threat" model), so it should be handled reasonably well. In other words, if your SQL database server (such as PostgreSQL) returns successfully from a `COMMIT`, that data is expected to survive a power loss.

As for the taxonomy, I'm not aware of anything, but I'll ask around some more. I mostly deal with the development side of things in certifications (such as ISO 27001) and less with operations.
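A minimal sketch of the durability contract referenced above, assuming Python and POSIX semantics: an application treats data as persistent only once `fsync()` has returned, which is exactly the guarantee a volume backend must not fake.

```python
# Hypothetical illustration (not from this thread) of the write-durability
# contract that resilient applications build on: data counts as persistent
# only after fsync() has returned successfully.
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so it is expected to survive a sudden power loss."""
    with open(path, "wb") as f:
        f.write(data)         # data may still sit in userspace buffers
        f.flush()             # hand it over to the kernel page cache
        os.fsync(f.fileno())  # force it down to stable storage
    # For a freshly created file, the directory entry should be synced too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```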
I had a look through the ISO 27001 norm, but I did not find anything specific for the case we want to define, so I will begin work on the DR.
I also had a look around the net and was astonished to find almost nothing in terms of a standardized classification scheme in the context of data center infrastructure risks (as illustrated in #527 (comment)). A lot of the results are just disguised ads for consultancy services, and the few concrete documents that turn up often focus solely on security risks (example). It seems we indeed need to go forward with our own classification for now, until we discover something more suitable.
As @horazont suggested, I looked into some guidelines of the BSI. Within the document I found, the BSI mostly discusses what we would classify as Level 3 or 4: there is a lot of detailed description of power supply, cooling, and protection against natural catastrophes and human threats (i.e. cyber attacks). We (describing things from the point of view of user workloads) may have a broader view and would need a different way to classify levels of IaaS failure safety than these documents describe.
After disks and memory DIMMs, I have seen switch failures as a somewhat common failure case. |
I am reading through some more documents of the BSI:
What do we want to achieve?
I propose to use either the slot of the Standardization SIG in a week when it does not take place (this week or on 06.06.), or to use next Tuesday. I wrote to Kurt to ask him to post it to the ML.
The session for discussion will take place on 23.05. I wrote a mail to the ML.
During the breakout session we started by discussing what the main purpose of the taxonomy standard should be, in order to shape the discussion and further research on this topic accordingly. Possible purposes that we discussed:
Discussion result:
Do we have a taxonomy of failsafe levels?
For instance, the emerging standard on volume types refers to replication, and in this case it is mostly to protect against the failure of a storage device, so we specify neither the number of replicas nor whether they should span multiple zones, etc. In other cases, however, we might want to protect against power loss, fire or other risks.
So it would be interesting to define multiple levels of "failsafe" that may be applied to replication, backups and the like, and to establish handy nomenclature.
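One possible encoding of such levels, as a Python sketch: the mapping of numbers to failure scopes (disk/node, rack, fire zone, site) is our reading of the comments in this thread and would need to be fixed by the decision record.

```python
# Hypothetical failsafe levels as discussed in this issue; the exact
# definitions and names are placeholders, not a settled standard.
from enum import IntEnum

class FailsafeLevel(IntEnum):
    DISK_OR_NODE = 1  # a single disk or compute/storage node fails
    RACK = 2          # a whole rack (e.g. its switch or power feed) fails
    FIRE_ZONE = 3     # a fire zone / availability zone is lost
    SITE = 4          # an entire site or geographic region is lost
```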