Taxonomy of backup / redundancy / failsafe levels #527
**Use Cases for Redundancy and their probability**

**Grouping those use cases**
**Redundancy in OpenStack Resources**

**Replication of Volumes**

Replication of volumes can be achieved either with a backend that already provides replication, or by defining it in a volume type which then uses two different (instances of) backends. The replication provided here can vary from the data merely being replicated on a different disk/SSD in the same rack, up to mirroring to another rack and, depending on the physical infrastructure, even to being mirrored into a different fire zone. Nevertheless, volume replication should be seen as the simplest kind of redundancy (Level 1): the one used against simple hardware failures such as a disk or node failure.

**Replication of Objects (Object Storage)**

Nowadays most installations do not use Swift directly, which means dealing with Swift's internal replication (https://docs.openstack.org/swift/latest/overview_replication.html) is not needed. Instead many deployments use a different backend. This results in the replication mainly being used as the simplest kind of redundancy (Level 1). Due to the nature of objects, they can easily be replicated and stored by the users themselves, so a user can store their object data in different locations, which would result in a redundancy of Level 4.

**Replication of server data (VMs)**

Due to the nature of data in use and its constant change, it is not easy to provide replication on the IaaS layer. There are two different kinds of VMs that need to be considered:
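As an illustrative sketch of the volume-type-based replication mentioned above (not an excerpt from any standard): with the `openstacksdk` Python client, a CSP-defined volume type can carry the `replication_enabled` extra spec so that the Cinder scheduler only places such volumes on replication-capable backends. The cloud name and type name below are placeholders.

```python
# Minimal sketch, assuming an openstacksdk install and a clouds.yaml entry
# named "my-cloud". The type name "replicated" is a placeholder.
import openstack

conn = openstack.connect(cloud="my-cloud")

# Create a volume type whose extra spec restricts scheduling to backends
# that report replication support.
replicated_type = conn.block_storage.create_type(
    name="replicated",
    extra_specs={"replication_enabled": "<is> True"},
)

# Volumes of this type are replicated by the backend, which covers simple
# hardware failures (disk or node) -- Level 1 in the taxonomy above.
volume = conn.block_storage.create_volume(
    size=10,  # GiB
    name="app-data",
    volume_type=replicated_type.name,
)
```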
**Replication of Secrets**

The keys used for volume encryption and stored in Barbican are necessary to decrypt user data. The keys are stored encrypted in a simple database: a key is encrypted with a project key-encryption-key (project KEK), and that project KEK is in turn stored encrypted with a master KEK. The project KEKs are either stored in a database or in an HSM, and the master KEK is stored according to the Barbican plugin in use. The database is always deployed redundantly on different nodes and would survive a Level 1 and maybe a Level 2 failure. Barbican should always be deployed redundantly too, which makes the master KEK the only possible single point of failure. It depends on where the master KEK is stored (within each Barbican instance, or within a network HSM cluster that would be failure-safe for Level 1 and maybe Level 2): if it is stored in a single HSM, a failure of that HSM would render all encrypted data (that is not in use) impossible to access. The CSP could back up the master KEK either through the life-cycle tool or through a dedicated backup, which would help here.
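To make the KEK chain above concrete, here is an illustrative envelope-encryption sketch using Fernet from the `cryptography` package as a stand-in for Barbican's internal crypto; all names are ours and none of this is actual Barbican code.

```python
# Hypothetical sketch of the two-level KEK hierarchy described above.
from cryptography.fernet import Fernet

# The master KEK, stored according to the Barbican plugin (e.g. in an HSM).
master_kek = Fernet.generate_key()

# A per-project KEK, persisted only in wrapped (encrypted) form.
project_kek = Fernet.generate_key()
wrapped_project_kek = Fernet(master_kek).encrypt(project_kek)

# A user secret (e.g. a volume encryption key), encrypted with the
# project KEK before it is written to the database.
volume_key = b"the volume encryption key"
stored_secret = Fernet(project_kek).encrypt(volume_key)

# Decryption walks the chain back up: lose the master KEK and the wrapped
# project KEKs -- and with them all secrets -- become unrecoverable.
unwrapped_kek = Fernet(master_kek).decrypt(wrapped_project_kek)
recovered = Fernet(unwrapped_kek).decrypt(stored_secret)
assert recovered == volume_key
```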
@mbuechse I can go on and cover the other OpenStack resources, or you can.
@josephineSei Feel free to cover more. I mainly created the issue so we don't forget about the topic. I wanted to bring it up in the IaaS call to collect opinions, but probably not before next week. If you want, you can bring it up even this week.
IMO: This taxonomy influences several standards in IaaS and KaaS, such as availability zones and regions in OpenStack or base security features in Kubernetes. We should create a task force consisting of all stakeholders.
@josephineSei Is there any state-of-the-art taxonomy?
Outcome from brainstorming meeting 25.04.24:
Challenges regarding network configuration:
@josephineSei @markus-hentsch: Write the DR (decision record) for the taxonomy.
I don't think that is completely accurate. Volume backends are expected not to lie to applications about data persistence, which means that data written to a volume is at least as consistent as what a (cacheless) disk would hold after a power outage. That is a scenario all resilient applications (such as databases and filesystems) are developed against (as a "threat" model), so it should be handled reasonably well. In other words, if your SQL database server (such as PostgreSQL) returns successfully from a `COMMIT`, that data is expected to survive a power loss.

As for the taxonomy, I'm not aware of anything, but I'll ask around some more. I mostly deal with the development side of things in certifications (such as ISO 27001) and less with operations.
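A minimal sketch of the durability contract referenced above, assuming Python and POSIX semantics: an application treats data as persistent only once `fsync()` has returned, which is exactly the guarantee a volume backend must not fake.

```python
# Hypothetical illustration (not from this thread) of the write-durability
# contract that resilient applications build on: data counts as persistent
# only after fsync() has returned successfully.
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so it is expected to survive a sudden power loss."""
    with open(path, "wb") as f:
        f.write(data)         # data may still sit in userspace buffers
        f.flush()             # hand it over to the kernel page cache
        os.fsync(f.fileno())  # force it down to stable storage
    # For a freshly created file, the directory entry should be synced too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```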
I had a look through the ISO 27001 norm, but I did not find anything specific for the case we want to define, so I will begin work on the DR.
I also had a look around the net and was astonished to find almost nothing in terms of a standardized classification scheme in the context of data center infrastructure risks (as illustrated in #527 (comment)). A lot of the results are just disguised ads for consultancy services, and the few concrete documents that turn up often focus solely on security risks (example). It seems we indeed need to go forward with our own classification for now, until we discover something more suitable.
As @horazont suggested, I looked into some guidelines of the BSI. Within the document I found, the BSI mostly discusses what we would classify as Level 3 or 4: there is a lot of detailed description of power supply, cooling, and protection against natural catastrophes and human threats (i.e. cyber attacks). We (describing things from the point of view of user workloads) may have a broader view and would need a different way to classify levels of IaaS failure safety than these documents describe.
After disks and memory DIMMs, I have seen switch failures as a somewhat common failure case. |
I am reading through some more documents of the BSI:
What do we want to achieve?
I propose to use either the slot of the Standardization SIG in a week when it does not take place (this week or on 06.06.), or to use next Tuesday. I wrote to Kurt to ask him to post it to the ML.
The session for discussion will take place on 23.05. I wrote a mail to the ML.
During the breakout session we started by discussing what the main purpose of the taxonomy standard should be, in order to shape the discussion and further research on this topic accordingly. Possible purposes that we discussed:
Discussion result:
Do we have a taxonomy of failsafe levels?
For instance, the emerging standard on volume types refers to replication, and in this case it is mostly to protect against the failure of a storage device, so we specify neither the number of replicas nor whether they should span multiple zones, etc. In other cases, however, we might want to protect against power loss, fire or other risks.
So it would be interesting to define multiple levels of "failsafe" that may be applied to replication, backups and the like, and to establish handy nomenclature.
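One possible encoding of such levels, as a Python sketch: the mapping of numbers to failure scopes (disk/node, rack, fire zone, site) is our reading of the comments in this thread and would need to be fixed by the decision record.

```python
# Hypothetical failsafe levels as discussed in this issue; the exact
# definitions and names are placeholders, not a settled standard.
from enum import IntEnum

class FailsafeLevel(IntEnum):
    DISK_OR_NODE = 1  # a single disk or compute/storage node fails
    RACK = 2          # a whole rack (e.g. its switch or power feed) fails
    FIRE_ZONE = 3     # a fire zone / availability zone is lost
    SITE = 4          # an entire site or geographic region is lost
```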