Taxonomy of failsafe levels #579

josephineSei · 2024-04-26T13:34:26Z

closes #527

Signed-off-by: josephineSei <[email protected]>

markus-hentsch

Good write-up. I added some spelling, phrasing and terminology adjustment suggestions.

anjastrunk

IMO, this DR is not in a final state yet, and we should go for another round of discussion.

anjastrunk · 2024-04-29T12:37:03Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources.
+This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels.


Some glue text is missing here. I would rephrase the sentences as follows:

Suggested change

-Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources.
-This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels.
+Some standards provided by the SCS project will talk about or require procedures to backup resources or have redundancy for resources. As these terms are neither officially defined nor intuitive, this decision record tries to get some clarity in this topic. It discuss, which failure threats are cloud service provider facing and classifies them into several levels.

anjastrunk · 2024-04-29T12:38:54Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+
+## Decision
+
+First there needs to be an overview about possible failure cases in infrastructures:


Suggested change

-First there needs to be an overview about possible failure cases in infrastructures:
+First there needs to be an overview about possible failure cases in infrastructures as well as their probability of entry and the damage they may cause.

anjastrunk · 2024-04-29T12:40:33Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+
+| Failure Case | Probability | Consequences |
+|----|-----|----|
+| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |


In favor of simplicity, I would assume that a disk loss/failure causes permanent loss of the data on this disk.

Suggested change

-| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
+| Disk Failure/Loss | High | Permanent data loss in on this disk. Impact depends on type of lost data (data base, user data) |

anjastrunk · 2024-04-29T12:49:07Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+| Failure Case | Probability | Consequences |
+|----|-----|----|
+| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
+| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node)  |


I would prefer to distinguish between Node Failure/Loss, meaning the hardware is irrecoverably damaged, and Node Outage, caused by an electricity outage, as the two cases have different implications. Furthermore, we should define a node as computation hardware without disks. This makes the cases easier to classify.

Suggested change

-| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
+| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) |
+| Node Outage | Medium to High | Temporary loss of functionality and connectivity of node (impact depends on type of node) |

anjastrunk · 2024-04-29T12:52:29Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+|----|-----|----|
+| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
+| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node)  |
+| Rack Outage | Medium | similar to Disk Failure and Node Outage |


anjastrunk · 2024-04-29T12:55:57Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
+| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node)  |
+| Rack Outage | Medium | similar to Disk Failure and Node Outage |
+| Power Outage (Data Center supply)  | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node)  |


As I said, I would omit "data loss" and focus on the major consequence. Most protocols work with acknowledgments, so we can assume that data loss is only temporary. What we really lose is the data in CPU and RAM, but we should omit these consequences, as we cannot prevent or avoid them.

Suggested change

-| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |
+| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in rack (impact depends on type of node) |
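The distinction drawn in the suggestions above (permanent failure/loss versus temporary outage) can be sketched as a small data model. This is only an illustration under assumptions: the names `Duration`, `FailureCase`, `FAILURE_CASES` and `permanent_cases` are hypothetical and not part of the decision record, and only rows whose suggested wording states the duration explicitly are included.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    """Whether the consequence of a failure case is temporary or permanent."""
    TEMPORARY = "temporary"
    PERMANENT = "permanent"


@dataclass(frozen=True)
class FailureCase:
    """One row of the failure case table (hypothetical model)."""
    name: str
    probability: str  # free text as used in the table, e.g. "Medium to High"
    duration: Duration
    consequence: str


# Rows as worded in the suggested changes of this review thread.
FAILURE_CASES = [
    FailureCase("Disk Failure/Loss", "High", Duration.PERMANENT,
                "Permanent data loss on this disk"),
    FailureCase("Node Failure/Loss (without disks)", "Medium to High",
                Duration.PERMANENT,
                "Permanent loss of functionality and connectivity of node"),
    FailureCase("Node Outage", "Medium to High", Duration.TEMPORARY,
                "Temporary loss of functionality and connectivity of node"),
    FailureCase("Power Outage (Data Center supply)", "Medium",
                Duration.TEMPORARY,
                "Temporary outage of all nodes in rack"),
]


def permanent_cases(cases):
    """Return the names of failure cases whose consequences are permanent."""
    return [c.name for c in cases if c.duration is Duration.PERMANENT]


if __name__ == "__main__":
    print(permanent_cases(FAILURE_CASES))
```

Keeping the duration as an explicit field mirrors the reviewer's point that failure and outage have different implications, even when their probability is the same.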

josephineSei · 2024-05-24T09:04:16Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+
+:::
+
+| Level/Class | Probability | Failure Causes | loss in IaaS | User Hints |


I am still thinking about the "user hints" column. Putting it next to the other columns is good from some perspectives, as the table can then be read as: I want to achieve failsafe level 2, which is triggered by these failure causes that result in these losses on the IaaS level, so I can do what is shown in the user hints.
But we wanted the classification mainly as a definition for standards, not as examples for users, so maybe we should not reference those standards here.
We could instead use an extra table with example actions (standards, "user has to do things", ...) for each level/class, or maybe this should not be in a decision record at all, but rather in a guide or similar.

I agree with you, @josephineSei. Linking SCS standards in this table may cause a huge synchronization effort: we would have to update this DR whenever a referenced standard changes. I appreciate the user hints column, but would limit it to a textual explanation. See my suggestions below...

anjastrunk · 2024-05-27T05:44:12Z

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md

+| (Volume) Snapshot  | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes.                             |
+| Volume Type        | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted.           |
+| (Barbican) Secret  | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service.                                     |
+| Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database.                                              |


Is a key encryption key really an IaaS resource? I thought key encryption keys are stored in configuration files, and if this is the case, it is a configuration setting.

The glossary is meant to describe all terms that might be unknown. As this standard also concerns key encryption keys, they should be listed in the glossary.

anjastrunk

Suggested some minor changes, but do not want to block PR.

josephineSei · 2024-09-10T08:52:44Z

Hey @jschoone or @tonifinger or @DEiselt, could one of you help with the KaaS part of this document (or do you know someone who can?) by the end of the week?

DEiselt · 2024-09-10T08:59:46Z

> Hey @jschoone or @tonifinger or @DEiselt, could one of you help with the KaaS part of this document (or do you know someone who can?) by the end of the week?

Hi Josephine, Jan and I will meet today to discuss this and provide feedback :)

anjastrunk · 2024-09-16T08:15:12Z

@mbuechse @kgube @michal-gubricky @jschoone: Please review, request changes and/or approve. We want to merge this PR as soon as possible.

josephineSei · 2024-09-25T07:31:40Z

@jschoone @bitkeks @DEiselt @michal-gubricky If you are okay with this decision record as a whole, please approve this PR; otherwise, please tell me what you think is still missing.

kgube

Some small fixes and suggestions

bitkeks

LGTM, left one comment regarding the usage of RTO.

Good approach of defining the levels, then the potential influences and the many components that are affected. Mapping this all to the SCS technology parts will help CSPs and users to identify those parts of their instance they need to have a look at more specifically.

Of course, this document will probably evolve with more empirical data. You could, for example, discuss whether river floods in Germany really have a "very low" probability 😃

Very important also to highlight the need for a CSP-internal risk analysis. There cannot be a generalized approach that applies everywhere!

kgube

LGTM

josephineSei added 3 commits April 26, 2024 15:31

[DRAFT] Create scs-XXXX-vN-taxonomy-of-failsafe-levels.md

88ede08

Signed-off-by: josephineSei <[email protected]>

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

5e0742d

Signed-off-by: josephineSei <[email protected]>

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

9a1c2cd

Signed-off-by: josephineSei <[email protected]>

markus-hentsch reviewed Apr 26, 2024

josephineSei and others added 2 commits April 29, 2024 09:51

Apply suggestions from code review

e0c87bf

Co-authored-by: Markus Hentsch <[email protected]> Signed-off-by: josephineSei <[email protected]>

edit more wording

41a75a2

Signed-off-by: josephineSei <[email protected]>

josephineSei requested review from mbuechse, kgube, cah-hbaum and anjastrunk April 29, 2024 07:55

anjastrunk reviewed Apr 29, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed Apr 29, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

change gloassary section to table

020bf8b

Signed-off-by: josephineSei <[email protected]>

berendt reviewed Apr 29, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk requested changes Apr 29, 2024

josephineSei added 3 commits May 2, 2024 13:33

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

f0f75cb

Signed-off-by: josephineSei <[email protected]>

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

d475eb1

Signed-off-by: josephineSei <[email protected]>

editing table of classifictaion, as we discussed in the meeting

04be929

Signed-off-by: josephineSei <[email protected]>

josephineSei commented May 24, 2024

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk reviewed May 27, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

anjastrunk approved these changes May 27, 2024

josephineSei and others added 2 commits May 28, 2024 08:28

Apply suggestions from code review

367d992

Co-authored-by: anjastrunk <[email protected]> Signed-off-by: josephineSei <[email protected]>

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

b190440

Signed-off-by: josephineSei <[email protected]>

josephineSei added 2 commits August 26, 2024 10:14

Update and rename scs-XXXX-v1-example-impacts-of-failure-scenarios.md…

1f3de87

… to scs-XXXX-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <[email protected]>

Update scs-XXXX-w1-example-impacts-of-failure-scenarios.md

2a492f8

Signed-off-by: josephineSei <[email protected]>

josephineSei requested review from jschoone and DEiselt August 30, 2024 11:10

michal-gubricky self-requested a review September 4, 2024 11:46

martinmo removed their assignment Sep 5, 2024

michal-gubricky reviewed Sep 6, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

Apply suggestions from code review

53c6521

Co-authored-by: Michal Gubricky <[email protected]> Signed-off-by: josephineSei <[email protected]>

michal-gubricky reviewed Sep 9, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

Merge branch 'main' into taxonomy-of-failsafe-levels

90e311d

jschoone added 2 commits September 10, 2024 11:55

fix(kaas): use PV instead of PVC as this is actually the Volume

0e39254

Signed-off-by: Jan Schoone <[email protected]>

feat(kaas): first proposal for levels on kaas layer

2a52226

Signed-off-by: Jan Schoone <[email protected]>

anjastrunk self-requested a review September 16, 2024 08:13

Merge branch 'main' into taxonomy-of-failsafe-levels

1fbab3a

michal-gubricky approved these changes Sep 25, 2024

DEiselt approved these changes Sep 25, 2024

kgube requested changes Sep 25, 2024

bitkeks reviewed Sep 25, 2024

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (outdated, resolved)

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

5ffe31a

Signed-off-by: josephineSei <[email protected]>

bitkeks approved these changes Sep 25, 2024

Update scs-XXXX-vN-taxonomy-of-failsafe-levels.md

7931127

Signed-off-by: josephineSei <[email protected]>

kgube approved these changes Sep 25, 2024

josephineSei added 2 commits September 25, 2024 16:06

Rename scs-XXXX-vN-taxonomy-of-failsafe-levels.md to scs-0118-v1-taxo…

ee531ad

…nomy-of-failsafe-levels.md Signed-off-by: josephineSei <[email protected]>

Update and rename scs-XXXX-w1-example-impacts-of-failure-scenarios.md…

37ec252

… to scs-0118-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <[email protected]>

josephineSei merged commit 79e291d into main Sep 25, 2024
7 checks passed

josephineSei deleted the taxonomy-of-failsafe-levels branch September 25, 2024 14:11

Suggested change

-| Rack Outage | Medium | similar to Disk Failure and Node Outage |
+| Rack Outage | Medium | Outage of all nodes in rack |
