Taxonomy of failsafe levels #579
Signed-off-by: josephineSei <[email protected]>
Good write-up. I added some spelling, phrasing and terminology adjustment suggestions.
Co-authored-by: Markus Hentsch <[email protected]> Signed-off-by: josephineSei <[email protected]>
Signed-off-by: josephineSei <[email protected]>
IMO, this DR is not in a final state, and we should go for another round of discussion.
Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources.
This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels.
Some glue text is missing here. I would rephrase the sentences as follows:
Suggested change:
- Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources.
- This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels.
+ Some standards provided by the SCS project will talk about or require procedures to backup resources or have redundancy for resources. As these terms are neither officially defined nor intuitive, this decision record tries to get some clarity in this topic. It discusses which failure threats are cloud service provider facing and classifies them into several levels.
## Decision

First there needs to be an overview about possible failure cases in infrastructures:
Suggested change:
- First there needs to be an overview about possible failure cases in infrastructures:
+ First there needs to be an overview about possible failure cases in infrastructures as well as their probability of occurrence and the damage they may cause.
| Failure Case | Probability | Consequences |
|----|-----|----|
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
In favor of simplicity, I would assume that disk loss/failure causes permanent loss of the data on this disk.
Suggested change:
- | Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
+ | Disk Failure/Loss | High | Permanent data loss on this disk. Impact depends on type of lost data (data base, user data) |
| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
I prefer to distinguish between Node Failure/Loss, meaning the hardware is irrecoverably damaged, and Node Outage, caused by a power outage, as the two cases have different implications. Furthermore, we should define a node as computation hardware without disks. This simplifies the classification of the cases.
Suggested change:
- | Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
+ | Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) |
+ | Node Outage | Medium to High | Temporary loss of functionality and connectivity of node (impact depends on type of node) |
| Rack Outage | Medium | similar to Disk Failure and Node Outage |
A rack outage means an outage of all nodes in the rack. As the disks are not damaged, I prefer to limit the consequences to:
Suggested change:
- | Rack Outage | Medium | similar to Disk Failure and Node Outage |
+ | Rack Outage | Medium | Outage of all nodes in rack |
| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |
As I said, I would omit "data loss" and focus on the major consequence. Most protocols work with acknowledgments, so we can assume that data loss is temporary. What we really lose is CPU and RAM state, but we should omit these consequences, as we cannot prevent or avoid them.
Suggested change:
- | Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |
+ | Power Outage (Data Center supply) | Medium | temporary outage of all nodes in rack (impact depends on type of node) |
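Taken together, the review suggestions in this thread separate permanent failures (disk, node hardware) from temporary outages (node, rack, power). The following is a minimal Python sketch of that split, not part of the PR itself. The names `FailureCase`, `Duration`, `FAILURE_CASES`, and `permanent_cases` are hypothetical, the rows only mirror the reviewer-suggested wording above, and treating a rack outage as temporary is an assumption derived from the failure-versus-outage distinction rather than something the table states.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    """Hypothetical helper: whether the consequence is permanent or temporary."""
    PERMANENT = "permanent"
    TEMPORARY = "temporary"


@dataclass(frozen=True)
class FailureCase:
    name: str
    probability: str   # kept as free text, as in the table ("Medium to High")
    consequence: str
    duration: Duration


# Rows follow the reviewer suggestions in this thread, not a merged document.
FAILURE_CASES = [
    FailureCase("Disk Failure/Loss", "High",
                "Permanent data loss on this disk", Duration.PERMANENT),
    FailureCase("Node Failure/Loss (without disks)", "Medium to High",
                "Permanent loss of functionality and connectivity of node",
                Duration.PERMANENT),
    FailureCase("Node Outage", "Medium to High",
                "Temporary loss of functionality and connectivity of node",
                Duration.TEMPORARY),
    # Assumption: an outage is temporary, since the disks are not damaged.
    FailureCase("Rack Outage", "Medium",
                "Outage of all nodes in rack", Duration.TEMPORARY),
    FailureCase("Power Outage (Data Center supply)", "Medium",
                "Temporary outage of all nodes in rack", Duration.TEMPORARY),
]


def permanent_cases(cases):
    """Return the failure cases whose consequences are permanent."""
    return [c for c in cases if c.duration is Duration.PERMANENT]


if __name__ == "__main__":
    for case in permanent_cases(FAILURE_CASES):
        print(f"{case.name}: {case.consequence}")
```

Keeping the probability as free text avoids inventing an ordering for values such as "Medium to High", which the table does not define.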
Signed-off-by: josephineSei <[email protected]>
:::

| Level/Class | Probability | Failure Causes | loss in IaaS | User Hints |
I am still thinking about the "User Hints" column. Putting it next to the other columns is good from some perspectives, as the table can be read like this: I want to achieve failsafe level 2, where failures can be triggered by these failure causes and result in these losses on the IaaS level, so I can do what is shown in the user hints.
But we wanted the classification not as examples for users but mainly as a definition for standards, so maybe we should not reference those standards here.
We could instead use an extra table with example actions (standards, "user has to do things", ...) for each level/class, or maybe this should not be in a decision record at all but rather in a guide.
I agree with you, @josephineSei. Linking SCS standards in this table may cause a huge synchronization effort. We would always have to update this DR whenever referenced standards change. I appreciate the "User Hints" column but would limit it to a textual explanation. See my suggestions below...
| (Volume) Snapshot | Thinly-provisioned copy-on-write snapshots of volumes. Stored in the same Cinder storage backend as volumes. |
| Volume Type | Attribute of volumes determining storage details of a volume such as backend location or whether the volume will be encrypted. |
| (Barbican) Secret | IaaS resource storing cryptographic assets such as encryption keys. Managed by the Barbican service. |
| Key Encryption Key | IaaS resource, used to encrypt other keys to be able to store them encrypted in a database. |
Is a key encryption key really an IaaS resource? I thought key encryption keys are stored in configuration files, and if that is the case, it is a configuration setting.
The glossary is meant to describe all terms that may be unknown. As this standard also concerns Key Encryption Keys, the term should be listed in the glossary.
I suggested some minor changes but do not want to block the PR.
Co-authored-by: anjastrunk <[email protected]> Signed-off-by: josephineSei <[email protected]>
Signed-off-by: josephineSei <[email protected]>
… to scs-XXXX-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <[email protected]>
Signed-off-by: josephineSei <[email protected]>
Co-authored-by: Michal Gubricky <[email protected]> Signed-off-by: josephineSei <[email protected]>
Hey @jschoone or @tonifinger or @DEiselt, could one of you help with the KaaS part of this document (or do you know someone who can?) by the end of the week?
Hi Josephine, Jan and I will meet today to discuss this and provide feedback :)
Signed-off-by: Jan Schoone <[email protected]>
@mbuechse @kgube @michal-gubricky @jschoone: Please review, request changes and/or approve. We want to merge this PR as soon as possible.
@jschoone @bitkeks @DEiselt @michal-gubricky If you are okay with this decision record as a whole, please approve this PR; otherwise, please tell me what you think is still missing.
Some small fixes and suggestions
Signed-off-by: josephineSei <[email protected]>
LGTM, left one comment regarding the usage of RTO.
Good approach of defining the levels, then the potential influences and the many components that are affected. Mapping this all to the SCS technology parts will help CSPs and users to identify those parts of their instance they need to have a look at more specifically.
Of course, this document will probably evolve with more empirical data. You could, for example, discuss whether river floods in Germany are really "very low" probability 😃
Very important also to highlight the need for a CSP-internal risk analysis. There cannot be a generalized approach that applies everywhere!
Signed-off-by: josephineSei <[email protected]>
LGTM
…nomy-of-failsafe-levels.md Signed-off-by: josephineSei <[email protected]>
… to scs-0118-w1-example-impacts-of-failure-scenarios.md Signed-off-by: josephineSei <[email protected]>
closes #527