Create Availability Zone Standard #640

Merged
merged 36 commits into from
Oct 14, 2024
Changes from 12 commits
dc9ad59
Create scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Jun 17, 2024
336e2dc
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Jun 18, 2024
3ae46c4
First part for fire zones being through, rest will follow
josephineSei Jun 19, 2024
447d90c
Further work on other factors for AZs than fire zones
josephineSei Jun 21, 2024
eb6e5bc
First complete Draft of Availability Standard
josephineSei Jun 24, 2024
5f8c566
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Jun 24, 2024
b576580
Merge branch 'main' into availability-zones-standard
josephineSei Jun 24, 2024
856e099
Apply suggestions from code review
josephineSei Jun 25, 2024
1a9140e
Restructuring and adding discussed point from IaaS call.
josephineSei Jun 27, 2024
286ff93
Apply suggestions from code review
josephineSei Aug 16, 2024
89e770a
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Aug 16, 2024
afec372
Merge branch 'main' into availability-zones-standard
josephineSei Aug 19, 2024
7475746
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Aug 21, 2024
6c03427
Merge branch 'main' into availability-zones-standard
josephineSei Aug 26, 2024
e660cb4
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
355e84a
Create scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
65460b5
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
fe13def
Merge branch 'main' into availability-zones-standard
josephineSei Sep 18, 2024
8c98249
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Sep 18, 2024
057d093
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 19, 2024
1ad5852
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 20, 2024
f3f76fc
Update scs-XXXX-w1-Availability-Zones-Standard.md
josephineSei Sep 20, 2024
a72ef56
Apply suggestions from code review
josephineSei Sep 25, 2024
79e0428
Update scs-XXXX-vN-Availability-Zones-Standard.md
josephineSei Sep 25, 2024
e3cc2af
Merge branch 'main' into availability-zones-standard
josephineSei Sep 26, 2024
87e3d48
Rename scs-XXXX-vN-Availability-Zones-Standard.md to scs-0119-v1-Avai…
josephineSei Sep 30, 2024
efd2ab5
Update and rename scs-XXXX-w1-Availability-Zones-Standard.md to scs-0…
josephineSei Sep 30, 2024
58d636a
Update scs-0119-w1-Availability-Zones-Standard.md
josephineSei Sep 30, 2024
720b94f
Update scs-0119-w1-Availability-Zones-Standard.md
josephineSei Sep 30, 2024
174b2bc
Merge branch 'main' into availability-zones-standard
josephineSei Oct 4, 2024
dc61967
Rename scs-0119-v1-Availability-Zones-Standard.md to scs-0120-v1-Avai…
josephineSei Oct 4, 2024
ce5cc7b
Update and rename scs-0119-w1-Availability-Zones-Standard.md to scs-0…
josephineSei Oct 4, 2024
309e4d8
Merge branch 'main' into availability-zones-standard
josephineSei Oct 8, 2024
b9d47c3
Rename scs-0120-v1-Availability-Zones-Standard.md to scs-0121-v1-Avai…
josephineSei Oct 14, 2024
fd04b68
Update scs-0120-w1-Availability-Zones-Standard.md
josephineSei Oct 14, 2024
c2423e4
Merge branch 'main' into availability-zones-standard
josephineSei Oct 14, 2024
204 changes: 204 additions & 0 deletions Standards/scs-XXXX-vN-Availability-Zones-Standard.md
@@ -0,0 +1,204 @@
---
title: Availability Zones Standard
type: Standard
status: Draft
track: IaaS
---

## Introduction

On the IaaS level, especially in OpenStack, it is possible to group resources in Availability Zones.
Such zones are often mapped to the physical layer of a deployment, e.g. physical separation of hardware, redundancy of power circuits or fire zones.
But how CSPs apply Availability Zones to the IaaS layer of a deployment may differ widely.
Therefore this standard addresses the minimal requirements that need to be met when creating Availability Zones.

## Terminology

| Term | Explanation |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| Availability Zone | (also: AZ) internal representation of a physical grouping of service hosts, which also leads to an internal grouping of resources. |
| Fire Zone | A physical separation in a data center that contains a fire within it, effectively stopping the fire from spreading. |
| PDU | Power Distribution Unit, used to distribute the power to all physical machines of a single server rack. |
| Compute | A generic name for the IaaS service that manages virtual machines (e.g. Nova in OpenStack). |
| Network | A generic name for the IaaS service that manages network resources (e.g. Neutron in OpenStack). |
| Storage | A generic name for the IaaS service that manages the storage backends and virtual devices (e.g. Cinder in OpenStack). |
| BSI | German Federal Office for Information Security (Bundesamt für Sicherheit in der Informationstechnik) |
| CSP | Cloud Service Provider, provider managing the OpenStack infrastructure. |

## Motivation

Redundancy is a non-trivial but relevant issue for a cloud deployment.
First and foremost it is necessary to increase failure safety through redundancy on the physical layer.
The IaaS layer as the first abstraction layer from the hardware has an important role in this topic, too.
The grouping of redundant physical resources into Availability Zones on the IaaS level gives customers the option to distribute their workload over different AZs, which results in better failure safety.
While CSPs already have some similarities in their grouping of physical resources into AZs, there are also differences.
This standard aims to reduce those differences and to clarify what customers can expect from Availability Zones in IaaS.

Availability Zones in IaaS can be set up for Compute, Network and Storage separately while all may be referring to the same physical separation in a deployment.
This standard elaborates the necessity of having Availability Zones for each of these classes.
It will also examine the requirements customers may have when thinking about Availability Zones in relation to the taxonomy of failure safety levels[^1].
The result should enable CSPs to know when to create AZs to be SCS-compliant.

## Design Considerations

Availability Zones should represent parts of the same physical deployment that are independent of each other.
The maximum level of physical independence is achieved through putting physical machines into different fire zones.
In that case a failure case up to level 3 as described in the taxonomy of failure safety levels document[^1] will not lead to a complete outage of the deployment.

Having Availability Zones represent fire zones also means that AZs are able to take over workload from another AZ in a failure case of level 3.
Thus even the destruction of one Availability Zone will not automatically entail the destruction of the other AZs.

:::caution

Even with fire zones being physically designed to protect parts of a data center from severe destruction in case of a fire, this will not always succeed.
Availability Zones in Clouds are most of the time within the same physical data center.
In case of a big catastrophe like a huge fire or a flood the whole data center could be destroyed.
Availability Zones will not protect customers against these failure cases of level 4 of the taxonomy of failure safety[^1].

:::

Smaller deployments like edge deployments may not have more than one fire zone in a single location.
To include such deployments, it should not be required to use Availability Zones.

Other physical factors that should be considered are the power supplies, internet connection, cooling and core routing.
CSPs have also used Availability Zones as a representation of redundant PDUs.
That means there are deployments which have one Availability Zone per rack, as each rack has its own PDU and this was considered to be the single point of failure an AZ should represent.
While this is also a possible measure of independence, it only provides failure safety for level 2.
Therefore this standard should be very clear about which kind of independence an AZ should represent, and it should not be allowed to have different deployments with their Availability Zones representing different levels of failure safety.
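
The relation between the failure domain an AZ represents and the failure case it protects against can be sketched as follows. This is only an illustration: the enum and function names are hypothetical, and the level numbers simply restate those mentioned in this section (AZ per rack/PDU: level 2, AZ per fire zone: level 3).

```python
from enum import Enum


class FailureDomain(Enum):
    """Hypothetical failure domains an AZ could represent. The value is the
    highest failure case level that spreading workload over such AZs covers."""
    RACK_PDU = 2   # one AZ per rack, i.e. per PDU
    FIRE_ZONE = 3  # one AZ per fire zone


def survives(az_domain: FailureDomain, failure_level: int) -> bool:
    """True if workload spread over AZs of this kind survives the failure case."""
    return failure_level <= az_domain.value


def consistent(az_domains: list[FailureDomain]) -> bool:
    """True if all AZs represent the same kind of independence."""
    return len(set(az_domains)) <= 1
```

In this model a level 4 failure case (loss of the whole data center) is not covered by either kind of AZ, which matches the caution above.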

Additionally Availability Zones are available for Compute, Storage and Network services.
They behave differently for each of these resources and also when working across resource-based Availability Zones, e.g. attaching a volume from one AZ to a virtual machine in another AZ.
For each of these IaaS resource classes, it should be defined, under which circumstances Availability Zones should be used.

[^1]: [Taxonomy of Failsafe Levels in SCS (TODO: change link as soon as taxonomy is merged)](https://github.com/SovereignCloudStack/standards/pull/579)

### Scope of the Availability Zone Standard

When elaborating redundancy and failure safety in data centers, it is necessary to also define redundancy on the physical level.
There are already recommendations from the BSI for physical redundancy within a cloud deployment [^2].
This standard considers these recommendations as a basis that is followed by most CSPs.
So this standard will not go into details already provided by the CSP, but will rather concentrate on the IaaS layer and only take a coarse view of the physical layer.
The first assumption from the recommendations of the BSI is that the destruction of one fire zone will not lead to an outage of all power lines (not PDUs), internet connections, core routers or cooling systems.

For the setup of Availability Zones this means that within every AZ there needs to be redundancy in core routers, internet connection and power lines, as well as at least two separate cooling systems.
This should avoid having single points of failure within the Availability Zones.
But all this physical infrastructure can be shared by all Availability Zones in a deployment, as long as it can survive the destruction of one fire zone.

[^2]: [Availability recommendations from the BSI](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9)

### Options considered

#### Physical-based Availability Zones

It is possible to standardize the usage of Availability Zones over all IaaS resources.
The downside of this is that the IaaS resources behave so differently that they have different requirements for redundancy and thus for Availability Zones.
This is not the way to go.
Besides that, it is already possible to create two physically separated deployments close to each other, connect them with each other and use regions to differentiate between the IaaS of both deployments.

The question that remains is what an Availability Zone should consist of.
Having one Availability Zone per fire zone gives the best level of failure safety that can be achieved by CSPs.
When building on the relation between fire zones and the physical redundancy recommendations of the BSI, this combination is a good starting point, but it needs to be checked for validity for the different IaaS resources.

Another point is where Availability Zones can be instantiated and what the connection between AZs should look like.
To have a proper way to deal with outages of one AZ, where a second AZ can step in, a few requirements need to be met for the connection between those two AZs.
The amount of data that needs to be transferred very fast in a failure case may be enormous, so there is a requirement for a high bandwidth between connected AZs.
To avoid additional failure cases, the latency between those two Availability Zones needs to be low.
With such requirements it is very clear that AZs should only reside within one (physical) region of an IaaS deployment.

#### AZs in Compute

Compute Hosts are physical machines on which the compute service runs.
A single virtual machine is always running on ONE compute host.
Redundancy of virtual machines is either up to the layer above IaaS or up to the customers themselves.
Having Availability Zones gives customers the possibility to run another virtual machine as a backup within another Availability Zone.

Customers will expect that in case of the failure of one Availability Zone all other AZs are still available.
The highest possible failure safety here is achieved when the Availability Zones for Compute are mapped to different fire zones.
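
As an illustration of how a customer might use Compute AZs for redundancy, the following sketch spreads the replicas of a workload round-robin over the available AZs and lists which replicas remain after one AZ fails. The AZ and VM names and both helper functions are hypothetical and not part of any OpenStack API.

```python
from itertools import cycle


def place_replicas(replicas: list[str], zones: list[str]) -> dict[str, str]:
    """Assign each replica to an AZ in round-robin order."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    return dict(zip(replicas, cycle(zones)))


def surviving(placement: dict[str, str], failed_zone: str) -> list[str]:
    """Replicas that are still running after the given AZ is lost."""
    return [vm for vm, az in placement.items() if az != failed_zone]
```

With two AZs and at least two replicas, the loss of either AZ leaves at least one replica running, which is the expectation described above.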

When the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling.
An outage of one of these physical resources will not affect the compute host and its resources for more than a minimal timeframe.
But when a single PDU is used for a rack, a failure of that PDU will result in an outage of all compute hosts in this rack.
In such a case it is not relevant, whether this rack represents a whole Availability Zone or is only part of a bigger AZ.
All virtual machines on the affected compute hosts will not be available and need to be restarted on other hosts, whether of the same Availability Zone or another.

#### AZs in Storage

There are many different backends used for the storage service with Ceph being one of the most prominent backends.
Configuring those backends can already include spanning one storage cluster over physical machines in different fire zones.
In combination with internal replication, a configuration is possible that already distributes replicas of volumes over different fire zones.
When a deployment has such a configured storage backend, it already can provide safety in case of a failure of level 3.
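
Whether such a replicated backend covers the loss of a fire zone can be reasoned about with a very small model. The sketch below assumes, as a simplification, that a volume stays available as long as at least one replica lies outside the lost fire zone; real backends may have stricter quorum rules. The volume and fire zone names are hypothetical.

```python
def volumes_at_risk(replicas: dict[str, list[str]]) -> list[str]:
    """Return the volumes whose replicas all sit in a single fire zone.

    `replicas` maps a volume name to the fire zones holding its replicas.
    """
    return sorted(vol for vol, zones in replicas.items() if len(set(zones)) < 2)
```

A volume with replicas in `fz1`, `fz2` and `fz3` is not reported, while a volume with two replicas that both reside in `fz1` is.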

Using Availability Zones is also possible for the storage service, but configuring AZs on top of a configuration like the one above will not increase safety.
Nevertheless, using AZs when there are different backends in different fire zones will give customers a hint to back up volumes into the storage of other AZs.

Additionally when the BSI recommendations are followed, there should already be redundancy in power lines, internet connection and cooling.
An outage of one of these physical resources will not affect the storage host and its resources for more than a minimal timeframe.
When internal replication is used, either through the IaaS or through the storage backend itself, the outage of a single PDU and thus of a single rack will not affect the availability of the data itself.
None of these physical factors requires the usage of an Availability Zone for Storage.
AZs will not increase the level of failure safety in these cases.

Still, deployments with compute AZs but without storage AZs might be confusing.
CSPs may need to communicate clearly up to which failure safety level their storage service can automatically have redundancy and from which level customers are responsible for the redundancy of their data.

#### AZs in Network

Virtualized network resources can typically be quickly and easily set up from building instructions.
Those instructions are stored in the database of the networking service.

If a physical machine on which certain network resources are set up is not available anymore, the resources can be rolled out on another physical machine without being dependent on the current situation of the lost resources.
There might only be a loss of a few packets within the affected network resources.

With Compute and Storage in a good state (e.g. through having fire zones with a compute AZ each and storage being replicated over the fire zones) there would be no downsides to omitting Availability Zones for the network service.
It might even be the opposite: Having resources running in certain Availability Zones might prevent them from being scheduled in other AZs[^3].
As the network resources like routers are bound to an AZ, in a failure case of one AZ all resource definitions might still be there in the database, while the implementation of those resources is gone.
Trying to rebuild them in another AZ is not possible, because the scheduler will not allow them to be implemented in an AZ other than the one that is present in their definition.
In a failure case of one AZ this might lead to a lot of manual work to rebuild the SDN from scratch instead of just re-using the definitions.
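
The scheduling problem described above can be illustrated with a simplified model. It is not Neutron code: the function and its parameters are hypothetical and only mimic the behaviour that a resource bound to certain AZs may only be rebuilt in those AZs.

```python
def reschedule_targets(bound_azs: list[str], healthy_azs: list[str]) -> list[str]:
    """AZs in which a network resource may be rebuilt after a failure.

    A resource without an AZ binding may be placed in any healthy AZ;
    a bound resource may only be placed in the AZs from its definition.
    """
    if not bound_azs:
        return list(healthy_azs)
    return [az for az in healthy_azs if az in bound_azs]
```

A router bound to `az1` has no valid target once `az1` is lost, although its definition still exists, while an unbound router can be rebuilt in any remaining AZ.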

Because of this severe side effect, this standard will make no recommendations about Network AZs.

[^3]: [Availability Zones in Neutron for OVN](https://docs.openstack.org/neutron/latest/admin/ovn/availability_zones.html)

### Cross-Attaching volumes from one AZ to another compute AZ

Without the networking AZs we only need to take a closer look into attaching volumes to virtual machines across AZs.

When there is more than one Storage Availability Zone, those AZs normally align with the Compute Availability Zones.
This means that fire zone 1 contains compute AZ 1 and storage AZ 1, fire zone 2 contains compute AZ 2 and storage AZ 2, and the same for fire zone 3.
It is possible to allow or forbid cross-attaching volumes from one storage Availability Zone to virtual machines in another AZ.
If it is not allowed, the creation of volume-based virtual machines will fail if there is no space left for VMs in the corresponding Availability Zone.
While this may be unfortunate, it gives customers a very clear picture of an Availability Zone.
It clarifies that having a virtual machine in another AZ also requires having a backup or replication of volumes in the other storage AZ.
Then this backup or replication can be used to create a new virtual machine in the other AZ.

It seems to be a good decision to not encourage CSPs to allow cross-attach.
Currently CSPs also do not seem to widely use it.
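
The effect of allowing or forbidding cross-attach on the creation of volume-based virtual machines can be sketched as below. The function, its parameters and the capacity model are hypothetical and only illustrate the behaviour described in this section; in OpenStack a comparable switch exists as the `cross_az_attach` option in the `[cinder]` section of the Nova configuration.

```python
def pick_compute_az(volume_az: str, free_slots: dict[str, int],
                    cross_az_attach: bool) -> str:
    """Choose the compute AZ for a VM that boots from a volume in volume_az.

    free_slots maps each compute AZ to the number of VMs it can still host.
    """
    # Prefer the compute AZ that matches the storage AZ of the volume.
    if free_slots.get(volume_az, 0) > 0:
        return volume_az
    # Only with cross-attach allowed may the VM be placed in another AZ.
    if cross_az_attach:
        for az, slots in sorted(free_slots.items()):
            if slots > 0:
                return az
    raise RuntimeError(f"no capacity for a volume-based VM in {volume_az}")
```

With cross-attach forbidden and no capacity left in the AZ of the volume, the call fails instead of silently placing the VM elsewhere, which gives the clear picture of an AZ mentioned above.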

## Standard

If Compute Availability Zones are used, they MUST be in different fire zones.
Availability Zones for Storage SHOULD be set up if no storage backend is used that can span different fire zones and automatically replicate the data.
Otherwise a single Availability Zone for Storage SHOULD be configured.

If more than one Availability Zone for Storage is set up, the attaching of volumes from one Storage Availability Zone to another Compute Availability Zone (cross-attach) SHOULD NOT be possible.

Within each Availability Zone:

- there MUST be redundancy in power supply, i.e. in the power lines into the deployment
- there MUST be redundancy in external connection (e.g. internet connection or WAN-connection)
- there MUST be redundancy in core routers
- there SHOULD be at least two cooling systems that are independent of each other

AZs SHOULD only occur within the same region and have a low-latency interconnection with a high bandwidth.
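
The requirements above could be recorded in an audit checklist along the following lines. This is a minimal sketch under stated assumptions: the data model and field names are hypothetical, the values would have to come from an audit of the deployment, and the check cannot verify the physical facts itself.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityZone:
    """Hypothetical audit record for one Compute Availability Zone."""
    name: str
    fire_zone: str
    redundant_power: bool
    redundant_external_connection: bool
    redundant_core_routers: bool
    independent_cooling_systems: int


def check_compute_azs(azs: list[AvailabilityZone]) -> tuple[list[str], list[str]]:
    """Return (violations of MUST rules, warnings for SHOULD rules)."""
    errors: list[str] = []
    warnings: list[str] = []
    fire_zones = [az.fire_zone for az in azs]
    # Compute AZs MUST be in different fire zones.
    if len(set(fire_zones)) != len(fire_zones):
        errors.append("compute AZs share a fire zone")
    for az in azs:
        if not az.redundant_power:
            errors.append(f"{az.name}: no redundant power supply")
        if not az.redundant_external_connection:
            errors.append(f"{az.name}: no redundant external connection")
        if not az.redundant_core_routers:
            errors.append(f"{az.name}: no redundant core routers")
        # Two independent cooling systems are a SHOULD, hence only a warning.
        if az.independent_cooling_systems < 2:
            warnings.append(f"{az.name}: fewer than two independent cooling systems")
    return errors, warnings
```

An empty deployment or a deployment without AZs passes trivially, in line with Availability Zones not being mandatory.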

## Related Documents

The taxonomy of failsafe levels can be used to get an overview of the levels of failure safety in a deployment (TODO: link after DR is merged).

The BSI can be consulted for further information about [failure risks](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/Kompendium/Elementare_Gefaehrdungen.pdf?__blob=publicationFile&v=4), [risk analysis for a datacenter](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Grundschutz/BSI_Standards/standard_200_3.pdf?__blob=publicationFile&v=2) or [measures for availability](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9).

## Conformance Tests

As this standard will not require Availability Zones to be present, we cannot automatically test the conformance.
The other parts of the standard are physical or internal and could only be tested through an audit.
Whether there are fire zones physically available is a criterion that will never change for a single deployment - this only needs to be audited once.
It might be possible to also use Gaia-X Credentials to provide such information, which then could be tested.