Availability checklist

Application design

  • Avoid any single point of failure. All components, services, resources, and compute instances should be deployed as multiple instances to prevent a single point of failure from affecting availability. This includes authentication mechanisms. Design the application to be configurable to use multiple instances, and to automatically detect failures and redirect requests to non-failed instances where the platform does not do this automatically.
  • Decompose workloads by SLA. If a service is composed of critical and less-critical workloads, manage them separately, and specify the service features and number of instances that each needs to meet its availability requirements.
  • Minimize and understand service dependencies. Minimize the number of different services used where possible, and ensure you understand all of the feature and service dependencies that exist in the system. This includes the nature of these dependencies, and the impact of failure or reduced performance in each one on the overall application. Microsoft guarantees at least 99.9% availability for most services, but the availabilities of the services that an application depends on multiply together, so every additional service it relies on potentially reduces the overall availability SLA of your system by about 0.1%. For example, an application that requires two services, each with a 99.9% SLA, can expect a composite availability of roughly 99.8%.
  • Design tasks and messages to be idempotent where possible so that duplicated requests will not cause problems. For example, a service can act as a consumer that handles messages sent as requests by other parts of the system that act as producers. If the consumer fails after processing the message, but before acknowledging that it has been processed, a producer might submit a repeat request which could be handled by another instance of the consumer. For this reason, consumers and the operations they carry out should be idempotent so that repeating a previously executed operation does not render the results invalid. This may mean detecting duplicated messages, or ensuring consistency by using an optimistic approach to handling conflicts.
  • Use a message broker that implements high availability for critical transactions. Many scenarios for initiating tasks or accessing remote services use messaging to pass instructions between the application and the target service. For best performance, the application should be able to send the message and then return to process more requests without needing to wait for a reply. To guarantee delivery of messages, the messaging system should provide high availability. Service Bus Message Queues implement at-least-once semantics, whereby each message posted to a queue will not be lost, although duplicate copies may be delivered under certain circumstances. If message processing is idempotent (see the previous item), then repeated delivery should not be a problem.
  • Design applications to gracefully degrade when reaching resource limits, and take appropriate action to minimize the impact for the user. In some cases, the load on the application may exceed the capacity of one or more parts, causing reduced availability and failed connections. Scaling can help to alleviate this, but it may reach a limit imposed by factors such as resource availability or cost. Design the application so that, in this situation, it can automatically degrade gracefully. For example, in an ecommerce system, if the order-processing subsystem is under strain (or has even failed completely), that part of the system can be temporarily disabled while allowing other functionality (such as browsing the product catalog) to continue. It might be appropriate to postpone requests to a failing subsystem, for example still enabling customers to submit orders but saving them to a queue or other safe storage mechanism for later processing when the orders subsystem is available again.
  • Gracefully handle rapid burst events. Most applications need to handle varying workloads over time, such as peaks first thing in the morning in a business application or when a new product is released in an ecommerce site. Auto-scaling can help to handle the load, but it may take some time for additional instances to come online and handle requests. Prevent sudden and unexpected bursts of activity from overwhelming the application by designing it to queue requests to the services it uses and to degrade gracefully when queues are near to full capacity. Ensure that there is sufficient performance and capacity available under non-burst conditions to drain the queues and handle outstanding requests. For more information, see the Queue-Based Load Leveling Pattern.
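
The idempotency guidance above can be illustrated with a minimal sketch of a consumer that detects duplicated messages. It assumes that each producer attaches a unique `message_id` to every message; the `Message` and `IdempotentConsumer` names and the in-memory state are hypothetical stand-ins, not part of any Azure SDK.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # unique ID assigned by the producer (an assumption of this sketch)
    account: str
    amount: int


@dataclass
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""
    balances: dict = field(default_factory=dict)
    processed_ids: set = field(default_factory=set)

    def handle(self, msg: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if msg.message_id in self.processed_ids:
            return False  # duplicate delivery: skip without side effects
        self.balances[msg.account] = self.balances.get(msg.account, 0) + msg.amount
        self.processed_ids.add(msg.message_id)
        return True


consumer = IdempotentConsumer()
deposit = Message(message_id="m-1", account="alice", amount=50)
first = consumer.handle(deposit)
second = consumer.handle(deposit)  # redelivered copy of the same message
```

In a real system the set of processed IDs and the state change would need to be stored durably and updated together; otherwise a crash between the two writes reintroduces the duplicate-processing problem.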

Deployment and maintenance

  • Deploy multiple instances of roles for each service. Microsoft makes availability guarantees for services that you create and deploy, but these guarantees are only valid if you deploy at least two instances of each role in the service. This enables one role instance to be unavailable while the other remains active. This is especially important if you need to deploy updates to a live system without interrupting clients' activities; instances can be taken down and upgraded individually while the others continue online.

  • Host applications in multiple datacenters. Although extremely unlikely, it is possible for an entire datacenter to go offline through an event such as a natural disaster or trunk Internet failure. Vital business applications should be hosted in more than one datacenter to provide maximum availability. This has an added advantage in that it can reduce latency for local users, and provides additional opportunities for flexibility when updating applications.

  • Automate and test deployment and maintenance tasks. Distributed applications consist of multiple parts that must work together. Deployment should therefore be automated using tested and proven mechanisms such as scripts and deployment applications that update and validate configuration, and automate the deployment process. Automated techniques should also be used to perform updates of all or parts of applications. It is vital to test all of these processes fully to ensure that errors do not cause additional downtime. All deployment tools must have suitable security restrictions to protect the deployed application; define and enforce deployment policies carefully and minimize the need for human intervention.

  • Consider using staging and production features of the platform where these are available. For example, using Azure Cloud Services staging and production environments allows applications to be switched from one to another instantly through a virtual IP address swap (VIP Swap). However, if you prefer to stage on-premises, or deploy different versions of the application concurrently and gradually migrate users, you may not be able to use a VIP Swap operation.

  • Apply configuration changes without recycling the instance when possible. In many cases, the configuration settings for an Azure application or service can be changed without requiring the role to be restarted. Roles expose events that can be handled to detect configuration changes and apply them to components within the application. However, some changes to the core platform settings will require a role to be restarted. When building components and services, maximize availability and minimize downtime by designing them to accept changes to configuration settings without requiring the application as a whole to be restarted.

  • Use upgrade domains for zero downtime during updates. Azure compute units such as web and worker roles are allocated to upgrade domains. Upgrade domains group role instances together so that, when a rolling update takes place, the instances in each upgrade domain are stopped, updated, and restarted one domain at a time, which minimizes the impact on the availability of the application. You can specify how many upgrade domains should be created for a service when the service is deployed.

    Note: Roles are also distributed across fault domains, each of which is reasonably independent from other fault domains in terms of server rack, power, and cooling provision, in order to minimize the chance of a failure affecting all role instances. This distribution occurs automatically and you cannot control it.

  • Configure availability sets for Azure virtual machines. Placing two or more virtual machines in the same availability set guarantees that these virtual machines will not be deployed to the same fault domain. To maximize availability, you should create multiple instances of each critical virtual machine used by your system and place these instances in the same availability set. If you are running multiple virtual machines that serve different purposes, create a separate availability set for each purpose and add all instances of the corresponding virtual machine to that set. For example, if you have created separate virtual machines to act as a web server and a reporting server, create an availability set for the web server and another availability set for the reporting server. Add instances of the web server virtual machine to the web server availability set, and add instances of the reporting server virtual machine to the reporting server availability set.
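
The effect of upgrade domains on a rolling update can be modeled with a small simulation. This is a sketch only: the round-robin assignment, the instance names, and the `rolling_update` helper are assumptions made for illustration and do not describe how Azure actually places instances.

```python
from collections import defaultdict


def assign_upgrade_domains(instances, domain_count):
    """Spread instances round-robin across `domain_count` upgrade domains (assumed policy)."""
    domains = defaultdict(list)
    for index, name in enumerate(instances):
        domains[index % domain_count].append(name)
    return dict(domains)


def rolling_update(instances, domain_count, new_version):
    """Update one upgrade domain at a time.

    Returns the final version of every instance and the lowest number of
    instances that remained in service at any point during the update.
    """
    versions = {name: "v1" for name in instances}
    min_available = len(instances)
    domains = assign_upgrade_domains(instances, domain_count)
    for domain in sorted(domains):
        offline = domains[domain]  # only this domain is taken down
        min_available = min(min_available, len(instances) - len(offline))
        for name in offline:
            versions[name] = new_version
    return versions, min_available


instances = [f"web_{i}" for i in range(6)]
versions, min_available = rolling_update(instances, domain_count=3, new_version="v2")
```

With six instances in three upgrade domains, at most two instances are offline at once, so four stay in service throughout. Fewer domains shorten the update but take more capacity offline at each step.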

Data management

  • Take advantage of data replication through both local and geographical redundancy. Data in Azure storage is automatically replicated to protect against loss in case of infrastructure failure, and some aspects of this replication can be configured. For example, read-only copies of data may be replicated in more than one geographical region (referred to as read-access geo-redundant storage, or RA-GRS). Note that using RA-GRS incurs additional charges; see the Azure Storage Pricing page on the Microsoft website for details.
  • Use optimistic concurrency and eventual consistency where possible. Transactions that block access to resources through locking (pessimistic concurrency) can cause poor performance and considerably reduce availability. These problems can become especially acute in distributed systems. In many cases, careful design and techniques such as partitioning can minimize the chances of conflicting updates occurring. Where data is replicated, or is read from a separately updated store, the data will only be eventually consistent but the advantages usually far outweigh the impact on availability of using transactions to ensure immediate consistency.
  • Use periodic backup and point-in-time restore, and ensure that they meet the Recovery Point Objective (RPO). Regularly and automatically back up data that is not preserved elsewhere, and verify you can reliably restore both the data and the application itself should a failure occur. Data replication is not a backup feature because errors and inconsistencies introduced through failure, error, or malicious operations will be replicated across all stores. The backup process must be secure to protect the data in transit and in storage. Databases or parts of a data store can usually be recovered to a previous point in time by using transaction logs. Microsoft Azure provides a backup facility for data stored in Microsoft Azure SQL Database. The data is exported to a backup package on Microsoft Azure blob storage, and can be downloaded to a secure on-premises location for storage.
  • Enable the high availability option to maintain a secondary copy of a Redis cache. When using Redis Cache, choose the Standard option to maintain a secondary copy of the contents. For more information, see the page Create a cache in Azure Redis Cache on the Microsoft website.
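
The optimistic concurrency approach recommended above can be sketched as a version check on every write. The `VersionedStore` class below is a hypothetical in-memory stand-in for a data store that supports conditional writes; real stores typically expose the same idea through a version number or ETag.

```python
class ConflictError(Exception):
    """Raised when the record changed since the caller read it."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key reads as version 0."""
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if nobody else has written since `expected_version` was read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise ConflictError(f"{key!r} is at version {current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def update_with_retry(store, key, transform, max_attempts=3):
    """Re-read and re-apply `transform` when a conflicting write is detected."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            return store.write(key, transform(value), expected_version=version)
        except ConflictError:
            continue  # someone else won the race; retry with fresh data
    raise ConflictError(f"gave up on {key!r} after {max_attempts} attempts")


store = VersionedStore()
store.write("stock", 10, expected_version=0)
version, value = store.read("stock")
store.write("stock", 9, expected_version=version)  # first writer succeeds

conflict_detected = False
try:
    store.write("stock", 8, expected_version=version)  # stale version is rejected
except ConflictError:
    conflict_detected = True

final_version = update_with_retry(store, "stock", lambda v: v - 1)
```

No lock is held between the read and the write, so readers and other writers are never blocked; the cost is that a writer occasionally has to repeat its work when a conflict is detected.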

Errors and failures

  • Introduce the concept of a timeout. Services and resources may become unavailable, causing requests to fail. Ensure that the timeouts you apply are appropriate for each service or resource as well as the client that is accessing them (in some cases, it may be appropriate to allow a longer timeout for a particular instance of a client, depending on the context and other actions that the client is performing). Very short timeouts may cause excessive retry operations for services and resources that have considerable latency, but very long timeouts can cause blocking if a large number of requests are queued waiting for a service or resource to respond.
  • Retry failed operations caused by transient faults. Design a retry strategy for access to all services and resources where they do not inherently support automatic connection retry. Use a strategy that includes an increasing delay between retries as the number of failures increases to prevent overloading of the resource and to allow it to gracefully recover and handle queued requests. Continual retries with very short delays are likely to exacerbate the problem.
  • Stop sending requests to avoid cascading failures when remote services are unavailable. There may be situations where transient or other faults, ranging in severity from a partial loss of connectivity to the complete failure of a service, take much longer than expected to return to normal. Additionally, if a service is very busy, failure in one part of the system may lead to cascading failures, and result in many operations becoming blocked while holding onto critical system resources such as memory, threads, and database connections. Instead of continually retrying an operation that is unlikely to succeed, the application should quickly accept that the operation has failed, and gracefully handle this failure. You can use the Circuit Breaker pattern to reject requests for specific operations for defined periods. The Circuit Breaker Pattern page on the Microsoft website provides more details.
  • Compose or fall back to multiple components to mitigate the impact of a specific service being offline or unavailable. Design applications to take advantage of multiple instances without affecting operation and existing connections where possible. Use multiple instances and distribute requests between them, and detect and avoid sending requests to failed instances, in order to maximize availability.
  • Fall back to a different service or workflow where possible. For example, if writing to SQL Database fails, temporarily store data in blob storage, and provide a facility to replay the writes in blob storage to SQL Database when the service becomes available. In some cases, a failed operation may have an alternative action that allows the application to continue to work even when a component or service fails. If possible, detect failures and redirect requests to other services that can offer suitable alternative functionality, or to backup or reduced-functionality instances that can maintain core operations while the primary service is offline.
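
The behavior described for the Circuit Breaker pattern can be sketched as follows. This is a simplified illustration rather than a production implementation: the class name, the thresholds, and the injectable `clock` parameter (used here so the example does not have to wait in real time) are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the operation."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; request rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock so the demonstration is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("service unavailable")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # the reset timeout has passed and the service has recovered
recovered = breaker.call(lambda: "ok")
```

After two failures the third call is rejected immediately, without touching the remote service, which frees threads and connections that would otherwise be held while waiting. Once the timeout elapses, a single trial call decides whether the circuit closes again.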

Monitoring and disaster recovery

  • Provide rich instrumentation for likely failures and failure events to report the situation to operations staff. For failures that are likely but have not yet occurred, provide sufficient data to enable operations staff to determine the cause, mitigate the situation, and ensure that the system remains available. For failures that have already occurred, the application should return a suitable error message to the user but attempt to continue running, albeit with reduced functionality. In all cases, the monitoring system should capture comprehensive details to enable operations staff to effect a quick recovery and, if necessary, to enable designers and developers to modify the system to prevent the situation from arising again.
  • Monitor system health by implementing checking functions. The health and performance of an application can degrade over time without being noticeable until it fails. One way to guard against this is to implement probes or check functions that are executed regularly from outside the application. These checks can be as simple as measuring response time for the application as a whole, for individual parts of the application, for individual services that the application uses, or for individual components. Check functions can execute processes to ensure they produce valid results, measure latency and check availability, and extract information from the system.
  • Regularly test all failover and fallback systems to ensure they are available and operate as expected. Changes to systems and operations may affect failover and fallback functions, but the impact may not be detected until the main system fails or becomes overloaded. It is good practice to test these systems before they are required to compensate for a live problem at runtime.
  • Test the monitoring systems. Automated failover and fallback systems, and manual visualization of system health and performance by using dashboards, all depend on the monitoring and instrumentation functioning correctly. If these elements fail, miss critical information, or report inaccurate data, then an operator might not realize that the system is unhealthy or failing.
  • Track the progress of long-running workflows and retry on failure. Long-running workflows are often composed of multiple steps. When designing these types of workflows, ensure that each step is independent and can be retried, to minimize the chance that the entire workflow will need to be rolled back or that multiple compensating transactions need to be executed. Monitor and manage the progress of long-running workflows by implementing a pattern such as Scheduler Agent Supervisor. For more information, see the Scheduler Agent Supervisor Pattern page on the Microsoft website.
  • Plan for disaster recovery. Ensure there is a documented, agreed, and fully tested plan for recovery from any type of failure that may render part or all of the main system unavailable. Test the procedures regularly and ensure that all operations staff are familiar with the process.
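
The advice on tracking long-running workflows can be illustrated with a small sketch in which completed steps are recorded so that a retry resumes from the failed step instead of restarting the workflow. This is a simplification and not the full Scheduler Agent Supervisor pattern; the `run_workflow` function, the step names, and the in-memory `completed` list (standing in for durable progress storage) are hypothetical.

```python
def run_workflow(steps, completed, max_attempts=3):
    """Run (name, action) steps in order, skipping those already in `completed`.

    Each step is retried independently; progress is recorded after every
    successful step so that a later run can resume where this one stopped.
    """
    for name, action in steps:
        if name in completed:
            continue  # finished in an earlier run
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                completed.append(name)  # record progress before moving on
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; `completed` still holds the progress so far
    return completed


calls = []
charge_failures = [1]  # make the first charge attempt fail, then succeed


def reserve():
    calls.append("reserve")


def charge():
    calls.append("charge")
    if charge_failures[0] > 0:
        charge_failures[0] -= 1
        raise TimeoutError("payment service timed out")


def ship():
    calls.append("ship")


steps = [("reserve", reserve), ("charge", charge), ("ship", ship)]
progress = run_workflow(steps, completed=[])
```

Only the failed step is repeated, so the earlier `reserve` step is not executed twice. Because a step can fail partway through and then run again, each step should itself be idempotent.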