Commit 157ff38: Text tidy up, semantically minor
frossie committed Dec 27, 2024 (1 parent: 35b9fb1)
Showing 1 changed file: index.rst (37 additions, 23 deletions)
Guidelines for gated updates for SQuaRE services (including Science Platform)

.. abstract::

   The aim of this document is to expose the heuristics by which we consider what the appropriate process is for various types of service updates. We also describe our extant processes around Science Platform and Science Platform adjacent services, as these are likely to be of most interest to developers outside our team.

..
   Technote content.

.. note::

   The aim of this document is to expose the heuristics by which we consider what the appropriate process is for various types of service updates.
   We also describe our extant processes around Science Platform and Science Platform adjacent services, as these are likely to be of most interest to developers outside our team.

.. Add content here.
.. Do not include the document title (it's automatically added from metadata.yaml).
.. note::

   This is a *descriptive* (*not* prescriptive) document describing details of the current practice of a specific team in a specific area. For the actual developer guide, see developer.lsst.io.


Background
==========

SQuaRE uses Configuration as Code (aka “gitops”) practices for service deployment whenever practical.
What this means is that by merging code or configuration to a special branch (typically master), it is *possible* to trigger an action (via an agent or supervisory system) that automatically deploys the new version of the service.
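
A minimal sketch of this pattern, using a hypothetical ArgoCD ``Application`` resource (ArgoCD is the supervisory system discussed below; the service name, repository URL and path are illustrative, not any actual SQuaRE deployment):

.. code-block:: yaml

   # Hypothetical ArgoCD Application illustrating the gitops pattern: the
   # agent watches the given repository and branch, and reconciles the
   # cluster towards whatever state it finds there.
   apiVersion: argoproj.io/v1alpha1
   kind: Application
   metadata:
     name: example-service            # illustrative service name
     namespace: argocd
   spec:
     project: default
     source:
       repoURL: https://github.com/example-org/example-deploy   # illustrative repo
       targetRevision: master         # the "special branch" that gates deployment
       path: services/example-service
     destination:
       server: https://kubernetes.default.svc
       namespace: example-service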

Where any one particular instance of any one particular service lies in the Axis of Formality depends on a number of factors:

- The service’s maturity, including the number of users depending on the service

- The service environment, including whether it is a production, staging or development environment

- The service’s userbase, including whether it is a feature-seeking or risk-averse population

- The actual and reputational impact of disruptions to the service

- The nature of the change being made.

It follows that a major change in a mature, high-visibility service that many users depend on is at the opposite end of the axis from a cosmetic change in a service under development deployed on a sandbox.
The trickier situations lie in between these extremes, and the aim of this document is to expose the heuristics by which we determine these questions.
We also describe our extant processes around the update of Science Platform and Science Platform adjacent services, as these are likely to be of most interest to developers outside our team, and because they exercise a lot of the decision space described above.

Kubernetes services and Argo CD
===============================
SQuaRE develops Kubernetes-based services, and uses the ArgoCD platform to manage their deployment.

(Figure fig-argocd: The overview panel of the ArgoCD UI)

ArgoCD continuously monitors running services and compares their state with that of their git repo. When it detects that a service is “out of sync” (i.e. there is drift between its deployed state and the repository state), it can sync automatically (the continuous deployment model) or, as in Figure 2, visually indicate the out-of-sync state until an engineer pushes the sync button to resolve it (the gated deployment model).
Generally, unless working in a dev or bleed environment, we do not allow ArgoCD to sync automatically.
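
In ArgoCD ``Application`` terms, the difference between the two models is simply whether the ``syncPolicy.automated`` stanza is present; a minimal sketch (field names per the ArgoCD Application spec, values illustrative):

.. code-block:: yaml

   # Gated deployment (our default outside dev/bleed): no automated sync,
   # so an out-of-sync service waits until an engineer pushes the sync
   # button in the ArgoCD UI.
   spec:
     syncPolicy: {}

   # Continuous deployment (dev/bleed only): ArgoCD syncs as soon as it
   # detects drift from the repository state.
   #
   # spec:
   #   syncPolicy:
   #     automated:
   #       prune: true      # also delete resources removed from the repo
   #       selfHeal: true   # revert out-of-band changes to live objects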

Process
=======

The process used to deploy a specific instance of a service depends on where it lies on the axis of formality above. We use the following terminology for the various deployment environments, in decreasing order of formality:

- **Prod:** Production, a user-facing deployment. For historical reasons sometimes referred to as “stable” where the Science Platform is concerned.

- **Int:** Integration, a deployment for developers, typically from different teams, to converge upon to test the integration of services prior to deployment on Prod. This is the environment sometimes referred to as “staging”.

- **Dev:** Development, an instance that is primarily for developer testing. It may not always be available, or there may be several of them.

- **Bleed:** An environment that is left uncontrolled, either by being continuously deployed from master or by letting otherwise pinned versions of components float up.

The line between Dev and Int can be drawn differently depending on the environment and its stakeholder teams.
For example, data-dev.lsst.cloud is strictly reserved for (mainly) SQuaRE developers working on service infrastructure and other development that can result in core services being non-functional; meanwhile, Science Platform application developers target data-int.lsst.cloud, since there they can work on their applications in an otherwise stable environment.
On the other hand, the Dev system for telescope services (tucson-teststand.lsst.codes) is treated more carefully by SQuaRE developers, to avoid interfering with telescope service infrastructure work.

In some cases work is corralled in scheduled maintenance windows.
Reasons for this include:

- To minimize the potential of disruption in high availability environments

- To allow co-ordination of work on multiple services with inter-dependencies and/or infrastructure

- To communicate a “hold off reporting problems” message to users, in order to avoid reports of transient issues associated with the upgrade

- To enable the "Patch day call" format, where maintainers of environments not operated by SQuaRE can get on Zoom with us to roll out updates in "pair programming" mode, with easy access to help if they have any questions or unanticipated problems due to peculiarities of their IT infrastructure

*Maintenance windows do not imply fixed downtime.*
Downtime (complete service unavailability) is extremely rare and we design our processes to avoid it.
Routine maintenance work involves transient service unavailability at most, and in most cases users are not barred from using the system during that time, though they are given notice to save their work and there is always a small chance of unforeseen problems.
In some environments a co-ordinator is assigned to announce work start and work end and field any questions.

Current fixed maintenance windows (for applicable services/deployments) are:

- Telescope environments: Typically 1st Wednesday of every month during lunchtime at the observatory summit (13:00 Chile local), confirmed with the telescope software configuration manager.

- Any other deployment subject to a maintenance window: Weekly, Thursday afternoons (15:00 Pacific), colloquially referred to as "Patch Thursday".

Again, these are not scheduled downtimes. In the event that extended service downtime is needed in a production service (extremely rare), work would be scheduled with ample notice and co-ordination with stakeholders, and/or at a time when disruption is minimized.

Here is a chart showing the current settled-upon practice in select areas:


.. raw:: html
   :file: table1.html

Container Environments in the Science Platform
===============================================

The above discussion pertains to services, i.e. codebases where an error could affect a service's availability. When it comes to containers made available *by* a service (e.g. in nublado), we are less risk-averse, since users, by design, can always fall back on a previously usable container in case of problems. We recognize that the Science Platform is a primary user environment and as such users do not wish to wait a week for some process to take its course in order to get a requested package or feature. We currently provide (and anticipate continuing to provide) containers labeled "experimental" to rapidly service ad-hoc user requests, for example, in addition to generally available daily builds.

How to reconcile this user-first orientation with the issue of scientific reproducibility is a matter for a future technote.

The Recommended Container
-------------------------
The steps are:

#. During an advertised maintenance window, e.g. "Patch Thursday", the proposed image will be promoted as recommended.
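
As a loose illustration of what "promoted as recommended" amounts to in gitops terms, here is a hypothetical values-file fragment (the field name and tags are illustrative, not the actual nublado configuration schema):

.. code-block:: yaml

   # Hypothetical configuration fragment. Promoting a container to
   # "recommended" is a small, reviewable change in the gitops repository,
   # merged and synced during the maintenance window.
   images:
     recommendedTag: "w_2024_50"   # was "w_2024_48"; illustrative weekly tags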


.. .. rubric:: References
.. Make in-text citations with: :cite:`bibkey`.
