[docs] resource collection docs

Includes documentation of the AWS changes which are not under terraform control, as well as a general introduction to the concept.
nextstrain · Nov 3, 2023 · d10c484
1 parent 0acaaf4
commit d10c484

Showing 2 changed files with 93 additions and 0 deletions.
diff --git a/docs/index.rst b/docs/index.rst
@@ -31,4 +31,5 @@ nextstrain.org
     routing
     infrastructure
     terraform
+    resource-collection
     glossary
diff --git a/docs/resource-collection.rst b/docs/resource-collection.rst
@@ -0,0 +1,92 @@
+===================
+Resource Collection
+===================
+
+For nextstrain.org to handle URLs with ``@YYYY-MM-DD`` identifiers, the
+server needs to know which files exist, including past versions.
+In the future this data will also be used to list and display all available
+resources (and their versions) to the user.
+
+The index is generated by a script, and the server loads the resulting JSON
+file at start time. To make the server ignore resource collections, set the
+environment variable ``RESOURCE_INDEX="false"`` (or the equivalent in a config file).
+
+
+Local development
+=================
+
+The index creation script can be run locally, which produces a local JSON
+file; see ``./resourceIndexer/main.js`` for more details.
+
+To use this file from the server, set the environment variable
+``RESOURCE_INDEX`` to the path of the (JSON) file.
+
+
+Automated index generation
+==========================
+
+*This section will be updated once the
+index creation is automated.*
+
+AWS settings necessary for resource collection
+==============================================
+
+Index creation, storage and retrieval require certain AWS settings, which
+are documented here because most of them are not under terraform control. We use `S3
+inventories
+<https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html>`__,
+which AWS generates daily, to list all the documents in certain buckets (or
+bucket prefixes). The index creation script downloads these inventories and
+uses them to create an index JSON, which it uploads to S3. The
+nextstrain.org server accesses this JSON from S3.
+
+S3 inventories
+--------------
+
+We currently produce inventories for the core (s3://nextstrain-data) and
+staging (s3://nextstrain-staging) buckets. AWS generates them daily and
+publishes them to s3://nextstrain-inventories, which is a private
+bucket. The inventory
+configuration can be found in the AWS console for
+`core <https://s3.console.aws.amazon.com/s3/management/nextstrain-data/inventory/view?region=us-east-1&id=config-v1>`__
+and
+`staging <https://s3.console.aws.amazon.com/s3/management/nextstrain-staging/inventory/view?region=us-east-1&id=config-v1>`__.
+The config specifies that additional metadata fields for last modified
+and ETag are to be included in the inventory. The inventories for core &
+staging are published to
+s3://nextstrain-inventories/nextstrain-data/config-v1 and
+s3://nextstrain-inventories/nextstrain-staging/config-v1, respectively.
+The cost of these is minimal (less than $1/bucket/year).
+
+A lifecycle rule on the s3://nextstrain-inventories bucket (`console
+link <https://s3.console.aws.amazon.com/s3/management/nextstrain-inventories/lifecycle/view?region=us-east-1&id=delete+stale+inventories>`__)
+deletes all inventory-related files 30 days after they are created.
+
+Index creation (Inventory access and index upload)
+--------------------------------------------------
+
+**Automated index generation**
+
+*This section will be updated once the
+index creation is automated.*
+
+**Local index generation for development purposes**
+
+For local index generation (e.g. during development) you will need IAM
+credentials which can list and get objects from s3://nextstrain-inventories. If
+you want finer-grained access for local index creation, you can restrict access to
+certain prefixes in that bucket; for instance, ``nextstrain-data/config-v1`` and
+``nextstrain-staging/config-v1`` correspond to the core and staging buckets,
+respectively.
+
+To upload the index you will need write access to
+s3://nextstrain-inventories/resources.json.gz. This is not necessary if you
+only need the index for local development (see `Local development`_).
+
+
+Index access by the server
+--------------------------
+
+IAM users ``nextstrain.org`` and ``nextstrain.org-testing``, which are under
+terraform control, have read access to
+s3://nextstrain-inventories/resources.json.gz via their associated policies.
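
The docs added in this diff describe three kinds of value for ``RESOURCE_INDEX``: the string ``"false"`` (ignore resource collections), a path to a locally generated JSON file, and the index stored at s3://nextstrain-inventories/resources.json.gz. Below is a minimal JavaScript sketch of how such a setting might be interpreted. The function name ``resolveResourceIndex``, the shape of the returned object, the handling of a boolean ``false`` from a config file, and the fallback to the S3 object when the variable is unset are all assumptions made for illustration; none of them is taken from the nextstrain.org code base.

```javascript
// Hypothetical helper, not part of the nextstrain.org server.
// The S3 location is the object named in the docs; using it as the
// default when RESOURCE_INDEX is unset is an assumption.
const DEFAULT_INDEX = "s3://nextstrain-inventories/resources.json.gz";

/**
 * Interpret a RESOURCE_INDEX value (env variable or config entry).
 * Returns { enabled, source, location } where source is "s3", "file" or null.
 */
function resolveResourceIndex(value) {
  // Documented: "false" makes the server ignore resource collections.
  // Assumption: a config file may supply a boolean false instead.
  if (value === "false" || value === false) {
    return { enabled: false, source: null, location: null };
  }
  // Assumption: unset or empty means "use the index stored on S3".
  if (value === undefined || value === null || value === "") {
    return { enabled: true, source: "s3", location: DEFAULT_INDEX };
  }
  // Assumption: an explicit s3:// URL is also accepted.
  if (String(value).startsWith("s3://")) {
    return { enabled: true, source: "s3", location: String(value) };
  }
  // Documented: otherwise the value points to a local JSON file.
  return { enabled: true, source: "file", location: String(value) };
}

// Example: inspect how the current environment would be interpreted.
console.log(resolveResourceIndex(process.env.RESOURCE_INDEX));
```

Keeping the interpretation in one small pure function makes the three cases easy to check in isolation before any file or S3 access takes place.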