diff --git a/docs/index.rst b/docs/index.rst
index 93a6fe004..526d7f6b8 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,4 +31,5 @@ nextstrain.org
  routing
  infrastructure
  terraform
+ resource-collection
  glossary
diff --git a/docs/resource-collection.rst b/docs/resource-collection.rst
new file mode 100644
index 000000000..afcc39d21
--- /dev/null
+++ b/docs/resource-collection.rst
@@ -0,0 +1,92 @@
+===================
+Resource Collection
+===================
+
+For nextstrain.org to handle URLs with ``@YYYY-MM-DD`` identifiers, the
+server needs to be aware of which files exist, including past versions.
+In the future this data will also be used to list and display all available
+resources (and their versions) to the user.
+
+The index is generated by a script and the resulting JSON file is loaded by the
+server at start time. The server can be told to ignore resource collections by
+setting the env variable ``RESOURCE_INDEX="false"`` (or the equivalent in a config file).
+
+
+Local development
+=================
+
+The index creation script can be run locally to produce a local JSON
+file; see ``./resourceIndexer/main.js`` for more details.
+
+To have the server use this file, set the env variable ``RESOURCE_INDEX`` to
+point to the (JSON) file.
+
+
+Automated index generation
+==========================
+
+*This section will be updated once the
+index creation is automated.*
+
+AWS settings necessary for resource collection
+==============================================
+
+The index creation, storage and retrieval require certain AWS settings which
+are documented here as most of them are not under terraform control. We use `S3
+inventories
+`__
+to list all the documents in certain buckets (or bucket prefixes); AWS
+generates these inventories daily. The index creation script will download the
+inventories and use them to create an index JSON which it uploads to S3. The
+nextstrain.org server will access this JSON from S3.
+
+S3 inventories
+--------------
+
+We currently produce inventories for the core (s3://nextstrain-data) and
+staging (s3://nextstrain-staging) buckets. These inventories are generated
+daily and published to s3://nextstrain-inventories, which is a private
+bucket. The inventory
+configuration can be found in the AWS console for
+`core `__
+and
+`staging `__.
+The config specifies that additional metadata fields for last modified
+and ETag are to be included in the inventory. The inventories for core and
+staging are published to
+s3://nextstrain-inventories/nextstrain-data/config-v1 and
+s3://nextstrain-inventories/nextstrain-staging/config-v1, respectively.
+The cost of these is minimal (less than $1/bucket/year).
+
+A lifecycle rule on the s3://nextstrain-inventories bucket (`console
+link `__)
+deletes all inventory-related files 30 days after they are created.
+
+Index creation (Inventory access and index upload)
+--------------------------------------------------
+
+**Automated index generation**
+
+*This section will be updated once the
+index creation is automated.*
+
+**Local index generation for development purposes**
+
+For local index generation (e.g. during development) you will need IAM
+credentials which can list and get objects from s3://nextstrain-inventories; if
+you want finer-grained access for local index creation, you can restrict access to
+certain prefixes in that bucket; for instance, ``nextstrain-data/config-v1`` and
+``nextstrain-staging/config-v1`` correspond to the core and staging buckets,
+respectively.
+
+To upload the index you will need write access to
+s3://nextstrain-inventories/resources.json.gz. Note that this is not necessary
+if you only need the index for local development (see `Local development`_).
+
+
+Index access by the server
+--------------------------
+
+IAM users ``nextstrain.org`` and ``nextstrain.org-testing``, which are under
+terraform control, have read access to
+s3://nextstrain-inventories/resources.json.gz via their associated policies.
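
The new page describes three ways ``RESOURCE_INDEX`` can be used: set to ``"false"`` to ignore resource collections, set to a local JSON file for development, or otherwise served from S3. The following is a minimal sketch of how those cases might be told apart. It is not the actual nextstrain.org server code: the function name ``resolveResourceIndex``, the shape of the returned object, and the fallback to the S3 object when the variable is unset are all assumptions made for illustration.

```javascript
// Hypothetical helper, written for illustration only.
// The S3 location is the one named in the docs page; treating it as the
// default when RESOURCE_INDEX is unset is an assumption.
const DEFAULT_INDEX = "s3://nextstrain-inventories/resources.json.gz";

function resolveResourceIndex(value) {
  // Unset or empty: assume the server reads the index from S3.
  if (value === undefined || value === "") {
    return { enabled: true, source: "s3", location: DEFAULT_INDEX };
  }
  // The literal string "false" tells the server to ignore resource collections.
  if (value === "false") {
    return { enabled: false, source: null, location: null };
  }
  // An explicit s3:// URL is passed through unchanged (assumed behaviour).
  if (value.startsWith("s3://")) {
    return { enabled: true, source: "s3", location: value };
  }
  // Anything else is treated as a path to a locally generated JSON file.
  return { enabled: true, source: "file", location: value };
}

// Usage: inspect how the current environment would be interpreted.
console.log(resolveResourceIndex(process.env.RESOURCE_INDEX));
```

Keeping the decision in one pure function makes the three cases easy to check without touching S3 or the filesystem.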