GitHub action to automate resource index rebuilding #747

Closed
jameshadfield wants to merge 8 commits

jameshadfield
Member

@jameshadfield jameshadfield commented Nov 3, 2023

GitHub action to rebuild the resource collection index. The action ran successfully and the uploaded index was tested locally.

jameshadfield and others added 6 commits November 3, 2023 15:20
This sets out the pattern for reading S3 inventories and turning them
into resource collections. The JSON output will ultimately be used by
nextstrain.org to both provide a listing of available resources and to
be queried by versioned dataset requests (in order to go from a
requested date to the corresponding S3 version IDs of the relevant
objects).
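
The inventory-to-collection pattern described above might look roughly like the sketch below. The row fields (`key`, `versionId`, `lastModified`, `isDeleteMarker`) and the function name `groupInventoryRows` are assumptions made for illustration; they are not taken from the indexer itself.

```javascript
// Group (hypothetical) inventory rows by object key, newest version first.
// Returns a Map of key -> [{ date, versionId, lastModified }, ...].
function groupInventoryRows(rows) {
  const index = new Map();
  for (const row of rows) {
    // A delete marker has no object behind it, so there is nothing to serve.
    if (row.isDeleteMarker) continue;
    if (!index.has(row.key)) index.set(row.key, []);
    index.get(row.key).push({
      date: row.lastModified.slice(0, 10), // YYYY-MM-DD prefix of an ISO 8601 timestamp
      versionId: row.versionId,
      lastModified: row.lastModified,
    });
  }
  for (const versions of index.values()) {
    // ISO 8601 timestamps in one timezone sort chronologically as plain strings.
    versions.sort((a, b) =>
      a.lastModified < b.lastModified ? 1 : a.lastModified > b.lastModified ? -1 : 0
    );
  }
  return index;
}
```

A `Map` avoids surprises with keys such as `constructor`; `Object.fromEntries(index)` converts it to a plain object before serialising to JSON.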

Eventually this flat JSON file may be replaced with a database,
but for now this is a simple way to introduce the functionality. The
collected resources JSON for core + staging is a ~3.2 MB file when
gzipped. When naively loaded into Node it increases the total size of
the allocated V8 heap by ~60 MB (presumably this would be reduced by
mapping certain string constants to variables).

Currently only working for S3 buckets nextstrain-data and
nextstrain-staging. Narratives are not yet considered, in part because
they are not stored on S3.

Run `node resourceIndexer/main.js --help` for usage. AWS credentials
with permission to read s3://nextstrain-inventories will need to be set
in the usual way.

Parses a pre-computed index JSON and stores the data in-memory (on the
nextstrain.org server). ResourceVersions is a class which
dataset/narrative requests can use to get the (versioned) file URLs
for available subresources which were present for a given YYYY-MM-DD.
Eventually these data will be used to list available resources via some
API.
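
As a rough illustration of the date lookup, the sketch below picks the version that was current on a given day and builds a versioned S3 URL for it. The function names and the `{ date, versionId }` shape are assumptions and do not reflect the actual ResourceVersions class; `versionId` is the standard S3 query parameter for requesting a specific object version.

```javascript
// Assumed shape: versions is [{ date: "YYYY-MM-DD", versionId: string }],
// sorted newest first. Returns the newest entry on or before requestedDate,
// or null if the requested date predates every version.
function versionAsOf(versions, requestedDate) {
  // YYYY-MM-DD strings compare chronologically with plain string comparison.
  const match = versions.find((v) => v.date <= requestedDate);
  return match || null;
}

// Build a virtual-hosted-style S3 URL pinned to one object version.
function versionedUrl(bucket, key, versionId) {
  const url = new URL(`https://${bucket}.s3.amazonaws.com/${key}`);
  url.searchParams.set("versionId", versionId);
  return url.toString();
}
```

Because the list is newest first, several uploads on the same day resolve to the last one made that day.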

The pre-computed JSON (see previous commit) is expected to be at a
predefined S3 location; a local file may be used instead by setting the
environment variable `RESOURCE_INDEX="./path/to/json"`. Resource
collection parsing may be skipped entirely by setting
`RESOURCE_INDEX="false"`.
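
One way the three cases could be distinguished is sketched here. The function name, the returned object shape, and the placeholder default location are all hypothetical; the real S3 location is not stated in this message.

```javascript
// Placeholder only: the actual predefined S3 location is not given here.
const DEFAULT_INDEX_LOCATION = "s3://<bucket>/<key>";

// Decide where the resource index comes from, based on RESOURCE_INDEX.
function resolveIndexSource(env) {
  const value = env.RESOURCE_INDEX;
  if (value === "false") return { kind: "skip" }; // parsing disabled entirely
  if (value) return { kind: "local", path: value }; // any other value is a file path
  return { kind: "s3", location: DEFAULT_INDEX_LOCATION }; // unset: use the default
}
```

In a server this would be called as `resolveIndexSource(process.env)`; taking the environment as an argument keeps the function easy to test.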
…son.gz

Based on changes @jameshadfield made in the AWS Console, but stripped
down to just the single object needed by the current consuming code.

The `@` character is reserved for identifying the version descriptor,
which will be implemented in the following commit.
Nextstrain URLs are extended to allow <path>@<version> syntax for core
datasets. Currently the <version> must be in YYYY-MM-DD format. The
returned version is the one which was the latest on the requested day.
If the requested version predates any datasets we return 404.

Note that the current implementation (via previous commits) uses S3
versioning, not our datestamped datasets, although the two concepts may
appear similar in the URL.
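
A minimal sketch of splitting such a URL path into its dataset path and version descriptor follows. The function name and error handling are assumptions for illustration, not the routing code in these commits.

```javascript
// Split "<path>@<version>" into its parts. A path without "@" has no version.
function parseVersionedPath(urlPath) {
  const at = urlPath.indexOf("@");
  if (at === -1) return { path: urlPath, version: null };
  const path = urlPath.slice(0, at);
  const version = urlPath.slice(at + 1);
  // Checks the YYYY-MM-DD shape only, not whether it is a real calendar date.
  if (!/^\d{4}-\d{2}-\d{2}$/.test(version)) {
    throw new Error(`Version must be in YYYY-MM-DD format, got "${version}"`);
  }
  return { path, version };
}
```

For example, `parseVersionedPath("/zika@2023-11-03")` yields the path `/zika` and the version `2023-11-03`. A caller would then look up the version that was latest on that day and respond with 404 if none exists.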
Includes documentation of the AWS changes which are not under Terraform
control, as well as an introduction to the general concept.
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 5, 2023 20:55 Inactive
@jameshadfield jameshadfield marked this pull request as ready for review November 5, 2023 20:55
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 6, 2023 20:43 Inactive
@jameshadfield jameshadfield force-pushed the james/resources-automate branch 2 times, most recently from 1db71e0 to 73accc2 Compare November 6, 2023 20:52
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 6, 2023 20:53 Inactive
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 6, 2023 20:55 Inactive
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 6, 2023 20:55 Inactive
.github/workflows/resource-indexer.yml (outdated, resolved)
docs/resource-collection.rst (resolved)
.github/workflows/resource-indexer.yml (outdated, resolved)
victorlin referenced this pull request in nextstrain/docs.nextstrain.org Nov 6, 2023
Work around GitHub Actions' poor dev cycle for new workflows introduced
on branches.  Actual content of the workflow to follow in the next
commit.

The workaround is this:

 1. Push a new but invalid workflow (such as an empty file) to a branch.

    GitHub notices the new workflow is invalid and creates an errored
    out "run" of the workflow to report the problem.  This "run" has the
    side effect of "registering" the new workflow and making it visible
    to the UI, REST API, etc.

 2. Push the actual workflow to the branch.

 3. Manually test the workflow on the branch by triggering
    workflow_dispatch events using the REST API.

Step 1 is crucial because of the side effects it produces.  If you skip
it, GitHub's UI, REST API, etc. won't know about the new workflow on the
branch, and it won't be triggerable.
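
Step 3 could be scripted along the lines below, using GitHub's documented "create a workflow dispatch event" REST endpoint. The function names and the parameter object are illustrative assumptions; the token needs permission to run Actions on the repository, and the global `fetch` assumes Node 18 or later.

```javascript
// Build the request for POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches.
// `workflow` may be the workflow file name; `ref` is the branch to run it on.
function buildDispatchRequest({ owner, repo, workflow, ref, token }) {
  const url =
    `https://api.github.com/repos/${owner}/${repo}` +
    `/actions/workflows/${encodeURIComponent(workflow)}/dispatches`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
        "X-GitHub-Api-Version": "2022-11-28",
      },
      body: JSON.stringify({ ref }),
    },
  };
}

// Send the request; GitHub answers 204 No Content when the event is accepted.
async function triggerWorkflow(params) {
  const { url, options } = buildDispatchRequest(params);
  const response = await fetch(url, options);
  if (response.status !== 204) {
    throw new Error(`Dispatch failed: ${response.status} ${await response.text()}`);
  }
}
```

The workflow must declare a `workflow_dispatch` trigger, and it must already be registered as described in step 1, otherwise the endpoint will not find it.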
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 6, 2023 22:37 Inactive
See added documentation for corresponding AWS details. Given that the
resources all come from public-facing buckets (core & staging) it seems
ok to run this from a public repo, but we may want to revisit this once
we start consuming private data.
These changes were made as part of the automation of resource indexing,
largely to enable backups of the index by versioning the bucket.
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 November 6, 2023 22:42 Inactive
@jameshadfield jameshadfield requested a review from a team November 6, 2023 22:48
@tsibley tsibley self-requested a review November 7, 2023 22:33
@jameshadfield
Member Author

The commits in this PR have been combined and shifted to #719. The changes since this PR are:

  • Action filename renamed as suggested
  • The added documentation slightly changed after editing
  • The indexer is now run with the additional arguments `--resourceTypes dataset --collections core`, as that is the only data the server currently uses for the functionality introduced in #719 (Allow versioned resource access for core datasets). These arguments will be removed as we expand the functionality of nextstrain.org to use further information in the index.

5 participants