GitHub action to automate resource index rebuilding #747
Closed
This sets out the pattern for reading S3 inventories and turning them into resource collections. The JSON output will ultimately be used by nextstrain.org both to provide a listing of available resources and to be queried by versioned dataset requests (in order to go from a requested date to the corresponding S3 version IDs of the relevant objects). Eventually this flat JSON file may be replaced with a database, but for now this is a simple way to introduce the functionality.

The collected resources JSON for core + staging is a ~3.2 MB JSON file (gzipped). When naively loaded into node it increases the total size of the allocated heap (V8) by ~60 MB (presumably this would be reduced by mapping certain string constants to variables).

Currently this only works for the S3 buckets nextstrain-data and nextstrain-staging. Narratives are not yet considered, in part because they are not stored on S3.

See `node resourceIndexer/main.js --help` for how to run. AWS credentials with permission to read s3://nextstrain-inventories will need to be set in the usual way.
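As a rough illustration of the "inventory rows to resource collection" step described above, here is a minimal sketch. The row shape (`bucket`, `key`, `versionId`, `isDeleteMarker`, `lastModified`) is an assumption loosely modelled on S3 inventory fields, `collectVersions` is a hypothetical name, and skipping delete markers is a simplification; none of this is taken from the PR's actual `resourceIndexer` code.

```javascript
// Hypothetical row shape (an assumption, not the PR's real data model):
//   { bucket, key, versionId, isDeleteMarker, lastModified }
// where lastModified is an ISO 8601 UTC timestamp string.

// Group inventory rows by object and list each object's versions, newest first.
function collectVersions(rows) {
  const byResource = new Map();
  for (const row of rows) {
    if (row.isDeleteMarker) continue; // simplification: ignore delete markers
    const id = `${row.bucket}/${row.key}`;
    if (!byResource.has(id)) byResource.set(id, []);
    byResource.get(id).push(row);
  }

  const index = {};
  for (const [id, versions] of byResource) {
    // ISO 8601 UTC timestamps in the same format sort correctly as strings.
    versions.sort((a, b) =>
      a.lastModified < b.lastModified ? 1 : a.lastModified > b.lastModified ? -1 : 0
    );
    index[id] = versions.map((v) => ({
      date: v.lastModified.slice(0, 10), // YYYY-MM-DD
      versionId: v.versionId,
    }));
  }
  return index;
}
```

Keeping each object's versions sorted newest-first means a later date lookup only needs to scan for the first entry at or before the requested day.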
Parses a pre-computed index JSON and stores the data in-memory (on the nextstrain.org server). ResourceVersions is a class which dataset/narrative requests can use to get the (versioned) file URLs for available subresources which were present for a given YYYY-MM-DD. Eventually these data will be used to list available resources via some API.

The pre-computed JSON (see previous commit) is expected to be at a predefined S3 location, or a local file may be used by setting the ENV variable `RESOURCE_INDEX="./path/to/json"`. Resource collection parsing may be skipped entirely by setting `RESOURCE_INDEX="false"`.
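The three `RESOURCE_INDEX` behaviours described above (default S3 location, local file, or skip) could be resolved along these lines. This is only a sketch: `resolveIndexSource`, the returned `{kind, location}` shape, and the placeholder default location are all assumptions for illustration, since the commit message does not state the real S3 path or how the server reads the setting.

```javascript
// Placeholder only: the real predefined S3 location is not given in the commit message.
const DEFAULT_INDEX_LOCATION = "s3://example-bucket/example-index.json.gz";

// Decide where the resource index should come from, given an env-like object.
function resolveIndexSource(env) {
  const value = env.RESOURCE_INDEX;
  if (value === undefined || value === "") {
    // Unset: fall back to the predefined S3 location.
    return { kind: "s3", location: DEFAULT_INDEX_LOCATION };
  }
  if (value === "false") {
    // Explicit opt-out: skip resource collection parsing entirely.
    return { kind: "skip" };
  }
  // Anything else is treated as a path to a local JSON file.
  return { kind: "file", location: value };
}
```

In a server this would typically be called once at startup as `resolveIndexSource(process.env)`.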
…son.gz Based on changes @jameshadfield made in the AWS Console, but stripped down to just the single object necessary by the current consuming code.
This character is reserved for identifying the version descriptor, which will be implemented in the following commit.
Nextstrain URLs are extended to allow <path>@<version> syntax for core datasets. Currently the <version> must be in YYYY-MM-DD format. The returned version is the one which was the latest on the requested day. If the requested version predates any datasets we return 404. Note that the current implementation (via previous commits) uses S3 versioning, not our datestamped datasets, although the two concepts may appear similar in the URL.
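To make the lookup rule concrete, the following sketch parses a `<path>@<version>` string and picks the version that was the latest on the requested day. Both function names and the `{date, versionId}` entry shape are hypothetical, and the sketch assumes each resource's versions are already sorted newest-first; it is not the ResourceVersions implementation from this PR.

```javascript
// Split "<path>@<version>" into its parts; version is null when no "@" is present.
function parseVersionedPath(urlPath) {
  const at = urlPath.indexOf("@");
  if (at === -1) return { path: urlPath, version: null };
  const path = urlPath.slice(0, at);
  const version = urlPath.slice(at + 1);
  if (!/^\d{4}-\d{2}-\d{2}$/.test(version)) {
    throw new Error(`Version must be in YYYY-MM-DD format, got "${version}"`);
  }
  return { path, version };
}

// versions: [{ date: "YYYY-MM-DD", versionId }] sorted newest first (assumed).
// Returns the versionId that was latest on requestedDate, or null if the
// requested date predates every known version (the caller would respond 404).
function versionOnDate(versions, requestedDate) {
  // YYYY-MM-DD strings compare correctly as plain strings.
  const match = versions.find((v) => v.date <= requestedDate);
  return match ? match.versionId : null;
}
```

Because the list is newest-first, `find` also returns the last upload of the day when several versions share the same date.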
Includes documentation of the AWS changes which are not under terraform control, as well as an introduction to the general concept.
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 3, 2023 03:41, Inactive)
6 tasks
jameshadfield commented Nov 3, 2023
jameshadfield force-pushed the james/resources-automate branch from ea51b50 to 1db71e0 (November 5, 2023 20:55)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 5, 2023 20:55, Inactive)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 6, 2023 20:43, Inactive)
jameshadfield force-pushed the james/resources-automate branch 2 times, most recently from 1db71e0 to 73accc2 (November 6, 2023 20:52)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 6, 2023 20:53, Inactive)
jameshadfield force-pushed the james/resources-automate branch from 73accc2 to 5733c15 (November 6, 2023 20:55)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 6, 2023 20:55, Inactive)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 6, 2023 20:55, Inactive)
tsibley reviewed Nov 6, 2023
tsibley reviewed Nov 6, 2023
7 tasks
victorlin referenced this pull request in nextstrain/docs.nextstrain.org Nov 6, 2023
Work around GitHub Actions' poor dev cycle for new workflows introduced on branches. Actual content of the workflow to follow in the next commit. The workaround is this:
1. Push a new but invalid workflow (such as an empty file) to a branch. GitHub notices the new workflow is invalid and creates an errored out "run" of the workflow to report the problem. This "run" has the side effect of "registering" the new workflow and making it visible to the UI, REST API, etc.
2. Push the actual workflow to the branch.
3. Manually test the workflow on the branch by triggering workflow_dispatch events using the REST API.
Step 1 is crucial because of the side effects it produces. If you skip it, GitHub's UI, REST API, etc. won't know about the new workflow on the branch, and it won't be triggerable.
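Step 3 relies on GitHub's "create a workflow dispatch event" REST endpoint (`POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches`). A minimal sketch of building such a request follows; the helper name `buildDispatchRequest`, the owner/repo placeholders, and the workflow file name are assumptions for illustration, not values taken from this PR.

```javascript
// Build the URL and fetch() options for a workflow_dispatch request.
// `ref` is the branch (or tag) holding the workflow version to run.
function buildDispatchRequest({ owner, repo, workflowFile, ref, token, inputs = {} }) {
  const url =
    `https://api.github.com/repos/${owner}/${repo}` +
    `/actions/workflows/${encodeURIComponent(workflowFile)}/dispatches`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ ref, inputs }),
    },
  };
}

// Usage (hypothetical names; needs a token allowed to trigger workflows):
//   const { url, options } = buildDispatchRequest({
//     owner: "example-org",
//     repo: "example-repo",
//     workflowFile: "index-resources.yml",
//     ref: "james/resources-automate",
//     token: process.env.GITHUB_TOKEN,
//   });
//   const response = await fetch(url, options); // expect HTTP 204 on success
```

Passing the branch name as `ref` is what lets the workflow be exercised from the branch before it is merged.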
jameshadfield force-pushed the james/resources-automate branch from 1fce9ab to 53b64de (November 6, 2023 22:37)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 6, 2023 22:37, Inactive)
See added documentation for corresponding AWS details. Given that the resources all come from public-facing buckets (core & staging) it seems ok to run this from a public repo, but we may want to revisit this once we start consuming private data.
These changes were made as part of the automation of resource indexing, largely to enable backups of the index by versioning the bucket.
jameshadfield force-pushed the james/resources-automate branch from 53b64de to 0256b44 (November 6, 2023 22:42)
nextstrain-bot temporarily deployed to nextstrain-s-james-reso-rajal9 (November 6, 2023 22:42, Inactive)
tsibley approved these changes Nov 9, 2023
tsibley reviewed Nov 9, 2023
jameshadfield force-pushed the james/resources branch from 0b56c84 to c0ee239 (January 4, 2024 03:30)
The commits in this PR have been combined and shifted to #719. The changes since this PR are:
GitHub action to rebuild the resource collection index.
The Action had initially not been tested; it has since run successfully, and the uploaded index was tested locally.