Allow versioned resource access for core datasets #719

jameshadfield · 2023-10-04T01:33:39Z

See commit messages for details. Some URLs as an example:

https://nextstrain-s-james-reso-cterli.herokuapp.com/enterovirus/d68/vp1 -- the current dataset with 2395 genomes. Uploaded 2022-09-06.
https://nextstrain-s-james-reso-cterli.herokuapp.com/enterovirus/d68/vp1@2022-09-06 -- The same as the above URL, but specifying the version (so the dataset won't change over time)
https://nextstrain-s-james-reso-cterli.herokuapp.com/enterovirus/d68/vp1@2022-05-31 -- 2345 genomes. Includes a v2 main JSON, tip-frequencies & root-sequence sidecars. (If you are watching the server logs you can see the S3 version IDs for these files as they are fetched, but these are not exposed to the client.)
https://nextstrain-s-james-reso-cterli.herokuapp.com/enterovirus/d68/vp1@2022-05-30 -- doesn't exist. Will redirect to 2022-05-31 (above)
https://nextstrain-s-james-reso-cterli.herokuapp.com/enterovirus/d68/vp1@2019-09-17 -- 1653 genomes. A v1 (meta + tree) dataset, with no sidecars.

Ready for review, but a few things remain before this can be merged:

More testing needed / tests added?
Run the manifest generation script somewhere (GitHub Actions in a private repo, on a schedule?) and upload the results to S3. This can be done after merging, but we need a plan for how to implement it. See "GitHub action to automate resource index rebuilding" #747.
Update the manifest on the server. The easiest approach is probably to re-fetch it on a 1-hour interval and check for ETag changes.
Remove console.log statements.
Ensure the IAM Terraform configs have been applied to AWS. Update: these were added in 1019326, and as of 2024-01-02 AWS has the correct permissions.
As of 2023-11-07 (NZ), versioning has been enabled on nextstrain-inventories. Rerun the indexer in a day or two, once past versions of the inventory files have been deleted and thus exist as versions. (I don't expect any changes.)
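
The hourly ETag check mentioned in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: it assumes the index is served over HTTP(S) by something that honours `If-None-Match`, and the names `refreshIndex`, `startPolling`, `fetchFn` and the `state` shape are all hypothetical.

```javascript
// Hypothetical sketch: refresh an in-memory index only when its ETag changes.
// `fetchFn` is injected so the logic can be exercised without network access;
// in Node 18+ the global `fetch` could be passed in.
async function refreshIndex(state, url, fetchFn) {
  const headers = state.etag ? { "If-None-Match": state.etag } : {};
  const response = await fetchFn(url, { headers });

  // 304 Not Modified: the copy we already hold is still current.
  if (response.status === 304) return { ...state, changed: false };
  if (!response.ok) throw new Error(`Index fetch failed: ${response.status}`);

  return {
    etag: response.headers.get("etag"),
    data: await response.json(),
    changed: true,
  };
}

// Poll on a fixed interval (one hour by default). On any error the previous
// index is kept and the failure is logged.
function startPolling(url, fetchFn, onUpdate, intervalMs = 60 * 60 * 1000) {
  let state = { etag: null, data: null };
  const tick = async () => {
    try {
      const next = await refreshIndex(state, url, fetchFn);
      if (next.changed) onUpdate(next.data);
      state = { etag: next.etag, data: next.data };
    } catch (err) {
      console.error(`Keeping previous index: ${err.message}`);
    }
  };
  tick();
  const timer = setInterval(tick, intervalMs);
  return () => clearInterval(timer); // call to stop polling
}
```

Sending the previous ETag means an unchanged index costs only a small 304 response rather than a full download and re-parse.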

Inline review threads: src/sources/models.js; src/manifest.js; aws/iam/policy/NextstrainDotOrgServerInstanceDev.json; src/manifest.js; src/sources/community.js

tsibley · 2023-10-10T22:11:58Z

Talked with James in our 1:1. This is ready for me to review. He chose not to implement a new ResourceVersion class (as in Source → Resource → ResourceVersion → SubResource) as we'd talked about two weeks ago, but instead make the existing Resource class version-aware.

tsibley · 2023-10-16T19:48:05Z

I rebased this onto current master and made the appropriate changes to the server's IAM policies managed by Terraform: 4a0367f. I also dropped 5a33a9e, which was marked for dropping and which we don't want to keep.

tsibley · 2023-10-17 (review: requested changes)

I read thru the commits today in reverse order. Didn't quite make it thru the first commit (last in order I reviewed), the one introducing the manifest/ directory. I'll finish that in the next day or so but wanted to checkpoint my comments so far. A bunch of stuff to comment on, but this feels mostly headed in the right direction.

Inline review threads: src/endpoints/sources.js; manifest/constants.js; manifest/main.js

tsibley · 2023-10-18T17:25:28Z

@jameshadfield Thinking about my review and suggestions some more, I wanted to offer to implement some of the larger changes I've suggested and then walk you thru them if it'd be more clear for me to do that rather than just discuss them. Not a problem if not. Totally your preference here.

jameshadfield · 2023-10-19T22:13:12Z

@tsibley could you re-review this at your convenience? All comments should have been addressed, except for (i) log statements which I'll remove later and (ii) the definition of "closest" / redirect to closest code which I'll defer until the wider group makes a decision on desired behavior here. The S3 resources index has been updated to the new format.

> I wanted to offer to implement some of the larger changes I've suggested and then walk you thru them if it'd be more clear for me to do that rather than just discuss them

Thanks, but I'd prefer to discuss the concepts rather than read through an implementation. For what it's worth, none of the changes were large in terms of code changes, but conceptually I guess there are a few changes.

tsibley · 2023-11-01T17:23:47Z

Some other todos:

Indexer deployment plan. See "GitHub action to automate resource index rebuilding" #747.
nextstrain-inventories bucket in Terraform. Moved to "[resource collection] Terraform for related AWS settings" #748.
Lifecycle for expiring inventories in nextstrain-inventories in Terraform. Moved to "[resource collection] Terraform for related AWS settings" #748.
Inventory config for nextstrain-data in Terraform. Moved to "[resource collection] Terraform for related AWS settings" #748.
Inventory config for nextstrain-staging in Terraform. Moved to "[resource collection] Terraform for related AWS settings" #748.
Adopt AWS config support added to master since this was branched. Probably swap away from v2 SDK and add necessary functions to src/s3.js?
Interaction with PUT/DELETE/OPTIONS methods is likely wrong. Not an immediate problem because while we have endpoints for those methods on core/staging, we don't use them. But as written, the interaction/behaviour if they did run is incorrect, and it will be an issue for Groups or if we start using those methods for core/staging (which I think we should). This has been moved to "[resource collection] Interaction with PUT/DELETE/OPTIONS methods is likely wrong" #744.

Some of these are deferrable but they all need doing sooner rather than later.

jameshadfield · 2023-11-03T03:49:47Z

@tsibley could you work through your comments when you have a chance? I believe I've addressed them all. (And please resolve the ones you feel are addressed, else with this many comments the PR becomes hard to read through.)

> Adopt AWS config support added to master since this was branched. Probably swap away from v2 SDK and add necessary functions to src/s3.js?

Could you add an in-line comment regarding this please. I presume you're talking about the AWS calls within updateResourceVersions?

Inline review thread: src/resourceIndex.js

tsibley · 2023-11-07T01:20:42Z

@jameshadfield Yep. Was updating myself a bit on this today. Will continue to dive in more.

> Could you add an in-line comment regarding this please. I presume you're talking about the AWS calls within updateResourceVersions?

Yes, those. Comment added.

Inline review threads: package-lock.json; resourceIndexer/coreStagingS3.js; src/config.js; src/sources/core.js; src/sources/models.js; src/resourceIndex.js; src/sources/models.js; resourceIndexer/inventory.js; resourceIndexer/main.js

tsibley · 2023-11-09T01:05:02Z

I've done another full review here, including the new code and revisiting past comments and resolving them as appropriate.

This sets out the pattern for reading S3 inventories and turning them into resource collections. The JSON output will ultimately be used by nextstrain.org both to provide a listing of available resources and to be queried by versioned dataset requests (in order to go from a requested date to the corresponding S3 version IDs of the relevant objects). Eventually this flat JSON file may be replaced with a database, but for now this is a simple way to introduce the functionality.

The collected resources JSON for core + staging is a ~3.2 MB JSON file (gzipped). When naively loaded into Node it increases the total size of the allocated heap (V8) by ~60 MB (presumably this would be reduced by mapping certain string constants to variables).

Currently this only works for the S3 buckets nextstrain-data and nextstrain-staging. Narratives are not yet considered, in part because they are not stored on S3.

Run `node resourceIndexer/main.js --help` for usage. AWS credentials with permission to read s3://nextstrain-inventories will need to be set in the usual way.
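
To make the inventory-to-index step above more concrete, here is a minimal sketch of collapsing inventory rows into a per-object list of dated versions. It is not the indexer from this PR: the row fields (`key`, `versionId`, `lastModified`, `isDeleteMarker`) are assumed names modelled on S3 Inventory columns, the output shape is invented for illustration, and delete markers are simply skipped.

```javascript
// Hypothetical sketch: collapse S3 inventory rows into a per-object list of
// dated versions. Field names are assumptions modelled on S3 Inventory
// columns (Key, VersionId, IsDeleteMarker, LastModifiedDate).
function indexVersions(rows) {
  const byKey = new Map();
  for (const row of rows) {
    if (row.isDeleteMarker) continue; // simplification: ignore deletions
    if (!byKey.has(row.key)) byKey.set(row.key, []);
    byKey.get(row.key).push(row);
  }

  const index = {};
  for (const [key, versions] of byKey) {
    // Newest first. ISO 8601 UTC timestamps sort correctly as plain strings.
    versions.sort((a, b) =>
      a.lastModified < b.lastModified ? 1 : a.lastModified > b.lastModified ? -1 : 0
    );
    const seenDays = new Set();
    index[key] = [];
    for (const v of versions) {
      const date = v.lastModified.slice(0, 10); // YYYY-MM-DD
      if (seenDays.has(date)) continue; // keep only the last upload of each day
      seenDays.add(date);
      index[key].push({ date, versionId: v.versionId });
    }
  }
  return index;
}
```

Keeping one entry per day matches the day-level granularity of the version descriptors, so multiple uploads on the same day resolve to the last one.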

Parses a pre-computed index JSON and stores the data in-memory (on the nextstrain.org server). ResourceVersions is a class which dataset/narrative requests can use to get the (versioned) file URLs for available subresources which were present for a given YYYY-MM-DD. A subsequent commit will allow the usage of "@YYYY-MM-DD" descriptors in URLs, and eventually these data will be used to display all available resources. A subsequent commit in this PR will add documentation regarding the RESOURCE_INDEX variable, but its design was influenced by <#719 (comment)>. Briefly, to use a local index set `RESOURCE_INDEX="./path/to/json"` and to disable the index use `RESOURCE_INDEX="false"`.
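
As an illustration of the date lookup described above, the following sketch picks the version that was current on a given day and builds a versioned S3 URL for it. It is not the ResourceVersions class itself: `versionAsOf`, `versionedUrl` and the `{ date, versionId }` entry shape are hypothetical, and the list is assumed to be sorted newest first.

```javascript
// Hypothetical sketch of a date lookup over one resource's versions.
// `versions` is assumed to be sorted newest first, each entry carrying a
// YYYY-MM-DD `date`; this shape is illustrative, not the PR's index format.
function versionAsOf(versions, requestedDate) {
  // YYYY-MM-DD strings compare chronologically, so no Date parsing is needed.
  const match = versions.find((v) => v.date <= requestedDate);
  return match ?? null; // null: the request predates every known version
}

// S3 serves a specific object version when `versionId` is given as a query
// parameter on the object URL.
function versionedUrl(baseUrl, versionId) {
  const url = new URL(baseUrl);
  url.searchParams.set("versionId", versionId);
  return url.toString();
}
```

A request for a day with no upload of its own falls back to the most recent earlier version, which is what "present for a given YYYY-MM-DD" implies.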

@YYYY-MM-DD

Nextstrain URLs are extended to allow <path>@<version> syntax for core datasets. Currently the <version> must be in YYYY-MM-DD format. The returned version is the one which was the latest on the requested day. If the requested version predates any datasets we return 404.

We attempt to extract a version descriptor for every request; however, for non-core sources (and core narratives)¹ the presence of such a descriptor (i.e. if "@" is in the URL²) will result in a 400 (Bad Request) error. For further examples please see `test/date_descriptor.test.js`.

Note that the @YYYY-MM-DD URLs enabled by this commit look similar to some existing URLs where the datestamp is in the dataset name (e.g. "ncov/gisaid/global/6m/2024-01-02"); conceptually, however, these are quite different.

¹ Fetch URLs (e.g. /fetch/...) are excluded.
² There are exceptions, such as community URLs allowing <repo>@<commit> in the pathname.
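
The rules above can be summarised in a small sketch. This is not the server's routing code: `splitVersionDescriptor`, `statusFor`, `isCoreDataset` and `earliestDate` are hypothetical names, the exceptions in the footnotes (fetch URLs, community `<repo>@<commit>`) are ignored, and returning 400 for a malformed version is an assumption rather than something stated in the commit message.

```javascript
// Hypothetical sketch: split "<path>@<version>" and decide the HTTP status.
const VERSION_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

function splitVersionDescriptor(pathname) {
  const at = pathname.indexOf("@");
  if (at === -1) return { path: pathname, version: null };
  return { path: pathname.slice(0, at), version: pathname.slice(at + 1) };
}

// `isCoreDataset` and `earliestDate` stand in for real source and index
// lookups; both are assumptions made for this example.
function statusFor(pathname, { isCoreDataset, earliestDate }) {
  const { version } = splitVersionDescriptor(pathname);
  if (version === null) return 200; // unversioned request, handled as before
  if (!isCoreDataset) return 400; // descriptors are only valid for core datasets
  if (!VERSION_PATTERN.test(version)) return 400; // assumption: malformed version
  if (version < earliestDate) return 404; // predates every known version
  return 200;
}
```

For example, `splitVersionDescriptor("/enterovirus/d68/vp1@2022-09-06")` yields the path `/enterovirus/d68/vp1` and the version `2022-09-06`.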

Includes documentation of the AWS changes which are not under terraform control, as well as an overview of the general concepts.

See added documentation for corresponding AWS details. Given that the resources all come from public-facing buckets (core & staging) it seems ok to run this from a public repo, but we may want to revisit this once we start consuming private data. The index is only generated for datasets (not intermediate files) and only for the core bucket (nextstrain-data) as that's all that's currently handled by the server, so it saves us a little S3 storage, transfer overhead and server memory footprint. Future work listing/visualising all available data will use this and so this filtering is only temporary.

jameshadfield · 2024-01-04T18:47:02Z

As well as addressing all the open conversations above, I've added a large number of tests to give confidence to the behaviour across sources and resource types. I'm going to merge this now, but feel free to make suggestions which we can address in future PRs. There's more work I'd like to do now that the server has access to the resource index. Many thanks to the reviewers - it's been a big PR, almost certainly with the most comments of any in Nextstrain's history.

P.S. I couldn't convince Heroku to rebuild the review app (unrelated to the contents of this PR), but I did a final round of testing via dev.nextstrain.org, which has the benefit of allowing logins.

Corresponding nextstrain PR <nextstrain/nextstrain.org#719>

@YYYY-MM-DD

…esources Using the same @YYYY-MM-DD suffix syntax as on the web. Support for this server-side is recently landed.¹ ¹ <nextstrain/nextstrain.org#719>

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-fr13kp October 4, 2023 01:33 Inactive

jameshadfield commented Oct 4, 2023 on src/sources/models.js (outdated, resolved)

jameshadfield force-pushed the james/resources branch from 39f79c8 to 3343d50 Compare October 4, 2023 01:56

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-cterli October 4, 2023 01:56 Inactive

jameshadfield commented Oct 4, 2023 on src/manifest.js (outdated, resolved)

victorlin mentioned this pull request Oct 4, 2023

Update dev server IAM policy #718

Closed

2 tasks

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-cterli October 4, 2023 18:14 Inactive

jameshadfield commented Oct 4, 2023 on aws/iam/policy/NextstrainDotOrgServerInstanceDev.json (outdated, resolved)

jameshadfield commented Oct 4, 2023 on src/manifest.js (outdated, resolved)

jameshadfield force-pushed the james/resources branch from bf89818 to 0aaf7fa Compare October 4, 2023 20:54

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-cterli October 4, 2023 20:55 Inactive

jameshadfield force-pushed the james/resources branch from 0aaf7fa to 575c95a Compare October 5, 2023 03:28

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-cterli October 5, 2023 03:28 Inactive

jameshadfield commented Oct 5, 2023 on src/sources/community.js (outdated, resolved)

tsibley self-requested a review October 10, 2023 22:12

tsibley self-assigned this Oct 10, 2023

tsibley force-pushed the james/resources branch from 575c95a to 4a0367f Compare October 16, 2023 19:45

tsibley requested changes Oct 17, 2023

jameshadfield force-pushed the james/resources branch from 4a0367f to bdf7175 Compare October 18, 2023 03:40

jameshadfield force-pushed the james/resources branch 5 times, most recently from b4c57c4 to 354a458 Compare October 19, 2023 20:55

nextstrain-bot temporarily deployed to nextstrain-dev October 23, 2023 18:16 Inactive

nextstrain-bot temporarily deployed to nextstrain-s-james-reso-bce2le October 23, 2023 18:49 Inactive

jameshadfield mentioned this pull request Nov 1, 2023

[resource collection] Interaction with PUT/DELETE/OPTIONS methods is likely wrong #744

Open

jameshadfield force-pushed the james/resources branch 2 times, most recently from 255a329 to d10c484 Compare November 3, 2023 02:32

This was referenced Nov 3, 2023

GitHub action to automate resource index rebuilding #747

Closed

[resource collection] Terraform for related AWS settings #748

Open

tsibley reviewed Nov 7, 2023: src/resourceIndex.js (outdated, resolved)

tsibley requested changes Nov 8, 2023

tsibley reviewed Nov 9, 2023: resourceIndexer/inventory.js (outdated, resolved)

tsibley reviewed Nov 9, 2023: resourceIndexer/main.js (outdated, resolved)

tsibley mentioned this pull request Nov 21, 2023

Support for Nextstrain CLI's new means of authentication with IdPs #757

Merged

2 tasks

jameshadfield added 5 commits January 4, 2024 16:09

[docs] resource collection docs

855d3b2

Includes documentation of the AWS changes which are not under terraform control, as well as an overview of the general concepts.

jameshadfield force-pushed the james/resources branch from 0b56c84 to c0ee239 Compare January 4, 2024 03:30

jameshadfield marked this pull request as ready for review January 4, 2024 18:47

jameshadfield changed the title ~~Allow versioned resource access (core/staging)~~ Allow versioned resource access for core datasets Jan 4, 2024

jameshadfield merged commit c3dd726 into master Jan 4, 2024
7 checks passed

jameshadfield deleted the james/resources branch January 4, 2024 18:47

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Jan 15, 2024

Versioned resources docs

eaac099

Corresponding nextstrain PR <nextstrain/nextstrain.org#719>

This was referenced Apr 8, 2024

List datasets + files from core + staging sources #700

Closed

Allow access to S3 versioned datasets #196

Closed

tsibley mentioned this pull request Aug 13, 2024

Rebuild index job failed #975

Closed

3 tasks
