Allow versioned resource access for core datasets #719

Merged
jameshadfield merged 5 commits into master from james/resources
Jan 4, 2024

Conversation

jameshadfield
Member

@jameshadfield jameshadfield commented Oct 4, 2023

See commit messages for details. Some URLs as an example:

Ready for review, but there are a few things before this can be merged:

  • More testing needed / tests added?
  • Run the manifest generation script somewhere (GitHub actions of a private repo on a schedule?) and upload the results to S3. This can be done following merge but we need a plan on how to implement it. See GitHub action to automate resource index rebuilding #747
  • Update the manifest on the server. The easiest is probably on a 1-hour interval and check for ETag changes.
  • Remove console.log statements
  • Ensure the IAM Terraform configs have been applied to AWS. Update: these were added in 1019326, and as of 2024-01-02 AWS has the correct permissions.
  • As of 2023-11-07 (NZ), versioning has been enabled on nextstrain-inventories. Rerun the indexer in a day or two, once past versions of the inventory files have been deleted and thus exist as versions. (I don't expect any changes.)
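
The hourly refresh with an ETag check suggested in the list above could look roughly like the following. This is a minimal sketch, not the PR's implementation: `createIndexUpdater`, its options, and the injected `fetchImpl` are hypothetical names, and it assumes the index is served over HTTP as JSON by a server that honours `If-None-Match` (as S3 does).

```javascript
// Hypothetical sketch: poll a JSON index and reload it only when its ETag changes.
const ONE_HOUR_MS = 60 * 60 * 1000;

function createIndexUpdater(url, { fetchImpl = fetch, onUpdate }) {
  let etag = null; // ETag of the copy currently held in memory

  // Returns true if a new index was loaded, false if nothing changed.
  async function check() {
    const headers = etag ? { "If-None-Match": etag } : {};
    const res = await fetchImpl(url, { headers });
    if (res.status === 304) return false; // server says: not modified
    if (!res.ok) throw new Error(`Index fetch failed with status ${res.status}`);
    const data = await res.json();
    etag = res.headers.get("etag");
    onUpdate(data);
    return true;
  }

  // Re-check on a fixed interval; errors are logged and the old index is kept.
  function start(intervalMs = ONE_HOUR_MS) {
    const timer = setInterval(() => check().catch(console.error), intervalMs);
    timer.unref(); // don't keep the process alive just for this timer
    return timer;
  }

  return { check, start };
}
```

Keeping the previous index in memory when a refresh fails means a transient S3 error would not take versioned URLs offline.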

Resolved review threads: src/sources/models.js, src/manifest.js
@victorlin victorlin mentioned this pull request Oct 4, 2023
2 tasks
Resolved review thread: src/manifest.js
@tsibley
Member

tsibley commented Oct 10, 2023

Talked with James in our 1:1. This is ready for me to review. He chose not to implement a new ResourceVersion class (as in Source → Resource → ResourceVersion → SubResource) as we'd talked about two weeks ago, but instead make the existing Resource class version-aware.

@tsibley
Member

tsibley commented Oct 16, 2023

I rebased this onto current master and made the appropriate changes to the server's IAM policies managed by Terraform: 4a0367f. I also dropped 5a33a9e, which was marked for dropping and which we don't want to keep.

Member

@tsibley tsibley left a comment

I read thru the commits today in reverse order. Didn't quite make it thru the first commit (last in order I reviewed), the one introducing the manifest/ directory. I'll finish that in the next day or so but wanted to checkpoint my comments so far. A bunch of stuff to comment on, but this feels mostly headed in the right direction.

Resolved review threads: src/endpoints/sources.js (5), manifest/constants.js (1), manifest/main.js (4)
@tsibley
Member

tsibley commented Oct 18, 2023

@jameshadfield Thinking about my review and suggestions some more, I wanted to offer to implement some of the larger changes I've suggested and then walk you thru them if it'd be more clear for me to do that rather than just discuss them. Not a problem if not. Totally your preference here.

@jameshadfield jameshadfield force-pushed the james/resources branch 5 times, most recently from b4c57c4 to 354a458 Compare October 19, 2023 20:55
@jameshadfield
Member Author

@tsibley could you re-review this at your convenience? All comments should have been addressed, except for (i) log statements which I'll remove later and (ii) the definition of "closest" / redirect to closest code which I'll defer until the wider group makes a decision on desired behavior here. The S3 resources index has been updated to the new format.

> I wanted to offer to implement some of the larger changes I've suggested and then walk you thru them if it'd be more clear for me to do that rather than just discuss them

Thanks, but I'd prefer to discuss the concepts rather than read through an implementation. For what it's worth, none of the changes were large in terms of code changes, but conceptually I guess there are a few changes.

@tsibley
Member

tsibley commented Nov 1, 2023

Some other todos:

Some of these are deferrable but they all need doing sooner than later.

@jameshadfield
Member Author

@tsibley could you work through your comments when you have a chance? I believe I've addressed them all. (And please resolve the ones you feel are addressed, else with this many comments the PR becomes hard to read through.)

> Adopt AWS config support added to master since this was branched. Probably swap away from v2 SDK and add necessary functions to src/s3.js?

Could you add an in-line comment regarding this please. I presume you're talking about the AWS calls within updateResourceVersions?

Resolved review thread: src/resourceIndex.js
@tsibley
Member

tsibley commented Nov 7, 2023

@jameshadfield Yep. Was updating myself a bit on this today. Will continue to dive in more.

> Could you add an in-line comment regarding this please. I presume you're talking about the AWS calls within updateResourceVersions?

Yes, those. Comment added.

Resolved review threads: package-lock.json (1), resourceIndexer/coreStagingS3.js (4), src/config.js (1), src/sources/core.js (1), src/sources/models.js (2), src/resourceIndex.js (1), resourceIndexer/main.js (1)
@tsibley
Member

tsibley commented Nov 9, 2023

I've done another full review here, including the new code and revisiting past comments and resolving them as appropriate.

This sets out the pattern for reading S3 inventories and turning them
into resource collections. The JSON output will ultimately be used by
nextstrain.org to both provide a listing of available resources and to
be queried by versioned dataset requests (in order to go from a
requested date to the corresponding S3 version IDs of the relevant
objects).
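
As a rough illustration of the inventory-to-collection step described above, the sketch below groups inventory rows by object key and orders each key's versions newest first. The row fields (`key`, `versionId`, `lastModified`, `isDeleteMarker`) and the function name `collectVersions` are assumptions loosely modelled on S3 Inventory fields; the indexer's actual schema and output format are not shown in this conversation.

```javascript
// Hypothetical sketch: group parsed S3 inventory rows into per-key version lists.
// Each row is assumed to look like:
//   { key, versionId, lastModified: "<ISO 8601 UTC timestamp>", isDeleteMarker }
function collectVersions(inventoryRows) {
  const byKey = new Map();
  for (const row of inventoryRows) {
    if (!byKey.has(row.key)) byKey.set(row.key, []);
    byKey.get(row.key).push({
      versionId: row.versionId,
      lastModified: row.lastModified,
      date: row.lastModified.slice(0, 10), // YYYY-MM-DD part of the timestamp
      deleted: Boolean(row.isDeleteMarker),
    });
  }
  // ISO 8601 UTC timestamps sort chronologically as plain strings.
  for (const versions of byKey.values()) {
    versions.sort((a, b) =>
      a.lastModified < b.lastModified ? 1 : a.lastModified > b.lastModified ? -1 : 0
    );
  }
  return byKey;
}
```

A structure like this keeps the date-to-version lookup cheap, since each key's versions are already ordered.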

Eventually this flat JSON file may be replaced with a database,
but for now this is a simple way to introduce the functionality. The
collected resources JSON for core + staging is a ~3.2MB JSON file
(gzipped). When naively loaded into node it increases the total size of
the allocated heap (V8) by ~60MB (presumably this would be reduced by
mapping certain string constants to variables).

Currently only working for S3 buckets nextstrain-data and
nextstrain-staging. Narratives are not yet considered, in part because
they are not stored on S3.

`node resourceIndexer/main.js --help` for how to run. AWS credentials
with permission to read s3://nextstrain-inventories will need to be set
in the usual way.

Parses a pre-computed index JSON and stores the data in-memory (on the
nextstrain.org server). ResourceVersions is a class which
dataset/narrative requests can use to get the (versioned) file URLs
for available subresources which were present for a given YYYY-MM-DD. A
subsequent commit will allow the usage of "@YYYY-MM-DD" descriptors in
URLs, and eventually these data will be used to display all available
resources.
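
A minimal sketch of the lookup this paragraph describes, assuming each subresource carries a list of versions sorted newest first: pick the latest version dated on or before the requested day, then build a versioned S3 URL with the `versionId` query parameter. The helper names and the version-list shape are hypothetical, and delete markers are ignored for brevity.

```javascript
// Hypothetical helpers; `versions` must be sorted newest first and each entry
// is assumed to look like { date: "YYYY-MM-DD", versionId: "..." }.
function versionIdForDate(versions, requestedDate) {
  // YYYY-MM-DD strings compare correctly as plain strings, so the first entry
  // that is not after the requested day is the latest one available that day.
  const match = versions.find((v) => v.date <= requestedDate);
  return match ? match.versionId : null; // null: nothing existed yet
}

// Build a virtual-hosted-style S3 URL that pins a specific object version.
function versionedUrl(bucket, key, versionId) {
  const path = key.split("/").map(encodeURIComponent).join("/");
  return `https://${bucket}.s3.amazonaws.com/${path}?versionId=${encodeURIComponent(versionId)}`;
}
```

A `null` result is the case a caller would translate into a "not found" response for dates that predate the resource.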

A subsequent commit in this PR will add documentation regarding the
RESOURCE_INDEX variable, but its design was influenced by
<#719 (comment)>.
Briefly, to use a local index set `RESOURCE_INDEX="./path/to/json"` and
to disable the index use `RESOURCE_INDEX="false"`.

Nextstrain URLs are extended to allow <path>@<version> syntax for core
datasets. Currently the <version> must be in YYYY-MM-DD format. The
returned version is the one which was the latest on the requested day.
If the requested version predates any datasets we return 404.

We attempt to extract a version descriptor for every request, however
for non-core sources (and core narratives)¹ the presence of such a
descriptor (i.e. if "@" is in the URL²) will result in a 400 (Bad
Request) error. For further examples please see
`test/date_descriptor.test.js`.
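
The descriptor handling described above might be sketched as follows. This is illustrative only: `parseVersionDescriptor`, the `versionable` flag, and the returned shape are hypothetical; answering 400 for a malformed date on a core dataset is an assumption; and the exceptions noted in the footnotes (fetch URLs, community `<repo>@<commit>` paths) are not handled.

```javascript
// Accept only real calendar dates written as YYYY-MM-DD.
function isValidDate(s) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(s)) return false;
  const d = new Date(`${s}T00:00:00Z`);
  // Reject unparseable dates and ones the engine silently rolls over
  // (e.g. a 30th of February).
  return !Number.isNaN(d.getTime()) && d.toISOString().slice(0, 10) === s;
}

// Hypothetical sketch: split "<path>@<version>" and decide how to respond.
// `versionable` is true only for sources/resources that support versions.
function parseVersionDescriptor(urlPath, { versionable }) {
  const at = urlPath.indexOf("@");
  if (at === -1) return { ok: true, path: urlPath, version: null };
  if (!versionable) return { ok: false, status: 400 }; // "@" not allowed here
  const path = urlPath.slice(0, at);
  const version = urlPath.slice(at + 1);
  if (!isValidDate(version)) return { ok: false, status: 400 }; // assumed behaviour
  return { ok: true, path, version };
}
```

For example, `parseVersionDescriptor("/zika@2023-01-01", { versionable: true })` yields the path `/zika` and the version `2023-01-01`, while the same URL with `versionable: false` is rejected with status 400.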

Note that the @YYYY-MM-DD URLs enabled by this commit look similar to
some existing URLs where the datestamp is in the dataset name (e.g.
"ncov/gisaid/global/6m/2024-01-02") however conceptually these are quite
different.

¹ Fetch URLs (e.g. /fetch/...) are excluded
² There are exceptions, such as community URLs allowing <repo>@<commit>
in the pathname.

Includes documentation of the AWS changes which are not under terraform
control, as well as an overview of the general concepts.

See added documentation for corresponding AWS details. Given that the
resources all come from public-facing buckets (core & staging) it seems
ok to run this from a public repo, but we may want to revisit this once
we start consuming private data.

The index is only generated for datasets (not intermediate files) and
only for the core bucket (nextstrain-data) as that's all that's
currently handled by the server, so it saves us a little S3 storage,
transfer overhead and server memory footprint. Future work
listing/visualising all available data will use this and so this
filtering is only temporary.
@jameshadfield
Member Author

As well as addressing all the open conversations above, I've added a large number of tests to give confidence to the behaviour across sources and resource types. I'm going to merge this now, but feel free to make suggestions which we can address in future PRs. There's more work I'd like to do now that the server has access to the resource index. Many thanks to the reviewers - it's been a big PR, almost certainly with the most comments of any in Nextstrain's history.

P.S. I couldn't convince Heroku to rebuild the review app (unrelated to the contents of this PR), but I did a final round of testing via dev.nextstrain.org, which has the benefit of allowing logins.

@jameshadfield jameshadfield marked this pull request as ready for review January 4, 2024 18:47
@jameshadfield jameshadfield changed the title Allow versioned resource access (core/staging) Allow versioned resource access for core datasets Jan 4, 2024
@jameshadfield jameshadfield merged commit c3dd726 into master Jan 4, 2024
7 checks passed
@jameshadfield jameshadfield deleted the james/resources branch January 4, 2024 18:47
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Jan 15, 2024
tsibley added a commit to nextstrain/cli that referenced this pull request Jan 18, 2024
…esources

Using the same @YYYY-MM-DD suffix syntax as on the web.  Server-side
support for this recently landed.¹

¹ <nextstrain/nextstrain.org#719>
@tsibley tsibley mentioned this pull request Aug 13, 2024
3 tasks
4 participants