-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow versioned resource access for core datasets #719
Conversation
39f79c8
to
3343d50
Compare
bf89818
to
0aaf7fa
Compare
0aaf7fa
to
575c95a
Compare
Talked with James in our 1:1. This is ready for me to review. He chose not to implement a new ResourceVersion class (as in Source → Resource → ResourceVersion → SubResource) as we'd talked about two weeks ago, but instead make the existing Resource class version-aware. |
575c95a
to
4a0367f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read thru the commits today in reverse order. Didn't quite make it thru the first commit (last in order I reviewed), the one introducing the manifest/
directory. I'll finish that in the next day or so but wanted to checkpoint my comments so far. A bunch of stuff to comment on, but this feels mostly headed in the right direction.
4a0367f
to
bdf7175
Compare
@jameshadfield Thinking about my review and suggestions some more, I wanted to offer to implement some of the larger changes I've suggested and then walk you thru them if it'd be more clear for me to do that rather than just discuss them. Not a problem if not. Totally your preference here. |
b4c57c4
to
354a458
Compare
@tsibley could you re-review this at your convenience? All comments should have been addressed, except for (i) log statements which I'll remove later and (ii) the definition of "closest" / redirect to closest code which I'll defer until the wider group makes a decision on desired behavior here. The S3 resources index has been updated to the new format.
Thanks, but I'd prefer to discuss the concepts rather than read through an implementation. For what it's worth, none of the changes were large in terms of code changes, but conceptually I guess there are a few changes. |
Some other todos:
Some of these are deferrable but they all need doing sooner than later. |
255a329
to
d10c484
Compare
@tsibley could you work through your comments when you have a chance? I believe I've addressed them all. (And please resolve the ones you feel are addressed, else with this many comments the PR becomes hard to read through.)
Could you add an in-line comment regarding this please. I presume you're talking about the AWS calls within |
@jameshadfield Yep. Was updating myself a bit on this today. Will continue to dive in more.
Yes, those. Comment added. |
I've done another full review here, including the new code and revisiting past comments and resolving them as appropriate. |
This sets out the pattern for reading S3 inventories and turning them into resource collections. The JSON output will ultimately be used by nextstrain.org to both provide a listing of available resources and to be queried by versioned dataset requests (in order to go from a requested date to the corresponding S3 version IDs of the relevant objects). Eventually this flat JSON file may be replaced with a database, but for now this is a simple way to introduce the functionality. The collected resources JSON for core + staging is a ~3.2Mb JSON file (gzipped). When naively loaded into node it increases the total size of the allocated heap (V8) by ~60Mb (presumably this would be reduced by mapping certain string constants to variables). Currently only working for S3 buckets nextstrain-data and nextstrain-staging. Narratives are not yet considered, in part because they are not stored on S3. `node resourceIndexer/main.js --help` for how to run. AWS credentials with permission to read s3://nextstrain-inventories will need to be set in the usual way.
Parses a pre-computed index JSON and stores the data in-memory (on the nextstrain.org server). ResourceVersions is a class which dataset/narrative requests can use to get the the (versioned) file URLs for available subresources which were present for a given YYYY-MM-DD. A subsequent commit will allow the usage of "@YYYY-DD-MM" descriptors in URLs, and eventually these data will be used to display all available resources. A subsequent commit in this PR will add documentation regarding the RESOURCE_INDEX variable, but it's design was influenced by <#719 (comment)>. Briefly, to use a local index set `RESOURCE_INDEX="./path/to/json"` and to disable the index use `RESOURCE_INDEX="false"`.
Nextstrain URLs are extended to allow <path>@<version> syntax for core datasets. Currently the <version> must be in YYYY-MM-DD format. The returned version is the one which was the latest on the requested day. If the requested version predates any datasets we return 404. We attempt to extract a version descriptor for every request, however for non-core sources (and core narratives)¹ the presence of such a descriptor (i.e. if "@" is in the URL²) will result in a 400 (Bad Request) error. For further examples please see `test/date_descriptor.test.js`. Note that the @YYYY-MM-DD URLs enabled by this commit look similar to some existing URLs where the datestamp is in the dataset name (e.g. "ncov/gisaid/global/6m/2024-01-02") however conceptually these are quite different. ¹ Fetch URLs (e.g. /fetch/...) are excluded ² There are exceptions, such as community URLs allowing <repo>@<commit> in the pathname.
Includes documentation of the AWS changes which are not under terraform control, as well as an overview of the general concepts.
See added documentation for corresponding AWS details. Given that the resources all come from public-facing buckets (core & staging) it seems ok to run this from a public repo, but we may want to revisit this once we start consuming private data. The index is only generated for datasets (not intermediate files) and only for the core bucket (nextstrain-data) as the that's all that's currently handled by the server, so it saves us a little s3 storage, transfer overhead and server memory footprint. Future work listing/visualising all available data will use this and so this filtering is only temporary.
0b56c84
to
c0ee239
Compare
As well as addressing all the open conversations above, I've added a large number of tests to give confidence to the behaviour across sources and resource types. I'm going to merge this now, but feel free to make suggestions which we can address in future PRs. There's more work I'd like to do now that the server has access to the resource index. Many thanks to the reviewers - it's been a big PR, almost certainly with the most comments of any in Nextstrain's history. P.S. I couldn't convince Heroku to rebuild the review app (unrelated to the contents of this PR), but I did a final round of testing via dev.nextstrain.org, which has the benefit of allowing logins. |
Corresponding nextstrain PR <nextstrain/nextstrain.org#719>
…esources Using the same @YYYY-MM-DD suffix syntax as on the web. Support for this server-side is recently landed.¹ ¹ <nextstrain/nextstrain.org#719>
See commit messages for details. Some URLs as an example:
Ready for review, but there are a few things before this can be merged: