
List datasets + files from core + staging sources #700

Closed · wants to merge 13 commits

Conversation

jameshadfield (Member):

This is a pretty major PR; most of the detail is in the commit messages and in comments in the code. A high-level overview of the functionality introduced:

  • For our core & staging sources (s3://nextstrain-data and s3://nextstrain-staging, respectively) we create daily inventories on S3, which are stored in the private bucket s3://nextstrain-inventories. These are essentially free to generate (<$1/year/bucket).
  • Our server pulls these inventories down daily and maintains a set of datasets and files, including past versions of each where available.
  • A new API is introduced to access this data in a cross-source fashion, including filtering.
  • The /pathogens page shows the core datasets using the UI I recently prototyped. A new /pathogens/inputs page shows the core files. The /staging page (previously not working) also now uses this new UI.

As is the case with feature pushes of this scope, there is no clear end point. There are a huge number of potential improvements we can make, so for this PR please indicate whether you consider something blocking vs. an improvement we can tackle in subsequent PRs 🙏

From my point of view, the following needs to be done before merge (in the same vein as how I approached the prototype, I'm trying to share work at an early stage):

  • Client-side error handling if the API calls 404 / time out / contain no data.
  • Amend the production IAM policy to have access to the new bucket (review apps use the testing policy, which I've already modified)
  • [Could be done after merge] Implement an S3 lifecycle rule to expire inventory objects after 30 days (?)

I'll make some notes via in-line comments about feature pushes I don't think are blocking here, but which come up as a result of this work, so that discussions can be threaded.

Testing

It should 🤞 all work via review apps. To test locally, you can avoid the S3 API calls by creating a (git-ignored) ./devData folder and adding a manifest+inventory for each bucket with the following filenames:

./devData/core.manifest.json          ./devData/core.inventory.csv.gz
./devData/staging.manifest.json       ./devData/staging.inventory.csv.gz

(You can pick any day's inventory; which day isn't important for dev purposes.) Then run the server with the LOCAL_INVENTORY=true environment variable.
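For local testing, the server's LOCAL_INVENTORY branch might look something like this minimal sketch (file layout as above; the function name and return shape are illustrative, not necessarily what this PR implements):

// Hypothetical sketch of reading a local inventory instead of calling S3.
const fs = require('fs/promises');
const zlib = require('zlib');

async function fetchInventoryLocally(name) { // name: "core" | "staging"
  const manifest = JSON.parse(
    await fs.readFile(`./devData/${name}.manifest.json`, 'utf8')
  );
  // The inventory itself is a gzipped CSV, matching what S3 inventories produce.
  const inventory = zlib.gunzipSync(
    await fs.readFile(`./devData/${name}.inventory.csv.gz`)
  ).toString('utf8');
  return {manifest, inventory};
}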

P.S. "manifest JSON" here refers to the S3 inventory manifest and is completely unrelated to our existing usage of manifest JSONs. However, this work should eventually allow us to remove those manifests. So that's nice.

Sets out the main design ideas behind the server's collection of resources and how these are made available via a new API. See commentary added in the code for more details.
API design is WIP; see commentary added in this commit.

There is subtlety in how we want prefix-based filtering to work: the _name_ of a set of datasets is not always the same as the filename, because we have the ability to group multiple filenames (keys) together.
E.g. given the following collection of datasets:
  - name: X_Y
  - versions: [
    - name: X_Y (current version)
    - name: X_Y (old version)
    - name: X_Y_2022-01-01 (current version, datestamped filename)
  ]
Then prefix-based filtering on "X/Y/2022-01-01" could either
return nothing, as it doesn't match the name assigned to the collection,
or return the dataset collection restricted to one of the three versions.
Here I implement the former.
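A minimal sketch of the behaviour chosen here (names hypothetical; the real implementation is elsewhere in this PR):

// Match only on the name assigned to each collection, never on the
// filenames of individual versions, so a prefix which only matches a
// datestamped filename returns nothing.
function filterByPrefix(collections, prefix) {
  return collections.filter((collection) => collection.name.startsWith(prefix));
}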
Code is taken from my existing prototype of the cards UI at
https://github.com/nextstrain/workflow-asset-summary/tree/6730c2544d31b63bbe274c96b31e524176e03ca2
with widespread modifications to adapt it to this repo's design, including:
 - typescript -> javascript (a big downgrade)
 - CSS changes
 - API response schema is slightly different
 - disables the "drawer" drop-down for cards as that functionality is not
possible using the approach from the prototype (due to d3)
 - removal of version "rugplot", which relied on d3 controlling the DOM
 - rewriting the filtering algorithm
 - the card hierarchy is created client-side rather than server side
The spinner design is copied from multiple other Nextstrain projects.
(I was surprised this wasn't already part of nextstrain.org)
Reflects the new cards UI as the main interface to the datasets. The
previous datasets+narratives table is restricted to narratives-only,
as these are not yet (or may never be) part of the new cards UI.

The colourful tiles are restricted to 6 pathogens, which on most screen
resolutions render as 3 columns x 2 rows. The removed pathogens are
not being regularly updated, and the ultimate aim is for datasets to be
highlighted through the new UI.
The new URL ("/pathogens/inputs") is not advertised elsewhere as this
page will probably see modifications in the short term. We need to
decide what kind of files we want listed and how, if at all, we should
link these to their associated dataset.
The previous dev-only approach is preserved under a `LOCAL_INVENTORY`
env flag.
The staging page was previously broken due to an underlying API
failure (?), so here we essentially replace the entire page.

The spark-line is slightly misleading for sources where we don't store
past versions of datasets (in this case, the underlying S3 bucket isn't
versioned). Nonetheless, some datasets are grouped together by our
name munging rules.

I didn't implement a files (inputs) page for staging as we essentially
have no non-dataset files in the bucket!
This uses a simple interval-based approach, which should be enough for
the core & staging sources.

Ideally we would re-fetch the inventory shortly after its creation.
The time of day at which a new inventory appears seems random, so this
would require using events, or polling the inventory directory more
frequently and only updating when a new manifest appears.
The directory name of inventory manifests seems to be consistently
YYYY-MM-DD + "T01-00Z" (e.g. 2023-07-12T01-00Z/), so perhaps it's
enough to make HEAD requests for the upcoming day's data?
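As a sketch of that idea, using SDK v2 as elsewhere in this PR (the bucket name and key layout here are assumptions for illustration):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// HEAD the manifest we expect for a given day; returns true once the
// inventory for that day has been published.
async function manifestExists(prefix, day) { // day: "YYYY-MM-DD"
  try {
    await s3.headObject({
      Bucket: 'nextstrain-inventories',
      Key: `${prefix}/${day}T01-00Z/manifest.json`,
    }).promise();
    return true;
  } catch (err) {
    if (err.code === 'NotFound') return false;
    throw err;
  }
}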
@jameshadfield jameshadfield requested a review from a team July 13, 2023 03:13

function Details({data, view}) {
jameshadfield (Member, Author):

[Potentially blocking] The ability to show more detailed information about versions below a card has been disabled here - partly as it requires a bit of polish, and partly as it used d3 to manipulate the DOM and d3 + gatsby = 😡.

Because we group datestamped datasets with the non-datestamped versions (i.e. .../6m/2023-01-01 is shown as a version of .../6m, rather than as its own dataset), there is currently no way to actually access this datestamped dataset. These were exposed in the (now removed) table, so this is arguably a downgrade. Although that table was incomplete, so...

const [state, setState] = useState([]);
const counts = {};
useEffect(() => {
console.log("----- useEffect // useFilterOptions")
jameshadfield (Member, Author):

Logs will be removed before merge, but they may be helpful for people during review.

Array.from(files).map(([, objects]) => new CollectedResources(this, objects))
);
// TODO XXX narratives
}
jameshadfield (Member, Author):

There's an inherent linking between (some) datasets and (some) intermediate files. It would be nice to surface this - e.g. when looking at the card for zika you could see the available intermediate files as well. I ummed and ahhed about trying to build this into this version but ultimately left it as a future todo. Beyond the obvious "how do we programmatically know which datasets are connected to which files", the desired UI needs to be worked out before we know how best to approach it. Note that such a linking is cross-source, so if it's done ahead of time on the server, we'll have to perform a per-linked-resource authz filtering (e.g. the intermediate files for a dataset might be stored in a source you don't have permission to view).

Member:

Note that such a linking is cross-source, so if it's done ahead of time on the server, we'll have to perform a per-linked-resource authz filtering (e.g. the intermediate files for a dataset might be stored in a source you don't have permission to view).

I think this betrays a misunderstanding: that the Source classes are about everything in e.g. a bucket or a GitHub repo. They're very much not. They're just about the datasets and narratives (the Resource classes). I don't think we should be thinking about the intermediates we're trying to expose here thru the lens of our Source classes.

It's been a long time since I've been burned by MacOS' filesystem
being case-insensitive, but it happens!
CoreCollectedResources : CoreStagingCollectedResources;
const datasets = new Map(), files = new Map();
s3objects.forEach((object) => {
const [name, resourceType] = CollectedResources.objectName(object);
jameshadfield (Member, Author):

Grouping of resources is done by those which have the same name, so there are decisions to be made around what name to give things. For instance, I think it's clear that we should group datestamped-datasets with their non-datestamped selves. We should probably also group .zst and .gz files together (although here I don't). Would be great to have others' thoughts.
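For concreteness, grouping rules along those lines might look like this sketch (hypothetical; it deliberately includes the .zst/.gz grouping that this PR does not implement):

// Hypothetical name normalisation: strip a compression suffix and a trailing
// datestamp so variants group under one name. Not the PR's actual rules.
function groupingName(key) {
  return key
    .replace(/\.(gz|zst|xz)$/, '')               // metadata.tsv.gz ~ metadata.tsv.zst
    .replace(/_\d{4}-\d{2}-\d{2}(?=\.|$)/, '');  // zika_2022-01-01.json ~ zika.json
}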

Member:

For instance, I think it's clear that we should group datestamped-datasets with their non-datestamped selves.

See my comment elsewhere (and earlier), but I don't agree this is clear!


tsibley commented Jul 13, 2023

Mostly a note to self. Things to make sure to review here based on the Blab Nextstrain meeting just now:

  • Lifecycle policies for s3://nextstrain-inventories
  • Terraform config for s3://nextstrain-inventories
  • Terraform config for IAM policies
  • How does/can this relate to resource-listing endpoints for use by RESTful API clients?
  • Lifecycle policies for other S3 buckets with trial/test/dev stuff… what are we missing?

@tsibley tsibley self-requested a review July 13, 2023 19:32
* We can perform authorization here by either calling
* authz.authorized(user, authz.actions.Read, object),
* See note above for more details.
*/
jameshadfield (Member, Author) commented Jul 13, 2023:

@tsibley threading the authz conversation here. For our other APIs, the authz check is done on a single resource (Resource subclass instance). Here we have a collection of resources, which may include resources with different filenames etc.

We can easily perform an authorization check on the CollectedResources instance, but to stay properly aligned with the current approach the check should potentially be on each individual resource within the collection. If the latter is indeed what we want to do it would be another reason to extend the Resource subclasses such that the individual resources within a collection are themselves instances of Resource.
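A sketch of the per-resource variant (assuming the individual versions within a collection become Resource instances, as suggested above; authz is the repo's existing module, called as in the code comment above):

// Filter a collection down to the resources this user may read, applying the
// same check our other APIs perform on single resources.
function visibleResources(user, collection) {
  return collection.resources.filter(
    (resource) => authz.authorized(user, authz.actions.Read, resource)
  );
}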

Member:

I think it would be a mistake to treat /x/y/z and /x/y/z/YYYY-MM-DD as the same thing in the current data model. And thus a mistake to authz them as the same thing too. To do so would be to dramatically undermine the current data model and increase complexity of reasoning about authz.

I know we think of these two things as the same, but it's complicated (as you've seen!) to do so systematically and thus programmatically. Instead, I think we should only rely on more systematic versioning like S3's (and not our ad-hoc dated forms) when collecting the version history of resources. That is, we should only consider the S3 version history of /x/y/z and not include the /x/y/z/YYYY-MM-DD dated forms in the history of /x/y/z. This means downplaying or filtering out the dated forms, but I think that's acceptable (?) since the S3 version history still exists and (thanks to your work here!) we'll have access to the systematic versioning going forward.

tsibley (Member) left a review:

I read thru this linearly, leaving comments on specifics I saw as I went. I mostly skipped over the UI bits as I wanted to focus on the bigger picture. The major takeaways for me:

  • I don't think the internal design/data model here is the right one, as I elaborated elsewhere.
  • I don't think we should be collapsing /x/y/z and /x/y/z/YYYY-MM-DD, as I elaborated on elsewhere. I don't think this would be a loss of any actual versions, and it makes the conceptual integration with the codebase a lot, lot simpler.
  • I think the collection/indexing should be happening in a separate process, external from the web server, producing an artifact (a giant JSON blob, a SQLite database, etc) that's consumed by the web server.
  • A lot of the code seems to me to suffer from accidental complexity that's not been managed away/minimized. As accidental complexity builds up it drags down a codebase and makes things harder to understand/modify/maintain. I think of it a bit like barnacles or other biofouling on a ship. I realize some of this may be attributable to being a WIP/prototype, but I think it'll be important to address beyond that point rather than forget about.

Those are all really frank, but don't get me wrong, I think we absolutely want the functionality you've prototyped here! I think it will be an important set of features and foundational to future work, enough that we'll really want it on solid footings.

This might be a good focus (like we've been discussing) for the team to collaborate on? Or at least you and I? At least I'd be interested in elaborating in code instead of just commentary. And maybe you'd be interested in further refining the UI-side in parallel?

@@ -60,7 +60,7 @@ const Groups = (() => {
updateKnownDatasetsAndNarratives,
1000 * 60 * 60 * 6 // 6 hours
);
updateKnownDatasetsAndNarratives(); // initial collection on server start
// updateKnownDatasetsAndNarratives(); // initial collection on server start
Member:

What's the need for this temporary [drop] commit?

Comment on lines +12 to +14
* The entire set of these resources is large, and this API hasn't been
* tested for performance. Currently the set is held in memory and
* filtering operations are performed for each request.
Member:

For prototyping this is fine, for sure, but I think we should consider other implementations here (SQLite) before we go much further.
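For illustration, a SQLite-backed lookup could replace the in-memory scan with something like this sketch (better-sqlite3 assumed as the driver; the schema is invented):

const Database = require('better-sqlite3');
const db = new Database('resources.sqlite', { readonly: true });

// Let the database do the filtering instead of scanning an in-memory set
// on every request.
function listResources(sourceId, prefix) {
  return db.prepare(
    `SELECT name, resource_type FROM resources
     WHERE source = ? AND name LIKE ? || '%'`
  ).all(sourceId, prefix);
}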

Comment on lines +4 to +10
* Handler for /charon/getAvailable/V2 requests
*
* WIP: The getAvailable API design should be flexible enough to access
* resources across the nextstrain ecosystem. This is a more flexible
* definition of resource than our Resource class as this may include
* intermediate files (sequences, alignments, metadata) which are not
* fetched by Auspice.
Member:

Why is this endpoint part of the Charon API, which is tightly coupled with Auspice, instead of elsewhere?

[resourceType, resources.map((resource) => {
return {
name: resource.name,
url: resource.nextstrainUrl(),
Member:

I don't know what nextstrainUrl is here (haven't read there yet), but it's odd to me to have "nextstrain" in the property name. My guess is that url is the internal URL and "nextstrain" means the URL on nextstrain.org? If so, maybe call it publicUrl instead? Or if it's really a path fragment and not a URL, then path?

* `nextstrainUrl` method on the respective source.
*/
this._source = source;
[this._name, this._resourceType] = this.constructor.objectName(objects[0])
Member:

Basing the properties of a collection on its first element is a common source of bugs (e.g. when that first element is non-representative of the collection as a whole). It would be better to be explicit.
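One explicit alternative, as a sketch (names hypothetical):

// Derive the collection's name from every member and fail loudly on
// disagreement, rather than trusting objects[0] to be representative.
function collectionName(objects, objectName) {
  const names = new Set(objects.map((object) => objectName(object)[0]));
  if (names.size !== 1) {
    throw new Error(`Conflicting names in collection: ${[...names].join(', ')}`);
  }
  return names.values().next().value;
}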

Comment on lines +258 to +261
createPage({
path: "/pathogens/inputs",
component: path.resolve("src/sections/remote-inputs.jsx")
});
Member:

The stuff under files/workflows/… and files/datasets/… is explicitly not just "inputs", so even though this page is preliminary I think we should be calling it something else.

if (err) reject(err);
const orderedKeys = data.Contents
.map((object) => object.Key)
.filter((key) => key.endsWith('/manifest.json'))
Member:

It would be better to explicitly match and filter on the whole key, e.g. (pseudo-code; prefix would need its regexp metachars escaped)

.filter((key) => key.match(new RegExp(`${prefix}/\\d{4}-\\d{2}-\\d{2}T\\d{2}-\\d{2}Z/manifest[.]json$`)))

or the equivalent behaviour via another approach.

Otherwise it's easy for this to start mismatching unexpectedly in the future.

Comment on lines +66 to +67
/* NOTE - the Body property of the response is a Buffer in S3 SDK v2 (which we use)
but this has changed to a readable stream in SDK v3 */
Member:

We previously used v3 in other parts of the codebase, but that's (IIRC) since been removed with the wrap-up of the groups migration. I think we should be using v3 in new code (which is why I had done so previously for the groups migration tooling).
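For reference, a v3 version of fetching and gunzipping an inventory might look like this sketch (assuming a recent @aws-sdk/client-s3 in which the response Body exposes the transformToByteArray() helper):

const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const zlib = require('zlib');

// In SDK v3 the Body is a readable stream, not a Buffer as in v2.
async function fetchInventoryCsv(bucket, key) {
  const client = new S3Client({});
  const response = await client.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const bytes = await response.Body.transformToByteArray();
  return zlib.gunzipSync(Buffer.from(bytes)).toString('utf8');
}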

Comment on lines +33 to +42
// The following is a MVP to re-collect every 24 hours
setInterval(
async () => {
for (const [,source] of sources) {
// collect resources (fetch, parse, store) sequentially
await source.collectResources();
}
},
1000 * 60 * 60 * 24 // 24 hours in ms
)
Member:

I don't think we should be doing this sort of collection/indexing inside the web server process.
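A sketch of that separation (all names illustrative, and collectResources() is assumed here to return the parsed data rather than store it): a standalone process, run on a schedule, writes an artifact that the web server only reads:

const fs = require('fs/promises');

// Standalone indexer: fetch + parse each source's inventory, then write a
// single artifact for the web server to load. Run via cron/scheduler, not
// inside the server process.
async function buildIndex(sources, outPath) {
  const index = {};
  for (const [id, source] of sources) {
    index[id] = await source.collectResources();
  }
  await fs.writeFile(outPath, JSON.stringify(index));
}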

@victorlin victorlin mentioned this pull request Oct 4, 2023
… with changes made by James on the AWS Console.

I copied the policy¹ as JSON from AWS Console and pasted directly into
the file. After reordering to match the existing file contents, which
are sorted to be more readable rather than alphabetical, I confirmed
that this updated version of the file results in no changes with
`terraform plan`.

¹ arn:aws:iam::827581582529:policy/NextstrainDotOrgServerInstance-testing

Co-authored-by: James Hadfield <[email protected]>
jameshadfield (Member, Author):

Replaced by #803 and #719
