List datasets + files from core + staging sources #700
Conversation
Sets out the main design ideas behind how the server collects resources and makes them available via a new API. See the commentary added in the code for more details.
API design is WIP; see the commentary added in this commit. There is subtlety in how we want prefix-based filtering to work, as the _name_ of a set of datasets is not always the same as the filename, since we have the ability to group multiple filenames (keys) together. E.g. given the following collection of datasets:

- name: X_Y
- versions:
  - X_Y (current version)
  - X_Y (old version)
  - X_Y_2022-01-01 (current version, datestamped filename)

then prefix-based filtering on "X/Y/2022-01-01" could either return nothing, as it doesn't match the name assigned to the collection, or return the dataset collection restricted to 1/3 of the versions. Here I implement the former.
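A minimal sketch of the chosen semantics (the function name `filterByPrefix` and the underscore/slash name-munging shown here are illustrative assumptions, not the actual implementation):

```js
// Prefix filtering matches against the collection *name*, never the
// underlying filenames (keys), so a datestamped key can't match on its own.
function filterByPrefix(collections, prefix) {
  const words = prefix.replace(/^\/+|\/+$/g, "").split("/");   // "X/Y" -> ["X", "Y"]
  return collections.filter((collection) => {
    const nameWords = collection.name.split("_");              // "X_Y" -> ["X", "Y"]
    return words.every((word, i) => word === nameWords[i]);
  });
}

// filterByPrefix(collections, "X/Y")            -> the X_Y collection, all 3 versions
// filterByPrefix(collections, "X/Y/2022-01-01") -> [] (doesn't match the collection's name)
```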
Code is taken from my existing prototype of the cards UI at https://github.com/nextstrain/workflow-asset-summary/tree/6730c2544d31b63bbe274c96b31e524176e03ca2 with widespread modifications to adapt it to this repo's design, including:

- typescript -> javascript (a big downgrade)
- CSS changes
- API response schema is slightly different
- disables the "drawer" drop-down for cards, as that functionality is not possible using the approach from the prototype (due to d3)
- removal of the version "rugplot", which relied on d3 controlling the DOM
- rewriting the filtering algorithm
- the card hierarchy is created client-side rather than server-side
The spinner design is copied from multiple other Nextstrain projects. (I was surprised this wasn't already part of nextstrain.org)
Reflects the new cards UI as the main interface to the datasets. The previous datasets+narratives table is restricted to narratives-only, as these are not yet (or may never be) part of the new cards UI. The colourful tiles are restricted to 6 pathogens, which on most screen resolutions render as 3 columns x 2 rows. The removed pathogens are not being regularly updated, and the ultimate aim is for datasets to be highlighted through the new UI.
The new URL ("/pathogens/inputs") is not advertised elsewhere, as this page will probably see modifications in the short term. We need to decide what kinds of files we want listed and how, if at all, we should link them to their associated datasets.
The previous dev-only approach is preserved under a `LOCAL_INVENTORY` env flag.
The staging page was previously broken due to an underlying API failure (?) so here we essentially replace the entire page. The spark-line is slightly misleading for sources where we don't store past versions of datasets (in this case, the underlying S3 bucket isn't versioned). Nonetheless, some datasets are grouped together by our name munging rules. I didn't implement a files (inputs) page for staging as we essentially have no non-dataset files in the bucket!
This uses a simple interval-based approach, which should be enough for the core & staging sources. Ideally we would re-fetch the inventory shortly after its creation. The time of day at which a new inventory appears seems random, so this would require using events, or polling the inventory directory more frequently and only updating when a new manifest appears. The directory name of inventory manifests seems to be consistently YYYY-MM-DD + "T01-00Z" (e.g. 2023-07-12T01-00Z/), so perhaps it's enough to make HEAD requests for the upcoming day's data?
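A rough sketch of that HEAD-request idea, using AWS SDK v3 (per the discussion below about preferring v3 in new code); the bucket/prefix arguments and the `manifestExists` helper are hypothetical:

```js
import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

/* Check whether a given day's inventory manifest exists yet, relying on the
   apparently-consistent "YYYY-MM-DDT01-00Z/" directory naming noted above. */
async function manifestExists(bucket, prefix, date) {
  const day = date.toISOString().slice(0, 10);            // YYYY-MM-DD
  const key = `${prefix}/${day}T01-00Z/manifest.json`;
  try {
    await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
    return true;                                          // manifest has appeared
  } catch (err) {
    if (err.name === "NotFound") return false;            // not published yet; try again later
    throw err;
  }
}
```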
```js
  )
}

function Details({data, view}) {
```
[Potentially blocking] The ability to show more detailed information about versions below a card has been disabled here - partly as it requires a bit of polish, and partly as it used d3 to manipulate the DOM and d3 + gatsby = 😡.
Because we group datestamped datasets with the non-datestamped versions (i.e. `.../6m/2023-01-01` is shown as a version of `.../6m`, rather than as its own dataset), there is currently no way to actually access this datestamped dataset. These were exposed in the (now removed) table, so this is arguably a downgrade. Although that table was incomplete, so...
```js
const [state, setState] = useState([]);
const counts = {};
useEffect(() => {
  console.log("----- useEffect // useFilterOptions")
```
Logs will be removed before merge, but they may be helpful for people during review
```js
  Array.from(files).map(([, objects]) => new CollectedResources(this, objects))
);
// TODO XXX narratives
}
```
There's an inherent linking between (some) datasets and (some) intermediate files. It would be nice to surface this - e.g. when looking at the card for zika you could see the available intermediate files as well. I ummed and ahhed about trying to build this into this version but ultimately left it as a future todo. Beyond the obvious "how do we programmatically know which datasets are connected to which files", the desired UI needs to be worked out before we know how best to approach it. Note that such a linking is cross-source, so if it's done ahead of time on the server, we'll have to perform a per-linked-resource authz filtering (e.g. the intermediate files for a dataset might be stored in a source you don't have permission to view).
> Note that such a linking is cross-source, so if it's done ahead of time on the server, we'll have to perform a per-linked-resource authz filtering (e.g. the intermediate files for a dataset might be stored in a source you don't have permission to view).
I think this betrays a misunderstanding: that the `Source` classes are about everything in e.g. a bucket or a GitHub repo. They're very much not. They're just about the datasets and narratives (the `Resource` classes). I don't think we should be thinking about the intermediates we're trying to expose here thru the lens of our `Source` classes.
It's been a long time since I've been burned by MacOS' filesystem being case-insensitive, but it happens!
```js
  CoreCollectedResources : CoreStagingCollectedResources;
const datasets = new Map(), files = new Map();
s3objects.forEach((object) => {
  const [name, resourceType] = CollectedResources.objectName(object);
```
Grouping of resources is done by those which have the same name, so there are decisions to be made around what name to give things. For instance, I think it's clear that we should group datestamped datasets with their non-datestamped selves. We should probably also group `.zst` and `.gz` files together (although here I don't). Would be great to have others' thoughts.
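For what it's worth, a tiny sketch of what grouping compressed variants might look like (`stripCompression` is a hypothetical helper, not part of this PR):

```js
// Munge the key so .gz/.zst/.xz variants share a name and are collected together
function stripCompression(key) {
  return key.replace(/\.(gz|zst|xz)$/, "");
}

// stripCompression("files/zika/metadata.tsv.gz")  -> "files/zika/metadata.tsv"
// stripCompression("files/zika/metadata.tsv.zst") -> "files/zika/metadata.tsv"
```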
> For instance, I think it's clear that we should group datestamped-datasets with their non-datestamped selves.
See my comment elsewhere (and earlier), but I don't agree this is clear!
Mostly a note to self. Things to make sure to review here based on the Blab Nextstrain meeting just now:
```js
 * We can perform authorization here by either calling
 * authz.authorized(user, authz.actions.Read, object),
 * See note above for more details.
 */
```
@tsibley threading the authz conversation here. For our other APIs, the authz check is done on a single resource (a `Resource` subclass instance). Here we have a collection of resources, which may include resources with different filenames etc. We can easily perform an authorization check on the `CollectedResources` instance, but to stay properly aligned with the current approach the check should potentially be on each individual resource within the collection. If the latter is indeed what we want to do, it would be another reason to extend the `Resource` subclasses such that the individual resources within a collection are themselves instances of `Resource`.
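A minimal sketch of the per-resource option, assuming the individual versions become `Resource` instances and reusing the `authz.authorized(user, action, object)` call quoted above; `visibleResources` is a hypothetical name:

```js
// Keep only the members of a collection the user is allowed to read, so the
// API response never leaks resources from sources the user can't view.
function visibleResources(user, collection) {
  return collection.resources.filter((resource) =>
    authz.authorized(user, authz.actions.Read, resource)
  );
}
```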
I think it would be a mistake to treat /x/y/z and /x/y/z/YYYY-MM-DD as the same thing in the current data model. And thus a mistake to authz them as the same thing too. To do so would be to dramatically undermine the current data model and increase complexity of reasoning about authz.
I know we think of these two things as the same, but it's complicated (as you've seen!) to do so systematically and thus programmatically. Instead, I think we should only rely on more systematic versioning like S3's (and not our ad-hoc dated forms) when collecting the version history of resources. That is, we should only consider the S3 version history of /x/y/z and not include the /x/y/z/YYYY-MM-DD dated forms in the history of /x/y/z. This means downplaying or filtering out the dated forms, but I think that's acceptable (?) since the S3 version history still exists and (thanks to your work here!) we'll have access to the systematic versioning going forward.
I read thru this linearly, leaving comments on specifics I saw as I went. I mostly skipped over the UI bits as I wanted to focus on the bigger picture. The major takeaways for me:
- I don't think the internal design/data model here is the right one, as I elaborated elsewhere.
- I don't think we should be collapsing /x/y/z and /x/y/z/YYYY-MM-DD, as I elaborated on elsewhere. I don't think this would be a loss of any actual versions, and it makes the conceptual integration with the codebase a lot lot simpler.
- I think the collection/indexing should be happening in a separate process, external from the web server, producing an artifact (a giant JSON blob, a SQLite database, etc) that's consumed by the web server.
- A lot of the code seems to me to suffer from accidental complexity that's not been managed away/minimized. As accidental complexity builds up it drags down a codebase and makes things harder to understand/modify/maintain. I think of it a bit like barnacles or other biofouling on a ship. I realize some of this may be attributable to being a WIP/prototype, but I think it'll be important to address beyond that point rather than forget about.
Those are all really frank, but don't get me wrong, I think we absolutely want the functionality you've prototyped here! I think it will be an important set of features and foundational to future work, enough that we'll really want it on solid footings.
This might be a good focus (like we've been discussing) for the team to collaborate on? Or at least you and I? At least I'd be interested in elaborating in code instead of just commentary. And maybe you'd be interested in further refining the UI-side in parallel?
```diff
@@ -60,7 +60,7 @@ const Groups = (() => {
   updateKnownDatasetsAndNarratives,
   1000 * 60 * 60 * 6 // 6 hours
 );
-updateKnownDatasetsAndNarratives(); // initial collection on server start
+// updateKnownDatasetsAndNarratives(); // initial collection on server start
```
What's the need for this temporary [drop] commit?
```js
 * The entire set of these resources is large, and this API hasn't been
 * tested for performance. Currently the set is held in memory and
 * filtering operations are performed for each request.
```
For prototyping this is fine, for sure, but I think we should consider other implementations here (SQLite) before we go much further.
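To make that suggestion concrete, a sketch of what a SQLite-backed index could look like, assuming the `better-sqlite3` package and a hypothetical `resources` table written by a separate collection process:

```js
const Database = require("better-sqlite3");

// Read-only handle; the indexing process (not the web server) writes this file
const db = new Database("resources.sqlite", { readonly: true });

// Filtering happens in the database per request instead of over an in-memory set
function filterResources(prefix) {
  return db
    .prepare("SELECT name, resource_type, versions FROM resources WHERE name LIKE ? || '%'")
    .all(prefix);
}
```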
```js
 * Handler for /charon/getAvailable/V2 requests
 *
 * WIP: The getAvailable API design should be flexible enough to access
 * resources across the nextstrain ecosystem. This is a more flexible
 * definition of resource than our Resource class as this may include
 * intermediate files (sequences, alignments, metadata) which are not
 * fetched by Auspice.
```
Why is this endpoint part of the Charon API, which is tightly coupled with Auspice, instead of elsewhere?
```js
[resourceType, resources.map((resource) => {
  return {
    name: resource.name,
    url: resource.nextstrainUrl(),
```
I don't know what `nextstrainUrl` is here (haven't read there yet), but it's odd to me to have "nextstrain" in the property name. My guess is that `url` is the internal URL and "nextstrain" means the URL on nextstrain.org? If so, maybe call it `publicUrl` instead? Or if it's really a path fragment and not a URL, then `path`?
```js
 * `nextstrainUrl` method on the respective source.
 */
this._source = source;
[this._name, this._resourceType] = this.constructor.objectName(objects[0])
```
Basing the properties of a collection on its first element is a common source of bugs (e.g. when that first element is non-representative of the collection as a whole). It would be better to be explicit.
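For example, something along these lines (a sketch only) would make the invariant explicit instead of trusting `objects[0]`:

```js
// Compute the (name, resourceType) for every object and assert they agree,
// rather than silently adopting whatever the first object happens to be.
const named = objects.map((object) => this.constructor.objectName(object));
const [name, resourceType] = named[0];
if (!named.every(([n, t]) => n === name && t === resourceType)) {
  throw new Error(`CollectedResources given a mixed set of objects (expected ${name})`);
}
[this._name, this._resourceType] = [name, resourceType];
```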
```js
createPage({
  path: "/pathogens/inputs",
  component: path.resolve("src/sections/remote-inputs.jsx")
});
```
The stuff under `files/workflows/…` and `files/datasets/…` is explicitly not just "inputs", so even though this page is preliminary I think we should be calling it something else.
```js
if (err) reject(err);
const orderedKeys = data.Contents
  .map((object) => object.Key)
  .filter((key) => key.endsWith('/manifest.json'))
```
It would be better to explicitly match and filter on the whole key, e.g. (pseudo-code, needs regexp metachar escaping)

```js
.filter(key => key.match(new RegExp(`${prefix}/\d{4}-\d{2}-\d{2}T\d{2}-\d{2}Z/manifest[.]json$`)))
```
or the equivalent behaviour via another approach.
Otherwise it's easy for this to start mismatching unexpectedly in the future.
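One way that pseudo-code could be realized as runnable JS, with the metachar escaping it calls for (`escapeRegExp` is a standard helper, inlined here for self-containment; note that `\d` must be written `\\d` inside a string handed to the `RegExp` constructor):

```js
// Escape regexp metacharacters so a literal prefix can't alter the pattern
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const manifestRe = new RegExp(
  `^${escapeRegExp(prefix)}/\\d{4}-\\d{2}-\\d{2}T\\d{2}-\\d{2}Z/manifest[.]json$`
);

const orderedKeys = data.Contents
  .map((object) => object.Key)
  .filter((key) => manifestRe.test(key));
```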
```js
/* NOTE - the Body property of the response is a Buffer in S3 SDK v2 (which we use)
   but this has changed to a readable stream in SDK v3 */
```
We previously used v3 in other parts of the codebase, but that's (IIRC) since been removed with the wrap up of the groups migration. I think we should be using v3 in new code (which is why I had done so previously for the groups migration tooling).
```js
// The following is an MVP to re-collect every 24 hours
setInterval(
  async () => {
    for (const [,source] of sources) {
      // collect resources (fetch, parse, store) sequentially
      await source.collectResources();
    }
  },
  1000 * 60 * 60 * 24 // 24 hours in ms
)
```
I don't think we should be doing this sort of collection/indexing inside the web server process.
… with changes made by James on the AWS Console. I copied the policy¹ as JSON from AWS Console and pasted directly into the file. After reordering to match the existing file contents, which are sorted to be more readable rather than alphabetical, I confirmed that this updated version of the file results in no changes with `terraform plan`. ¹ arn:aws:iam::827581582529:policy/NextstrainDotOrgServerInstance-testing Co-authored-by: James Hadfield <[email protected]>
Force-pushed from aed9400 to a00b8e1.
This is a pretty major PR and most detail is in commit messages + comments in the code. A high-level overview of the functionality introduced:

- For the core & staging sources (`s3://nextstrain-data` and `s3://nextstrain-staging`, respectively) we create daily inventories on S3, which are stored in the private bucket `s3://nextstrain-inventories`. These are essentially free to generate (<$1/year/bucket).
- The `/pathogens` page shows the core datasets using the UI I recently prototyped. A new `/pathogens/inputs` page shows the core files. The `/staging` page (previously not working) also now uses this new UI.

As is the case with feature pushes of this scope there is no clear end point. There are a huge number of potential improvements we can make, so for this PR please indicate if you consider something blocking vs an improvement we can tackle in subsequent PRs 🙏
From my point of view, the following needs to be done before merge (in the same vein as how I approached the prototype, I'm trying to share work at an early stage):
I'll make some notes via in-line comments about feature pushes I don't think are blocking here, but which come up as a result of this work, so that discussions can be threaded.
Testing
It should 🤞 all work via review apps. To test locally, you can avoid the S3 API calls by creating a (git-ignored) `./devData` folder and adding a manifest + inventory for each bucket with the following filenames: (You can pick any day's inventory; that's not so important for dev purposes.) Then run the server with a `LOCAL_INVENTORY=true` environment variable.

P.S. "manifest JSON" here refers to the S3 inventory manifest and is completely unrelated to our existing usage of manifest JSONs. However, this work should eventually allow us to remove those manifests. So that's nice.