List datasets + files from core + staging sources #700
Conversation
Sets out the main design ideas behind how the server collects resources and makes them available via a new API. See the commentary added in the code for more details.
API design is WIP; see the commentary added in this commit. There is subtlety in how we want prefix-based filtering to work, as the _name_ of a set of datasets is not always the same as the filename, since we have the ability to group multiple filenames (keys) together. E.g. given the following collection of datasets:

- name: X_Y
- versions:
  - X_Y (current version)
  - X_Y (old version)
  - X_Y_2022-01-01 (current version, datestamped filename)

then prefix-based filtering on "X/Y/2022-01-01" could either return nothing, as it doesn't match the name assigned to the collection, or return the dataset collection restricted to 1/3 of the versions. Here I implement the former.
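A minimal sketch of the chosen semantics (the function name `filterByPrefix` and the underscore/slash name-munging shown here are illustrative assumptions, not the actual implementation):

```js
// Prefix filtering matches against the collection *name*, never the
// underlying filenames (keys), so a datestamped key can't match on its own.
function filterByPrefix(collections, prefix) {
  const words = prefix.replace(/^\/+|\/+$/g, "").split("/");   // "X/Y" -> ["X", "Y"]
  return collections.filter((collection) => {
    const nameWords = collection.name.split("_");              // "X_Y" -> ["X", "Y"]
    return words.every((word, i) => word === nameWords[i]);
  });
}

// filterByPrefix(collections, "X/Y")            -> the X_Y collection, all 3 versions
// filterByPrefix(collections, "X/Y/2022-01-01") -> [] (doesn't match the collection's name)
```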
Code is taken from my existing prototype of the cards UI at https://github.com/nextstrain/workflow-asset-summary/tree/6730c2544d31b63bbe274c96b31e524176e03ca2 with widespread modifications to adapt it to this repo's design, including:

- typescript -> javascript (a big downgrade)
- CSS changes
- API response schema is slightly different
- disables the "drawer" drop-down for cards, as that functionality is not possible using the approach from the prototype (due to d3)
- removal of the version "rugplot", which relied on d3 controlling the DOM
- rewriting the filtering algorithm
- the card hierarchy is created client-side rather than server-side
The spinner design is copied from multiple other Nextstrain projects. (I was surprised this wasn't already part of nextstrain.org)
Reflects the new cards UI as the main interface to the datasets. The previous datasets+narratives table is restricted to narratives-only, as these are not yet (or may never be) part of the new cards UI. The colourful tiles are restricted to 6 pathogens, which on most screen resolutions render as 3 columns x 2 rows. The removed pathogens are not being regularly updated, and the ultimate aim is for datasets to be highlighted through the new UI.
The new URL ("/pathogens/inputs") is not advertised elsewhere, as this page will probably see modifications in the short term. We need to decide what kinds of files we want listed and how, if at all, we should link them to their associated datasets.
The previous dev-only approach is preserved under a `LOCAL_INVENTORY` env flag.
The staging page was previously broken due to an underlying API failure (?) so here we essentially replace the entire page. The spark-line is slightly misleading for sources where we don't store past versions of datasets (in this case, the underlying S3 bucket isn't versioned). Nonetheless, some datasets are grouped together by our name munging rules. I didn't implement a files (inputs) page for staging as we essentially have no non-dataset files in the bucket!
This uses a simple interval-based approach, which should be enough for the core & staging sources. Ideally we would re-fetch the inventory shortly after its creation. The time of day at which a new inventory appears seems random, so this would require using events, or polling the inventory directory more frequently and only updating when a new manifest appears. The directory name of inventory manifests seems to be consistently YYYY-MM-DD + "T01-00Z" (e.g. 2023-07-12T01-00Z/), so perhaps it's enough to make HEAD requests for the upcoming day's data?
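A rough sketch of that HEAD-request idea, using AWS SDK v3 (per the discussion below about preferring v3 in new code); the bucket/prefix arguments and the `manifestExists` helper are hypothetical:

```js
import { S3Client, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

/* Check whether a given day's inventory manifest exists yet, relying on the
   apparently-consistent "YYYY-MM-DDT01-00Z/" directory naming noted above. */
async function manifestExists(bucket, prefix, date) {
  const day = date.toISOString().slice(0, 10);            // YYYY-MM-DD
  const key = `${prefix}/${day}T01-00Z/manifest.json`;
  try {
    await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
    return true;                                          // manifest has appeared
  } catch (err) {
    if (err.name === "NotFound") return false;            // not published yet; try again later
    throw err;
  }
}
```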
```js
  )
}

function Details({data, view}) {
```
[Potentially blocking] The ability to show more detailed information about versions below a card has been disabled here - partly as it requires a bit of polish, and partly as it used d3 to manipulate the DOM and d3 + gatsby = 😡.
Because we group datestamped datasets with the non-datestamped versions (i.e. `.../6m/2023-01-01` is shown as a version of `.../6m`, rather than as its own dataset), there is currently no way to actually access this datestamped dataset. These were exposed in the (now removed) table, so this is arguably a downgrade. Although that table was incomplete, so...
```js
const [state, setState] = useState([]);
const counts = {};
useEffect(() => {
  console.log("----- useEffect // useFilterOptions")
```
Logs will be removed before merge, but they may be helpful for people during review
```js
  Array.from(files).map(([, objects]) => new CollectedResources(this, objects))
);
// TODO XXX narratives
}
```
There's an inherent linking between (some) datasets and (some) intermediate files. It would be nice to surface this - e.g. when looking at the card for zika you could see the available intermediate files as well. I ummed and ahhed about trying to build this into this version but ultimately left it as a future todo. Beyond the obvious "how do we programmatically know which datasets are connected to which files", the desired UI needs to be worked out before we know how best to approach it. Note that such a linking is cross-source, so if it's done ahead of time on the server, we'll have to perform a per-linked-resource authz filtering (e.g. the intermediate files for a dataset might be stored in a source you don't have permission to view).
> Note that such a linking is cross-source, so if it's done ahead of time on the server, we'll have to perform a per-linked-resource authz filtering (e.g. the intermediate files for a dataset might be stored in a source you don't have permission to view).
I think this betrays a misunderstanding: that the `Source` classes are about everything in e.g. a bucket or a GitHub repo. They're very much not. They're just about the datasets and narratives (the `Resource` classes). I don't think we should be thinking about the intermediates we're trying to expose here thru the lens of our `Source` classes.
It's been a long time since I've been burned by MacOS' filesystem being case-insensitive, but it happens!
```js
  CoreCollectedResources : CoreStagingCollectedResources;
const datasets = new Map(), files = new Map();
s3objects.forEach((object) => {
  const [name, resourceType] = CollectedResources.objectName(object);
```
Grouping of resources is done by those which have the same name, so there are decisions to be made around what name to give things. For instance, I think it's clear that we should group datestamped datasets with their non-datestamped selves. We should probably also group `.zst` and `.gz` files together (although here I don't). Would be great to have others' thoughts.
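For what it's worth, a tiny sketch of what grouping compressed variants might look like (`stripCompression` is a hypothetical helper, not part of this PR):

```js
// Munge the key so .gz/.zst/.xz variants share a name and are collected together
function stripCompression(key) {
  return key.replace(/\.(gz|zst|xz)$/, "");
}

// stripCompression("files/zika/metadata.tsv.gz")  -> "files/zika/metadata.tsv"
// stripCompression("files/zika/metadata.tsv.zst") -> "files/zika/metadata.tsv"
```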
> For instance, I think it's clear that we should group datestamped-datasets with their non-datestamped selves.
See my comment elsewhere (and earlier), but I don't agree this is clear!
Mostly a note to self. Things to make sure to review here based on the Blab Nextstrain meeting just now:
```js
 * We can perform authorization here by either calling
 * authz.authorized(user, authz.actions.Read, object),
 * See note above for more details.
 */
```
@tsibley threading the authz conversation here. For our other APIs, the authz check is done on a single resource (a `Resource` subclass instance). Here we have a collection of resources, which may include resources with different filenames etc. We can easily perform an authorization check on the `CollectedResources` instance, but to stay properly aligned with the current approach the check should potentially be on each individual resource within the collection. If the latter is indeed what we want to do, it would be another reason to extend the `Resource` subclasses such that the individual resources within a collection are themselves instances of `Resource`.
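A minimal sketch of the per-resource option, assuming the individual versions become `Resource` instances and reusing the `authz.authorized(user, action, object)` call quoted above; `visibleResources` is a hypothetical name:

```js
// Keep only the members of a collection the user is allowed to read, so the
// API response never leaks resources from sources the user can't view.
function visibleResources(user, collection) {
  return collection.resources.filter((resource) =>
    authz.authorized(user, authz.actions.Read, resource)
  );
}
```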
I think it would be a mistake to treat /x/y/z and /x/y/z/YYYY-MM-DD as the same thing in the current data model. And thus a mistake to authz them as the same thing too. To do so would be to dramatically undermine the current data model and increase complexity of reasoning about authz.
I know we think of these two things as the same, but it's complicated (as you've seen!) to do so systematically and thus programmatically. Instead, I think we should only rely on more systematic versioning like S3's (and not our ad-hoc dated forms) when collecting the version history of resources. That is, we should only consider the S3 version history of /x/y/z and not include the /x/y/z/YYYY-MM-DD dated forms in the history of /x/y/z. This means downplaying or filtering out the dated forms, but I think that's acceptable (?) since the S3 version history still exists and (thanks to your work here!) we'll have access to the systematic versioning going forward.
I read thru this linearly, leaving comments on specifics I saw as I went. I mostly skipped over the UI bits as I wanted to focus on the bigger picture. The major takeaways for me:
- I don't think the internal design/data model here is the right one, as I elaborated elsewhere.
- I don't think we should be collapsing /x/y/z and /x/y/z/YYYY-MM-DD, as I elaborated on elsewhere. I don't think this would be a loss of any actual versions, and it makes the conceptual integration with the codebase a lot lot simpler.
- I think the collection/indexing should be happening in a separate process, external from the web server, producing an artifact (a giant JSON blob, a SQLite database, etc) that's consumed by the web server.
- A lot of the code seems to me to suffer from accidental complexity that's not been managed away/minimized. As accidental complexity builds up it drags down a codebase and makes things harder to understand/modify/maintain. I think of it a bit like barnacles or other biofouling on a ship. I realize some of this may be attributable to being a WIP/prototype, but I think it'll be important to address beyond that point rather than forget about.
Those are all really frank, but don't get me wrong, I think we absolutely want the functionality you've prototyped here! I think it will be an important set of features and foundational to future work, enough that we'll really want it on solid footings.
This might be a good focus (like we've been discussing) for the team to collaborate on? Or at least you and I? At least I'd be interested in elaborating in code instead of just commentary. And maybe you'd be interested in further refining the UI-side in parallel?
```diff
@@ -60,7 +60,7 @@ const Groups = (() => {
   updateKnownDatasetsAndNarratives,
   1000 * 60 * 60 * 6 // 6 hours
 );
-updateKnownDatasetsAndNarratives(); // initial collection on server start
+// updateKnownDatasetsAndNarratives(); // initial collection on server start
```
What's the need for this temporary [drop] commit?
```js
 * The entire set of these resources is large, and this API hasn't been
 * tested for performance. Currently the set is held in memory and
 * filtering operations are performed for each request.
```
For prototyping this is fine, for sure, but I think we should consider other implementations here (SQLite) before we go much further.
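To make that suggestion concrete, a sketch of what a SQLite-backed index could look like, assuming the `better-sqlite3` package and a hypothetical `resources` table written by a separate collection process:

```js
const Database = require("better-sqlite3");

// Read-only handle; the indexing process (not the web server) writes this file
const db = new Database("resources.sqlite", { readonly: true });

// Filtering happens in the database per request instead of over an in-memory set
function filterResources(prefix) {
  return db
    .prepare("SELECT name, resource_type, versions FROM resources WHERE name LIKE ? || '%'")
    .all(prefix);
}
```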
```js
 * Handler for /charon/getAvailable/V2 requests
 *
 * WIP: The getAvailable API design should be flexible enough to access
 * resources across the nextstrain ecosystem. This is a more flexible
 * definition of resource than our Resource class as this may include
 * intermediate files (sequences, alignments, metadata) which are not
 * fetched by Auspice.
```
Why is this endpoint part of the Charon API, which is tightly coupled with Auspice, instead of elsewhere?
```js
[resourceType, resources.map((resource) => {
  return {
    name: resource.name,
    url: resource.nextstrainUrl(),
```
I don't know what `nextstrainUrl` is here (haven't read there yet), but it's odd to me to have "nextstrain" in the property name. My guess is that `url` is the internal URL and "nextstrain" means the URL on nextstrain.org? If so, maybe call it `publicUrl` instead? Or if it's really a path fragment and not a URL, then `path`?
```js
 * `nextstrainUrl` method on the respective source.
 */
this._source = source;
[this._name, this._resourceType] = this.constructor.objectName(objects[0])
```
Basing the properties of a collection on its first element is a common source of bugs (e.g. when that first element is non-representative of the collection as a whole). It would be better to be explicit.
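For example, something along these lines (a sketch only) would make the invariant explicit instead of trusting `objects[0]`:

```js
// Compute the (name, resourceType) for every object and assert they agree,
// rather than silently adopting whatever the first object happens to be.
const named = objects.map((object) => this.constructor.objectName(object));
const [name, resourceType] = named[0];
if (!named.every(([n, t]) => n === name && t === resourceType)) {
  throw new Error(`CollectedResources given a mixed set of objects (expected ${name})`);
}
[this._name, this._resourceType] = [name, resourceType];
```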
```js
createPage({
  path: "/pathogens/inputs",
  component: path.resolve("src/sections/remote-inputs.jsx")
});
```
The stuff under `files/workflows/…` and `files/datasets/…` is explicitly not just "inputs", so even though this page is preliminary I think we should be calling it something else.
```js
if (err) reject(err);
const orderedKeys = data.Contents
  .map((object) => object.Key)
  .filter((key) => key.endsWith('/manifest.json'))
```
It would be better to explicitly match and filter on the whole key, e.g. (pseudo-code, needs regexp metachar escaping)

```js
.filter(key => key.match(new RegExp(`${prefix}/\d{4}-\d{2}-\d{2}T\d{2}-\d{2}Z/manifest[.]json$`)))
```
or the equivalent behaviour via another approach.
Otherwise it's easy for this to start mismatching unexpectedly in the future.
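One way that pseudo-code could be realized as runnable JS, with the metachar escaping it calls for (`escapeRegExp` is a standard helper, inlined here for self-containment; note that `\d` must be written `\\d` inside a string handed to the `RegExp` constructor):

```js
// Escape regexp metacharacters so a literal prefix can't alter the pattern
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

const manifestRe = new RegExp(
  `^${escapeRegExp(prefix)}/\\d{4}-\\d{2}-\\d{2}T\\d{2}-\\d{2}Z/manifest[.]json$`
);

const orderedKeys = data.Contents
  .map((object) => object.Key)
  .filter((key) => manifestRe.test(key));
```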
```js
/* NOTE - the Body property of the response is a Buffer in S3 SDK v2 (which we use)
   but this has changed to a readable stream in SDK v3 */
```
We previously used v3 in other parts of the codebase, but that's (IIRC) since been removed with the wrap up of the groups migration. I think we should be using v3 in new code (which is why I had done so previously for the groups migration tooling).
```js
// The following is an MVP to re-collect every 24 hours
setInterval(
  async () => {
    for (const [,source] of sources) {
      // collect resources (fetch, parse, store) sequentially
      await source.collectResources();
    }
  },
  1000 * 60 * 60 * 24 // 24 hours in ms
)
```
I don't think we should be doing this sort of collection/indexing inside the web server process.
… with changes made by James on the AWS Console. I copied the policy¹ as JSON from AWS Console and pasted directly into the file. After reordering to match the existing file contents, which are sorted to be more readable rather than alphabetical, I confirmed that this updated version of the file results in no changes with `terraform plan`. ¹ arn:aws:iam::827581582529:policy/NextstrainDotOrgServerInstance-testing Co-authored-by: James Hadfield <[email protected]>
Force-pushed from aed9400 to a00b8e1.
This is a pretty major PR and most detail is in commit messages + comments in the code. A high-level overview of the functionality introduced:

- For the core & staging sources (`s3://nextstrain-data` and `s3://nextstrain-staging`, respectively) we create daily inventories on S3, which are stored in the private bucket `s3://nextstrain-inventories`. These are essentially free to generate (<$1/year/bucket).
- The `/pathogens` page shows the core datasets using the UI I recently prototyped. A new `/pathogens/inputs` page shows the core files. The `/staging` page (previously not working) also now uses this new UI.

As is the case with feature pushes of this scope there is no clear end point. There are a huge number of potential improvements we can make, so for this PR please indicate if you consider something blocking vs an improvement we can tackle in subsequent PRs 🙏
From my point of view, the following needs to be done before merge (in the same vein as how I approached the prototype, I'm trying to share work at an early stage):
I'll make some notes via in-line comments about feature pushes I don't think are blocking here, but which come up as a result of this work, so that discussions can be threaded.
Testing
It should 🤞 all work via review apps. To test locally, you can avoid the S3 API calls by creating a (git-ignored) `./devData` folder and adding a manifest + inventory for each bucket with the following filenames: (You can pick any day's inventory; that's not so important for dev purposes.) Then run the server with a `LOCAL_INVENTORY=true` environment variable.

P.S. "manifest JSON" here refers to the S3 inventory manifest and is completely unrelated to our existing usage of manifest JSONs. However, this work should eventually allow us to remove those manifests. So that's nice.