Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List datasets + files from core + staging sources #700

Closed
wants to merge 13 commits into from
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,9 @@
# local sessions
/sessions/

# data for development purposes
/devData/

# Generated by scripts/collect-datasets.js
/data/datasets_influenza.json
/data/datasets_staging.json
Expand Down
80 changes: 80 additions & 0 deletions package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions package.json
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@
"make-fetch-happen": "^10.0.0",
"marked": "^0.7.0",
"mime": "^2.5.2",
"neat-csv": "^7.0.0",
"negotiator": "^0.6.2",
"node-fetch": "^2.6.0",
"passport": "^0.4.0",
Expand Down
3 changes: 3 additions & 0 deletions src/app.js
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,9 @@ if (!PRODUCTION) {
app.routeAsync("/charon/getAvailable")
.getAsync(endpoints.charon.getAvailable);

app.routeAsync("/charon/getAvailable/v2")
.getAsync(endpoints.charon.getAvailableV2);

app.routeAsync("/charon/getDataset")
.getAsync(
endpoints.charon.setSourceFromPrefix,
Expand Down
56 changes: 56 additions & 0 deletions src/endpoints/charon/getAvailableV2.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
import { CollectedSources } from '../../sources/index.js';

/**
* Handler for /charon/getAvailable/V2 requests
*
* WIP: The getAvailable API design should be flexible enough to access
* resources across the nextstrain ecosystem. This is a more flexible
* definition of resource than our Resource class as this may include
* intermediate files (sequences, alignments, metadata) which are not
* fetched by Auspice.
Comment on lines +6 to +12
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this endpoint part of the Charon API, which is tightly coupled with Auspice, instead of elsewhere?

*
* The entire set of these resources is large, and this API hasn't been
* tested for performance. Currently the set is held in memory and
* filtering operations are performed for each request.
Comment on lines +14 to +16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For prototyping this is fine, for sure, but I think we should consider other implementations here (SQLite) before we go much further.

*
* Note: we can set properties in layers of middleware, or in this handler
* since all of these properties are _only_ used by this handler
* it makes sense to place them here. If they become more widespread
* then it's easy to lift them to middleware and expose the data on context.
*
* The response is always JSON, which our frontend(s) will parse and display.
* The Accept header is not considered, however we could expand this to use
* `contentTypesProvided` and `contentTypesConsumed` to behave return different
* formats accordingly.
*
*/
const getAvailableV2 = async (req, res) => {

const availableResources = (new CollectedSources(req.user))
.resources() // TODO -- parameterise with filters

// console.log("availableResources", availableResources)

const available = Object.fromEntries(
Array.from(availableResources)
.map(([resourceType, resources]) =>
[resourceType, resources.map((resource) => {
return {
name: resource.name,
url: resource.nextstrainUrl(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what nextstrainUrl is here (haven't read there yet), but it's odd to me to have "nextstrain" in the property name. My guess is that url is the internal URL and "nextstrain" means the URL on nextstrain.org? If so, maybe call it publicUrl instead? Or if it's really a path fragment and not a URL, then path?

lastUpdated: resource.lastUpdated(),
versions: resource.versions.map((v) =>
[resource.lastUpdated(v), resource.nextstrainUrl(v)]
)
}
})])
)

const WARNING = "This API is currently for internal use only. The request and response schemas will change."
res.json({WARNING, available});
}

export {
getAvailableV2,
};

2 changes: 2 additions & 0 deletions src/endpoints/charon/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ import { splitPrefixIntoParts } from '../../utils/prefix.js';
import { setSource, setDataset, canonicalizeDataset, setNarrative } from '../sources.js';
import './setAvailableDatasets.js'; // sets globals
import { getAvailable } from './getAvailable.js';
import { getAvailableV2 } from "./getAvailableV2.js";
import { getDataset } from './getDataset.js';
import { getNarrative } from './getNarrative.js';
import { getSourceInfo } from './getSourceInfo.js';
Expand Down Expand Up @@ -54,6 +55,7 @@ export {
setNarrativeFromPrefix,

getAvailable,
getAvailableV2,
getDataset,
getNarrative,
getSourceInfo
Expand Down
72 changes: 72 additions & 0 deletions src/sources/collectedResources.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
/**
* CollectedResources represents a set of resources linked by being conceptually
* "the same". Sameness often means all available versions of the same filename
* but could also encompass different filenames, e.g. X and X_YYYY-MM-DD would
* be grouped together.
*
* The representation of resources themselves (`objects`) varies, but current implementations
* use plain objects or Maps. Methods to interact with the objects are defined by a
* subclass of `CollectedResources` which allows this flexibility in underlying data
* structure.
*
* There is a strong conceptual similarity between objects and Resource classes.
* Fundamentally Resource (sub-)classes are designed to map a nextstrain.org URL
* to the URLs which represent the associated files, however here we are going the other
* way: we want to map _files_ to their nextstrain.org URLs, where available.
* This includes concepts which Resources can't yet represent because they don't
* (yet) have associated nextstrain.org URLs -- concepts such as S3 object version.
* (GitHub-sourced files have similar issues where the branch (or commit) info is
* encoded in the Source, not the Resource.)
* While we can monkeypatch the instances, I think it's better to extend the Resource
* (sub-)classes if we want to go in that direction.
*
* Note that our RBAC authorization for resources currently uses ae policy from the
* underlying source + tags from the appropriate Resource instance; with the tags coming
* from the underling source + the resource type (currently Narrative | Dataset).
* Currently, if a user is allowed to access the source then they are allowed read access
* to all assets, so we don't perform any authorization at the resource level here.
* If this changes, or we want to do it for completness, we could easily extend the
* `authorized` function to set a policy for CollectedResources (via the associated source)
* and define `authzTags` for CollectedResources as applicable.
*/
class CollectedResources {
constructor(source, objects) {
/**
* Each entry in resources must define at least 'key', 'lastUpdated' and 'type'
* All other keys are implementation dependent as they will be parsed by the
* `nextstrainUrl` method on the respective source.
*/
this._source = source;
[this._name, this._resourceType] = this.constructor.objectName(objects[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basing the properties of a collection on its first element is a common source of bugs (e.g. when that first element is non-representative of the collection as a whole). It would be better to be explicit.

this._versions = objects
this._versions.sort((o) => this.lastUpdated(o))
}

get name() {return this._name}

get versions() {return this._versions;}

static objectName(object) { // eslint-disable-line no-unused-vars
throw new Error("static objectName(object) must be implemented by a subclass")
}

lastUpdated(version=this._versions[0]) { // eslint-disable-line no-unused-vars
throw new Error("lastUpdated(object) must be implemented by a subclass")
}

/**
* Filter currently returns a boolean, but this may be expanded upon once filtering
* is actually implemented
*/
filter() {
throw new Error("filter(args) must be implemented by a subclass")
}

nextstrainUrl(version=this._versions[0]) { // eslint-disable-line no-unused-vars
throw new Error("nextstrainUrl(object) must be implemented by a subclass")
}
}

export {
CollectedResources
};
72 changes: 72 additions & 0 deletions src/sources/collectedSources.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
import * as authz from '../authz/index.js';
import { CoreSource } from './core.js';

/**
* CollectedSources is a class which has access to all known resources
* (narratives, datasets, files) for all known sources. Instances of this class
* are used for requests for asset listings which may be within a single
* source, or may span multiple sources.
*
* The collection of assets is held in memory (via a closure) so that an
* instance of CollectedSources can access this information to respond to
* requests. The updating of the assets is deferred to each Source.
*
*/
const CollectedSources = (() => {

const sources = new Map([
['core', new CoreSource()] // move to Symbol('core')?
// TODO -- add core, all groups, fetch, community...
])

sources.get('core').collectResources()

// TODO XXX - collectResources() should be periodically run to update the sources.
// the cadence of this updating somewhat depends on how we are collecting resources
// (e.g. if the manifest updates daily, then we should re-collect daily). This could
// be implemented here or, perhaps better, within the collectResources() method itself.

class _CollectedSources {
constructor(user) {
this._user = user;
this._sources = new Map(
Array.from(sources)
.filter(([, source]) => authz.authorized(user, authz.actions.Read, source))
);
console.log("CollectedSources() user has access to:", Array.from(this._sources).map((x) => x[0]).join(", "));
}

/** TODO XXX - resources() should perform filtering on the available resources.
* The precise syntax of the filtering is API-dependent, but the concepts that could be
* applied here are:
* filter by source - somewhat analogous to `setSourceFromPrefix`
* filter by terms - somewhat analogous to pathParts, e.g. "influenza" AND "6m"
* filter by versions - boolean?
* filter by timeRange - ???
* filter by type - {dataset | file | narrative}
*/
resources() {
const resourcesByType = new Map();
this._sources.forEach((source) => {
source.allResources.forEach((resources, resourceType) => {
// console.log("CHECK", resourceType, resources.length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but commented out code is a smell and I think we should avoid it.

// Filter all resources by requested resource type (TODO, see above)
if (resourceType!='dataset') return;
// add this source's filtered resources to the result
if (!resourcesByType.has(resourceType)) resourcesByType.set(resourceType, []);
resourcesByType.set(
resourceType,
resourcesByType.get(resourceType).concat(resources.filter((resource) => resource.filter()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a weird API design to me to have a resource.filter() method—i.e. to have the resource responsible for filtering itself out based on … external criteria?—instead of having the collection do the filtering of its contents based on the collection's criteria.

It's a design that leads to tight coupling and awkward data flows where everything must be threaded thru. For example, resource.filter() has no natural access to the current user, so can't naturally do the authz check discussed elsewhere.

)
})
})
// console.log("resourcesByType", resourcesByType)
return resourcesByType;
}
}
return _CollectedSources;
})();

export {
CollectedSources,
};
Loading