Cataloging many Intake-ESM datastores #613
dougiesquire started this conversation in General
I’m curious about approaches for cataloging large numbers of Intake-ESM datastores. The conventional approach of nesting Intake YAMLFileCatalogs (e.g. the Pangeo Intake catalog) works great when there are only a few Intake-ESM datastores with a clear hierarchy, but data search/discovery is pretty limited when there are many datastores. My experience is that, to some extent, users have to know what they’re looking for in order to be able to use the catalog effectively.
My simple attempt to improve the user experience was to write a new Intake plugin called intake-dataframe-catalog that provides a tabular catalog of Intake sources and associated metadata. The design and API are inspired by Intake-ESM, but the entries in an intake-dataframe-catalog are other Intake sources (e.g. Intake-ESM datastores). Similar to the way that users filter for datasets using Intake-ESM, users can filter on metadata in an intake-dataframe-catalog and eventually open the sources that are of interest to them.
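To make the idea concrete, here is a rough sketch of the "table of sources plus metadata" concept using plain pandas. This is not the intake-dataframe-catalog API; the column names, the source names and the `search` / `source_names` helpers are all made up for illustration.

```python
import pandas as pd

# Hypothetical catalog table: one row per (source, metadata) combination.
# "name" identifies a source (e.g. an Intake-ESM datastore); the other
# columns are illustrative metadata, not a real schema.
catalog = pd.DataFrame(
    [
        {"name": "model_run_a", "model": "ModelX", "realm": "ocean", "variable": "temp"},
        {"name": "model_run_a", "model": "ModelX", "realm": "ocean", "variable": "salt"},
        {"name": "model_run_b", "model": "ModelX", "realm": "atmos", "variable": "tas"},
        {"name": "model_run_c", "model": "ModelY", "realm": "ocean", "variable": "temp"},
    ]
)


def search(df, **query):
    """Keep rows where every queried column matches one of the wanted values."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in query.items():
        # Accept either a single value or a collection of acceptable values.
        values = list(wanted) if isinstance(wanted, (list, tuple, set)) else [wanted]
        mask &= df[column].isin(values)
    return df[mask]


def source_names(df):
    """Return the sorted, de-duplicated names of the sources left in the table."""
    return sorted(df["name"].unique())


# Find every source that holds ocean temperature, without knowing its name up front.
matches = search(catalog, realm="ocean", variable="temp")
print(source_names(matches))
```

In the real plugin each remaining row would map back to an Intake source that can be opened; here the sketch stops at the source names.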
Here’s the intake-dataframe-catalog documentation: https://intake-dataframe-catalog.readthedocs.io/en/latest/?badge=latest
And here’s an example of an intake-dataframe-catalog of many Intake-ESM datastores: https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/quickstart.html
(Unfortunately, only those with access to Australia’s supercomputer Gadi can actually use this catalog.)
This post is partly to make people aware of intake-dataframe-catalog and partly to ask whether there are other approaches out there for solving the same issue.