Replies: 6 comments
-
A couple of follow-up thoughts. Another benefit of having a reference dataset collection like this is that, for all of the "supported" datasets for which the repo contains a working data package, one could easily write a script to traverse the directory structure, parse the data package descriptors, and generate a summary of the collection. If enough of the reference data packages encoded information about the variables represented by the different axes of the data, then one could also group together datasets indexed by shared variables, or construct a network depicting the relationships between the reference datasets. This could also fit in nicely with efforts to create useful React, etc. components for rendering data packages.
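As a rough sketch of what such a traversal script might look like (assuming each dataset directory contains a `datapackage.json`; the `datasets/` root and the summary fields are illustrative, not a proposal):

```python
import json
from pathlib import Path

def index_datasets(root: str) -> list[dict]:
    """Walk a reference-dataset repo and collect basic descriptor metadata."""
    index = []
    for descriptor_path in Path(root).rglob("datapackage.json"):
        with open(descriptor_path) as f:
            descriptor = json.load(f)
        index.append({
            "path": str(descriptor_path.parent),
            "name": descriptor.get("name"),
            "resources": [r.get("name") for r in descriptor.get("resources", [])],
        })
    return index

if __name__ == "__main__":
    # Print a one-line summary per dataset with a working descriptor.
    for entry in index_datasets("datasets"):
        print(entry["path"], "->", entry["name"], entry["resources"])
```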
-
Thanks for spearheading this issue @khughitt! I don't have much to add, except that I like the idea of making it possible to create "views" into the collection! We might also facilitate this with some custom field props that would allow us to tag / label fields for aggregation in different ways.
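For example, something along these lines (the `tags` property is a made-up name here; Table Schema field descriptors can generally carry additional custom properties):

```python
# A Table Schema field descriptor, expressed as a Python dict, with a
# hypothetical "tags" property used to label the field for aggregation.
field = {
    "name": "age",
    "type": "integer",
    "tags": ["demographic", "aggregate:mean"],  # illustrative labels only
}
```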
-
While I like the idea of showing what data/domains Data Package can support, I'm worried about maintenance. In my experience, when the owner of a contributed example dataset leaves the project, the dataset is typically orphaned, with the remaining maintainers not familiar enough with it to know why it was added or how to maintain it.
-
The main motivation is not so much to show what kinds of data Data Package supports (that's more of a bonus side effect); rather, the goal is to help us think clearly about the intended scope of the project and, as much as possible, to figure out ahead of time what types of structures are going to give us the greatest representative power down the road. My worry is that, while the original aim of the project (and one that I think is achievable) is a truly abstract container for data of all types, due simply to a biased set of viewpoints among the early drivers of the spec and an emphasis on one particular type of data (tabular data), we might end up with something that is really great for tables, and perhaps more cumbersome / less suitable for some other data types. I think your points are valid, though, and the second suggestion (creating a set of test datasets) is a much more reasonable goal.
-
Also, in the original issue description I mixed together two ideas that I think would be helpful to explicitly separate out:
1. What data structures / modalities (e.g. tables, images, spectral data) the spec should be able to represent.
2. What domains (e.g. biology, astronomy) the data comes from.

(A third consideration might be file format: e.g. CSV, Parquet, FITS, HDF5, etc., but that is related to data structure and is also easier to modify support for down the road.) I think both are useful to think about and try to plan for, but the first one will probably have a larger impact on the frictionless codebase and specs and, likewise, be harder to change once we have gone too far down the road with some particular set of assumptions about what data looks like.
-
I really like the idea of test / synthetic datasets to exercise frictionless features as well as provide examples. Regarding data generation, I have some bits of code for that here that could be adapted. I also have an as-yet-unpublished TypeScript version.
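As a very rough sketch of the kind of generator such a repo could include (the column names, types, and distributions below are arbitrary placeholders, not a proposed standard):

```python
import csv
import random

def generate_table(path: str, n_rows: int = 100, seed: int = 0) -> None:
    """Write a small synthetic CSV exercising a few basic column types."""
    random.seed(seed)  # fixed seed so the test data is reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "group", "value"])
        for i in range(n_rows):
            writer.writerow([i, random.choice("ABC"), round(random.gauss(0, 1), 3)])

generate_table("synthetic.csv")
```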
-
In order to help us think clearly about exactly what types of data Data Package is intended to support, either presently or in the future, it could be helpful to create a repo with example datasets of different types.
This would help when thinking about what the spec should look like in order to properly represent all of the intended types of data, and it would also provide a useful resource for writing test code.
For new users coming to Frictionless and wondering whether it supports their data type, this could also be a good way to get started.
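On the test-code side, for instance, a reference collection would make it easy to write smoke tests along these lines (a sketch assuming the frictionless-py package; the `datasets/` layout is hypothetical):

```python
from pathlib import Path

from frictionless import validate

# Validate every reference data package found under the repo root.
for descriptor in Path("datasets").rglob("datapackage.json"):
    report = validate(str(descriptor))
    print(descriptor, "valid" if report.valid else "INVALID")
```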
Repo structure
My first thought was to organize the repo by data type/modality (e.g. table, image, spectral, etc.), but it might actually be better to do it by domain?
This way things are grouped together logically, in a way one is more likely to encounter them in the wild, and it would allow people working in the various domains to see, at a glance, which of the data types they tend to work with are represented.
There are obviously a bunch of different ways one could organize things. There may also be existing taxonomies of data types / domains that we could work off of.
I would be careful not to worry too much about getting "the" right structure, because I don't think there is going to be a single one that works well for everything. Instead, let's just get something started, and then iterate on it and improve it as our understanding of the intended scope of the project evolves.
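To make the domain-first option concrete, the layout might look something like this (all names below are placeholders):

```
datasets/
  biology/
    single-cell-rnaseq/
      datapackage.json
      data/
  astronomy/
    spectral-survey/
      datapackage.json
      data/
```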
Dataset directory contents
(e.g. "status: xx")
How to go about creating the repo?
Possible approach:
- Sketch a possible directory structure / naming convention for how the datasets can be organized.
- Decide what the contents of each dataset directory should look like.
- Add representative datasets with the appropriate licensing.
- Write working data packages for some of the datasets.
Other considerations
@khusmann I couldn't pull up your comments from our Slack discussion about this a few months back, but I know you had some different ideas on this so please feel free to comment/share how you were thinking about this.
Anyone else is welcome to chime in, too, obviously. This is really just intended to get the ball rolling.