Count me in the data-first camp. I think blurring the division of labor between the schema (which should not know about its realization in different generated formats or in data) and the data (which ideally shouldn't need to know or care whether it is structured by a LinkML schema or not, though that's less of a firm boundary to me) would limit the flexibility of both, and it could get muddy quickly. The first thing that comes to mind is that even for a small dataset you probably want to use a schema definition more than once, and then naming and reuse between schema definitions and data becomes a real challenge.

From what I can tell from the dumpers and loaders, there isn't really a "linkml data format" per se? The examples I have seen are JSON/YAML/CSV files that are all intended to map onto a single container class. I think this is an opportunity to get both: a tidy data format and a tidy schema. Ideally we would be able to use YAML files that carry both schema and data. The main thing I think is missing is a way to map a LinkML type to YAML data that's standard across LinkML. I have been hacking around this and haven't arrived at something I'm satisfied with, but here are a few desiderata.
So say we start with a header as the single key that we reserve in a "linkml dataset"; let's call it `meta`:
```yaml
meta:
  imports:
    - linkml:types
  is_a: linkml:dataset
  id: my_dataset
  slots:
    my_slot:
      range: string
    another_slot:
      range: integer
    third_slot:
      range: float
    an_array:
      range: float
      array:
        max_dimensions: 4
  classes:
    MyClass:
      slots:
        - my_slot
        - an_array
    ChildClass1:
      is_a: MyClass
      slots:
        - another_slot
    ChildClass2:
      is_a: MyClass
      slots:
        - third_slot
    AnotherClass:
      attributes:
        an_attribute:
          range: MyClass
```

So then our dataset can use that schema directly, but since there are multiple classes in play we need to indicate which class each entry belongs to. Say we use some of those fancy YAML 1.2 features and annotate the type... (I've only ever had really broken experiences trying to directly instantiate objects this way, so we'll just treat it as metadata that our loader will handle):
```yaml
meta:
  # (as above)
some_data: !MyClass
  my_slot: "some data"
more_data: !ChildClass1
  my_slot: "data again"
  another_slot: 1
```

If you hate that, in all these cases imagine it as equivalent to:
```yaml
more_data:
  _type: ChildClass1
```

Great, that's fine so far, but what about ambiguous cases like:
```yaml
meta:
  # (as above)
third_data: !AnotherClass
  an_attribute: !ChildClass2
    my_slot: "again data"
    third_slot: 2
```

OK, what about files? This is the point where it would be necessary to do some imports, since ideally (a) we don't have a fixed set of formats that can be put in a dataset, but instead make it possible for people to use whatever they have, and yet (b) we also don't want coupling so loose that loading is left entirely up to the consumer, so we would want a plugin system. Assume that takes the form of a metaschema that specifies a loader interface. Now we can take advantage of how imports work:
```yaml
meta:
  imports:
    - from: example.com/matlab-loader
      as: mat
  slots:
    my_slot:
      # ... rest as above
some_data: !MyClass
  an_array: !mat.MatFile
    file: ./my_data.mat
    selector: whatever{3}.screwy(2).matlab.syntax
```
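To be concrete, the thing living at example.com/matlab-loader would itself just be a schema conforming to that loader metaschema. A rough sketch of what it might declare (the class, slot, and annotation names here are guesses, not an existing plugin):

```yaml
# hypothetical matlab-loader plugin schema; all names are illustrative
id: example.com/matlab-loader
name: matlab_loader
classes:
  MatFile:
    description: Reference to data stored in a MATLAB .mat file
    attributes:
      file:
        range: string     # path to the .mat file, relative to the dataset
      selector:
        range: string     # MATLAB-style expression selecting data within the file
    annotations:
      loader: matlab      # hypothetical hook telling the runtime which loader resolves this class
```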
Then say later we want to reuse that schema in another dataset... we could split the schema part out into a third file and import it from both, or maybe we could just import directly from that dataset:

```yaml
meta:
  imports:
    - from: ./my_dataset.yaml
      include:
        - MyClass
        - AnotherClass
        - name: some_data
          as: external_data
new_dataset: !AnotherClass
  an_attribute: external_data
```

Anyway, this is getting long. I have more sketches about instantiating more complex objects from nested files, ids, etc., but just to say: if we do the extended import and namespacing syntax and then have a ...
---
In LinkML there is a clean separation of schema and data (of course, schema is just data conforming to the metamodel).
This makes it easy to create reusable schemas or standards that can be applied across multiple datasets.
However, sometimes it's convenient to have a bespoke schema for an individual dataset. These aren't typically referred to as schemas or data models, but perhaps as data dictionaries.
There are a number of frameworks that bundle these together.
Of course this use case is already supported in LinkML, in that you can always define a schema YAML for a one-off dataset whose data lives in CSV, JSON, or whatever. But it might be more convenient if these could be bundled into one file (no cheating: not a tar file). It might also make it easier to map to the above frameworks.
Complicating the picture, these other frameworks sometimes contain ways of describing the dataset itself - how it was collected, who collected it. Sometimes this is generic, sometimes domain specific.
There are a few approaches here, depending on whether we consider the composed representation schema-first or data-first.
**schema-first**
In a schema-first approach, we would simply allow the schema to be the vehicle for dataset distribution. We could add slots to classes to indicate the source of the data (like csvw), e.g. something along the lines of the sketch below.
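A minimal sketch of what this could look like; `source_data` stands in for a hypothetical new metamodel slot, roughly analogous to csvw's `url` on a table description:

```yaml
id: https://example.org/my_dataset
name: my_dataset
imports:
  - linkml:types
default_range: string

classes:
  Person:
    attributes:
      name:
        range: string
      age:
        range: integer
    # hypothetical slot on class definitions pointing at the data
    # that instances of this class come from, csvw-style
    source_data: ./persons.csv
```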
This would be generic.
If we wanted to capture additional metadata about the dataset itself then this could go in the schema metadata, but this mixes concerns a bit: we obviously want to keep the schema metadata generic and not introduce domain-specific modeling. But this could be done with annotations, e.g. something like the sketch below.
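A sketch using the existing annotations mechanism to carry dataset-level metadata on the schema itself; the annotation tags (`collected_by`, etc.) are made up for illustration:

```yaml
id: https://example.org/my_dataset
name: my_dataset
description: Schema plus metadata for a one-off dataset
annotations:
  collected_by: Jane Doe            # domain-specific; not part of the metamodel
  collection_date: "2023-05-01"
  collection_method: field survey

classes:
  Observation:
    attributes:
      site:
        range: string
      measurement:
        range: float
```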
The schema of the annotations would be encoded by a meta-schema; see https://linkml.io/linkml/schemas/annotations.html#validation-of-annotations
A variant of this is that the dataset schema inherits from the LinkML metamodel.
**data-first**
With this approach every community would define their own dataset schema (hopefully extending dcat/schema.org/d4d/etc.), but we would have a mechanism for saying that parts of the data map onto a schema, e.g. something like the sketch below.
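A rough sketch of such a community-defined dataset file; only the `uri_prefixes` and `fields` keys correspond to the mapping described next, and the other keys are illustrative domain metadata:

```yaml
# hypothetical community dataset format (data-first)
id: my_field_survey
title: Example dataset
collected_by: Some Lab            # domain-specific metadata defined by the community schema
uri_prefixes:                     # would map to linkml:prefixes
  ex: https://example.org/
fields:                           # would map to linkml:attributes
  sample_id:
    range: string
  temperature:
    range: float
data:
  - sample_id: A1
    temperature: 21.5
```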
The schema for this dataset would map `uri_prefixes` to `linkml:prefixes` and `fields` to `linkml:attributes`.
A variant of this is for the dataset schema to import the metamodel, allowing reuse of its components.