Count me in the data-first camp. I think blurring the division of labor between the schema (which should not know about its realization in different generated formats or in data) and the data (which ideally shouldn't need to know or care whether it is structured by a LinkML schema or not, though that's less of a firm boundary to me) would limit the flexibility of both, and it could get muddy quickly. The first thing that comes to mind is that even for a small dataset you probably want to use a schema definition more than once, and then naming and reuse between schema definitions and data becomes a real challenge.

From what I can tell from the dumpers and loaders, there isn't really a "linkml data format" per se? The examples I have seen are JSON/YAML/CSV files that are all intended to map onto a single container class. I think this is an opportunity to get both: a tidy data format and a tidy schema. Ideally we would be able to use YAML files that carry both schema and data. The main thing I think is missing is a way to map a LinkML type to YAML data that's standard across LinkML. I have been hacking around this and haven't arrived at something I'm satisfied with, but here are a few desiderata.
So say we start with a header as the single key that we reserve in a "linkml dataset"; let's call it `meta`:
```yaml
meta:
  imports:
    - linkml:types
  is_a: linkml:dataset
  id: my_dataset
  slots:
    my_slot:
      range: string
    another_slot:
      range: integer
    third_slot:
      range: float
    an_array:
      range: float
      array:
        max_dimensions: 4
  classes:
    MyClass:
      slots:
        - my_slot
        - an_array
    ChildClass1:
      is_a: MyClass
      slots:
        - another_slot
    ChildClass2:
      is_a: MyClass
      slots:
        - third_slot
    AnotherClass:
      attributes:
        an_attribute:
          range: MyClass
```

So then our dataset can use that schema directly, but since there are multiple classes in play we need to indicate which class each entry belongs to. Say we use some of those fancy YAML 1.2 features and annotate the type... (I've only ever had really broken experiences trying to directly instantiate objects this way, so we'll just treat it as metadata that our loader will handle):
```yaml
meta:
  # (as above)
some_data: !MyClass
  my_slot: "some data"
more_data: !ChildClass1
  my_slot: "data again"
  another_slot: 1
```

If you hate that, in all these cases imagine it as equivalent to:
```yaml
more_data:
  _type: ChildClass1
```

Great, that's fine so far, but what about ambiguous cases like:
```yaml
meta:
  # (as above)
third_data: !AnotherClass
  an_attribute: !ChildClass2
    my_slot: "again data"
    third_slot: 2
```

OK, what about files? This is the point where it would be necessary to do some imports, since ideally (a) we don't have a fixed set of formats that can be put in a dataset, but instead make it possible for people to use whatever they have, and yet (b) we also don't want coupling so loose that loading is left entirely up to the consumer, so we would want a plugin system. Assume that takes the form of a metaschema that specifies a loader interface. Now we can take advantage of how imports work:
```yaml
meta:
  imports:
    - from: example.com/matlab-loader
      as: mat
  slots:
    my_slot:
      # ... rest as above
some_data: !MyClass
  an_array: !mat.MatFile
    file: ./my_data.mat
    selector: whatever{3}.screwy(2).matlab.syntax
```
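To be concrete, the thing living at example.com/matlab-loader would itself just be a schema conforming to that loader metaschema. A rough sketch of what it might declare (the class, slot, and annotation names here are guesses, not an existing plugin):

```yaml
# hypothetical matlab-loader plugin schema; all names are illustrative
id: example.com/matlab-loader
name: matlab_loader
classes:
  MatFile:
    description: Reference to data stored in a MATLAB .mat file
    attributes:
      file:
        range: string     # path to the .mat file, relative to the dataset
      selector:
        range: string     # MATLAB-style expression selecting data within the file
    annotations:
      loader: matlab      # hypothetical hook telling the runtime which loader resolves this class
```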
Then say later we want to reuse that schema in another dataset... we could split the schema part out into a third file and import it from both, or maybe we could just import directly from that dataset:

```yaml
meta:
  imports:
    - from: ./my_dataset.yaml
      include:
        - MyClass
        - AnotherClass
        - name: some_data
          as: external_data
new_dataset: !AnotherClass
  an_attribute: external_data
```

Anyway, this is getting long. I have more sketches about instantiating more complex objects from nested files, ids, etc., but just to say: if we do the extended import and namespacing syntax and then have a ...
---
In LinkML there is a clean separation of schema and data (of course, schema is just data conforming to the metamodel).
This makes it easy to create reusable schemas or standards that can be applied across multiple datasets.
However, sometimes it's convenient to have a bespoke schema for an individual dataset. These aren't typically referred to as schemas or data models, but perhaps as data dictionaries.
There are a number of frameworks that bundle these together.
Of course this use case is already supported in LinkML, in that you can always define a schema YAML for a one-off dataset whose data lives in CSV, JSON, or whatever. But it might be more convenient if these could be bundled into one file (no cheating: not a tar file). It might also make it easier to map to the above frameworks.
Complicating the picture, these other frameworks sometimes contain ways of describing the dataset itself - how it was collected, who collected it. Sometimes this is generic, sometimes domain specific.
There are a few approaches here, depending on whether we consider the composed representation schema-first or data-first.
**schema-first**
In a schema-first approach, we would simply allow the schema to be the vehicle for dataset distribution. We could add slots to classes to indicate the source of the data (like csvw), e.g. something along the lines of the sketch below.
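A minimal sketch of what this could look like; `source_data` stands in for a hypothetical new metamodel slot, roughly analogous to csvw's `url` on a table description:

```yaml
id: https://example.org/my_dataset
name: my_dataset
imports:
  - linkml:types
default_range: string

classes:
  Person:
    attributes:
      name:
        range: string
      age:
        range: integer
    # hypothetical slot on class definitions pointing at the data
    # that instances of this class come from, csvw-style
    source_data: ./persons.csv
```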
This would be generic.
If we wanted to capture additional metadata about the dataset itself then this could go in the schema metadata, but this mixes concerns a bit: we obviously want to keep the schema metadata generic and not introduce domain-specific modeling. But this could be done with annotations, e.g. something like the sketch below.
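A sketch using the existing annotations mechanism to carry dataset-level metadata on the schema itself; the annotation tags (`collected_by`, etc.) are made up for illustration:

```yaml
id: https://example.org/my_dataset
name: my_dataset
description: Schema plus metadata for a one-off dataset
annotations:
  collected_by: Jane Doe            # domain-specific; not part of the metamodel
  collection_date: "2023-05-01"
  collection_method: field survey

classes:
  Observation:
    attributes:
      site:
        range: string
      measurement:
        range: float
```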
The schema of the annotations would be encoded by a meta-schema; see https://linkml.io/linkml/schemas/annotations.html#validation-of-annotations
A variant of this is that the dataset schema inherits from the LinkML metamodel.
**data-first**
With this approach every community would define their own dataset schema (hopefully extending dcat/schema.org/d4d/etc.), but we would have a mechanism for saying that parts of the data map onto a schema, e.g. something like the sketch below.
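A rough sketch of such a community-defined dataset file; only the `uri_prefixes` and `fields` keys correspond to the mapping described next, and the other keys are illustrative domain metadata:

```yaml
# hypothetical community dataset format (data-first)
id: my_field_survey
title: Example dataset
collected_by: Some Lab            # domain-specific metadata defined by the community schema
uri_prefixes:                     # would map to linkml:prefixes
  ex: https://example.org/
fields:                           # would map to linkml:attributes
  sample_id:
    range: string
  temperature:
    range: float
data:
  - sample_id: A1
    temperature: 21.5
```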
The schema for this dataset would map `uri_prefixes` to `linkml:prefixes` and `fields` to `linkml:attributes`.
A variant of this is for the dataset schema to import the metamodel, allowing reuse of its components.