Moving simultaneous comment from: #1961

**Is your feature request related to a problem? Please describe.**

The generators each support some different subset of the metamodel, and it can be tricky to know when you can rely on a generator to give you a faithful representation of a given schema. This also makes it difficult to track the propagation of changes in the metamodel to the generators - the motivating example in this case being array support. Generators should have some way to declare what features they support.

Challenges:
Some related/illustrative issues:
**Describe the solution you'd like**

A start would be to add a classvar to the generator classes, e.g.:

```python
from typing import Any, ClassVar, Literal, Optional

from pydantic import BaseModel


class Condition(BaseModel):
    type: Literal['parameter', 'etc']
    key: str
    value: Any


class Feature(BaseModel):
    when: Optional[Condition] = None


class ArrayFeature(Feature):
    anyshape: bool = False
    labeled: bool = False


class GeneratorSupports(BaseModel):
    arrays: bool | ArrayFeature | list[ArrayFeature] = False
```

where we might allow something like

```python
class PydanticGenerator:
    supports: ClassVar[GeneratorSupports] = GeneratorSupports(
        arrays=[
            ArrayFeature(
                when={'type': 'parameter', 'key': 'array_representation', 'value': 'Numpydantic'},
                anyshape=True,
                # ...
            ),
            # ...
        ]
    )
```
or we could just flatten the whole thing out - might be easier to start with that since it would be simpler (a rough sketch of the flattened version is at the end of this comment). Then we would be able to simplify all the special casing in the generators.

**How important is this feature?**

Let's call this "would make our lives easier, but would require a decent amount of refactoring".
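A minimal sketch of what the flattened alternative might look like - all field names here are invented for illustration and are not an existing API:

```python
from typing import ClassVar

from pydantic import BaseModel


class GeneratorSupports(BaseModel):
    # Flattened variant: one flag per feature, no nested Feature objects
    arrays: bool = False
    anyshape_arrays: bool = False
    labeled_arrays: bool = False
    any_of: bool = False


class PydanticGenerator:
    # Each generator declares its support up front so tooling can introspect it
    supports: ClassVar[GeneratorSupports] = GeneratorSupports(
        arrays=True,
        anyshape_arrays=True,
    )
```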
---
The concepts of metamodel profiles and monotonicity are linked, but first some general background. This is intended for people thinking deeply about the framework.
Currently the LinkML metamodel has a semi-formal notion of profiles; these are listed on the metamodel site, e.g. https://linkml.io/linkml-model/latest/docs/BasicSubset/

These are currently represented as subsets in the metamodel, and their uses are primarily for documentation. The goal has always been to use these programmatically: to formally define profiles that can be used to declaratively answer, for a piece of linkml tooling T, the question "does T support profile P?", and by extension "are there features of my schema that are not supported by tool T?".
For example, if I am using LinkML to describe a standard that is used in a stack that also uses protobuf, I would like to know if there are parts of my schema that cannot be represented in protobuf, so I can act accordingly (e.g. limit the expressivity of my schema, or modularize things such that the additional expressivity is an add-on and not required).
When thinking about target frameworks it's not a simple binary yes/no answer for whether a feature is supported. Broadly there is a spectrum:
For example, if you use `any_of` to express a variety of ranges, this cannot be directly expressed in the pure relational model. If you use `is_a`, it cannot be directly expressed in many target frameworks (strict relational, jsonschema, ...).

However, in some cases there may be an entailment-preserving transform `T` that maps the schema to something that is expressible. This is exemplified by relmodel_transformer, which performs well-understood rewrites to something that is expressible; e.g. `multivalued` into additional classes with backlinks (see the sketch after this paragraph). These transforms are not restricted to any one generator such as sqlddl. logical_model_transformer implements standard logic rewrites to obtain a "normal form" schema, where hierarchies are translated to `all_of` and simplification rules are applied to reach normal form (unsatisfiable classes are also detected this way).

So each of the options 1-3 above has sub-options for direct vs indirect.
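To make the `multivalued` rewrite concrete, here is a hedged sketch of the general idea; the class and slot names are invented, and the actual output of relmodel_transformer may differ in detail:

```yaml
# Hypothetical source schema: a multivalued slot
classes:
  Person:
    attributes:
      aliases:
        range: string
        multivalued: true
```

which might be rewritten, in the spirit of relmodel_transformer, to something like:

```yaml
# The multivalued slot becomes its own class with a backlink
classes:
  Person: {}            # unchanged apart from losing the multivalued slot
  PersonAlias:
    attributes:
      person:           # backlink to the owning Person
        range: Person
      alias:
        range: string
```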
It gets more nuanced. In some cases a transformation may be possible, but still undesirable as it may introduce non-isomorphism between frameworks. Consider again the case of mapping to the pure relational model. The rewrite for mapping `multivalued` to a backreference on a linking table is transparent to a user; they can use the generated sqlalchemy classes as if they were (largely) the same as classes generated by pydanticgen (because SQLA has a mechanism for equivalent rewrites). However, in contrast, if our schema has a slot `s` with a range `int | string`, this could be rewritten as two slots (`s_int` and `s_string`); however, AFAIK this mapping "leaks out" and the object model is no longer isomorphic with the pydanticgen one.

This brings us back to monotonicity. Most frameworks can only express - directly or indirectly - a subset of the LinkML metamodel. It is acceptable for a generator to implement an incomplete mapping, but it should never implement an invalid mapping; it should produce no new entailments.
A consequence of this is that there should be no overrides. Being aware of this helps when mentally reasoning about some LinkML behavior.

For example, people coming to the language may end up advertently or inadvertently doing something like this:
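(A hypothetical illustration; the slot and class names are invented:)

```yaml
slots:
  subject:
    range: string          # e.g. a default or inherited range
    any_of:                # intended - wrongly - to *replace* the range
      - range: Person
      - range: Organization
```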
The interpretation they want is that the `any_of` takes precedence. However, this is non-monotonic, i.e. it introduces overrides. To see why ruling out overrides is desirable, consider a generator that can't directly translate the `any_of` construct. It would still interpret the `range` construct, but the result would be more restricted than the intended model.

The correct interpretation of the above is that all constraints are applied, and as a result the slot is unsatisfiable, and should be reported as such.
Instead, what the user should do here is layer on constraints, e.g.:
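(Again a hypothetical illustration with invented names - here the `range` is a common parent class, and `any_of` narrows it rather than contradicting it:)

```yaml
slots:
  subject:
    range: NamedThing      # a common ancestor of the allowed classes
    any_of:
      - range: Person
      - range: Organization
```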
In this case a generator that does not translate `any_of` still produces a target representation that is valid; it is just less complete.

Ultimately we want a more declarative way to represent the profiles supported by generators, such that this is more transparent. The simplest way to do this is via a feature matrix of language construct x generator, where the values are from an enum such as DirectSupport, Mappable, Coerced, Ignored, ...
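A minimal sketch of what such a matrix could look like in code; the enum members follow the names above, but the matrix entries are placeholders, not actual generator capabilities:

```python
from enum import Enum


class SupportLevel(str, Enum):
    DIRECT_SUPPORT = "DirectSupport"  # construct maps directly to the target
    MAPPABLE = "Mappable"             # needs an entailment-preserving rewrite
    COERCED = "Coerced"               # mapped, but with some loss or leakage
    IGNORED = "Ignored"               # silently dropped


# Hypothetical feature matrix: construct x generator (values are placeholders)
FEATURE_MATRIX: dict[str, dict[str, SupportLevel]] = {
    "any_of": {
        "pydanticgen": SupportLevel.DIRECT_SUPPORT,
        "sqlddlgen": SupportLevel.IGNORED,
    },
    "multivalued": {
        "pydanticgen": SupportLevel.DIRECT_SUPPORT,
        "sqlddlgen": SupportLevel.MAPPABLE,
    },
}
```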
The feature matrix approach is essentially what https://github.com/orgs/linkml/discussions/1549 does. However, it uses a strategy of pytest combinatorics plus programmatic X-ing out of the matrix, with results written in a format that could be used to generate a website showing the matrix. This has some advantages but is ultimately a bit unsatisfying, since really we want the generators themselves to be introspectable as to what they support. I think we can move towards this strategy incrementally.
This is what I am thinking, but this may be overly complicating things (a rough sketch is below): generators would reference a baseline profile and declare only their deviations from it. This maximizes DRY - no need to maintain large feature matrix profiles for each generator, just maintain where support deviates from the profile.
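A hedged sketch of that shape, reusing the pydantic classvar idea from the first comment; the class names, profile name, and field names are all invented for illustration:

```python
from typing import ClassVar

from pydantic import BaseModel


class SupportDeviation(BaseModel):
    construct: str   # e.g. "any_of", or a combination such as "multivalued arrays"
    level: str       # e.g. "Mappable", "Ignored"
    note: str = ""


class GeneratorProfile(BaseModel):
    # Baseline metamodel profile the generator targets, plus only its deltas
    baseline: str                            # e.g. "BasicSubset"
    deviations: list[SupportDeviation] = []


class SomeGenerator:
    supports: ClassVar[GeneratorProfile] = GeneratorProfile(
        baseline="BasicSubset",
        deviations=[
            SupportDeviation(
                construct="any_of",
                level="Ignored",
                note="not translated; output is valid but less complete",
            ),
        ],
    )
```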
The fact that some generators are parametrizable complicates the picture a bit. E.g. sqlddlgen supports more features when pg is the target database than when sqlite is. OWL-DL can't support `Any` without punning unless the user chooses `type-objects`. But I think here we assume the maximal profile and allow individual options to document their own exceptions.

Finally, when I say "constructs" I don't just mean individual metamodel elements. If you look at the existing compliance tests, there are combinations of elements that constitute an individual feature, so we have to account for that.