Requirements for a database interoperation layer #1818

cmungall (Maintainer) · 2024-01-09T16:58:28Z

LinkML provides a way of specifying the structure and semantics of data without committing to a particular technology or serialization format. It has been successfully integrated into different architectures that variously involved MongoDB, Neo4j, PostgreSQL, triplestores, etc. However, many of those projects still rely on project-specific plumbing.

Would it make sense to generalize this into a common CRUD abstraction layer that would support:

retrieval by ID
retrieval by query (e.g. simple mongo/chroma style query language)
views
delete
update
migrations
validation
etc.

with bindings for different backends (including a simple in-memory one)?
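
To make the question concrete, here is a minimal sketch of what such a layer could look like, assuming plain dicts as records. The names `Store` and `InMemoryStore` and all method signatures are hypothetical, not an existing LinkML or curategpt API, and views, migrations and validation are left out.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class Store(ABC):
    """Hypothetical backend-neutral interface (not an existing LinkML API)."""

    @abstractmethod
    def insert(self, obj: Record) -> None: ...

    @abstractmethod
    def get(self, obj_id: str) -> Optional[Record]: ...

    @abstractmethod
    def find(self, where: Record) -> List[Record]: ...

    @abstractmethod
    def update(self, obj_id: str, changes: Record) -> None: ...

    @abstractmethod
    def delete(self, obj_id: str) -> bool: ...


class InMemoryStore(Store):
    """Simplest possible binding: a dict keyed by identifier."""

    def __init__(self, id_key: str = "id") -> None:
        self.id_key = id_key
        self._rows: Dict[str, Record] = {}

    def insert(self, obj: Record) -> None:
        self._rows[obj[self.id_key]] = dict(obj)

    def get(self, obj_id: str) -> Optional[Record]:
        row = self._rows.get(obj_id)
        return dict(row) if row is not None else None

    def find(self, where: Record) -> List[Record]:
        # Equality-only matching, loosely in the spirit of a mongo-style filter.
        return [
            dict(row)
            for row in self._rows.values()
            if all(row.get(k) == v for k, v in where.items())
        ]

    def update(self, obj_id: str, changes: Record) -> None:
        self._rows[obj_id].update(changes)  # KeyError if the id is unknown

    def delete(self, obj_id: str) -> bool:
        return self._rows.pop(obj_id, None) is not None


# Illustrative data only
store = InMemoryStore()
store.insert({"id": "P1", "name": "Alice", "age": 33})
store.insert({"id": "P2", "name": "Bob", "age": 33})
store.update("P2", {"age": 34})
print(store.find({"age": 33}))
```

A MongoDB or SQL binding would implement the same five methods, so calling code would only ever depend on `Store`.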

This would not sit in the linkml core but would be a purely optional additional layer, e.g. linkml-store

In fact, curategpt already does this, with a virtual store layer and a chromadb implementation (https://github.com/monarch-initiative/curate-gpt/tree/main/src/curate_gpt/store); it would be easy to make this a separate module. The scope would need some thought: curategpt needs vector embeddings, but this could be generalized to a general index structure.

WolfgangFahl · 2024-01-20T15:21:37Z

Separation of concerns is IMHO a good idea. In my software architecture training I use a classical file card cabinet as an example architecture (picture from https://commons.wikimedia.org/wiki/File:File_cards_(8649760006).jpg). There you have cabinets per class/entity and cards per record/instance. The main interfaces are:

factory
query
collection

And if these interfaces can be made compatible with your local storage technology we are all set.

For queries I am a fan of named queries, which avoid specifying the actual native query text and again aim for compatibility. Otherwise a simple find_by_id() and find_by_key_value() interface already goes a long way.
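
A small sketch of how these two ideas could fit together, under the assumption that records are plain dicts. The `Collection` class, `register_query` and `run_query` are made-up names for illustration; only `find_by_id()` and `find_by_key_value()` are taken from the comment above. Here a named query is a Python callable; a SQL or SPARQL binding would presumably keep native query text under the same name.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]
Query = Callable[[List[Record]], List[Record]]


@dataclass
class Collection:
    """One "cabinet" of records for a single entity type (hypothetical sketch)."""

    id_key: str = "id"
    records: List[Record] = field(default_factory=list)
    named_queries: Dict[str, Query] = field(default_factory=dict)

    def find_by_id(self, record_id: Any) -> Optional[Record]:
        for record in self.records:
            if record.get(self.id_key) == record_id:
                return record
        return None

    def find_by_key_value(self, key: str, value: Any) -> List[Record]:
        return [r for r in self.records if r.get(key) == value]

    def register_query(self, name: str, query: Query) -> None:
        # Callers only ever see the name, never the backend-specific query.
        self.named_queries[name] = query

    def run_query(self, name: str) -> List[Record]:
        return self.named_queries[name](self.records)


# Illustrative data only
people = Collection(
    records=[
        {"id": "P1", "name": "Alice", "city": "Berlin"},
        {"id": "P2", "name": "Bob", "city": "Paris"},
        {"id": "P3", "name": "Carol", "city": "Berlin"},
    ]
)
people.register_query(
    "berliners", lambda rows: [r for r in rows if r["city"] == "Berlin"]
)
print(people.find_by_id("P2"))
print(people.run_query("berliners"))
```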

sneakers-the-rat (Collaborator) · 2024-01-24T00:01:06Z

I'm arguably writing something like this rn, and the idea of a generalized tool seems nice. Maybe as a scoping/feasibility exercise it would be worth comparing the existing SQLAlchemy generator to e.g. something like this for triple stores (haven't had much time to work on this yet, so WIP):
https://github.com/p2p-ld/pydantigraph/blob/main/examples/model_basics.ipynb

Seems also related to: https://github.com/orgs/linkml/discussions/1820
If we were able to preserve the LinkML representation in generated pydantic classes, we could virtualize operations for different DBs in a base model. Pydantic 2 adds a bunch of lovely tools for this, e.g. see how I am adding functions for a special array type with validation and serialization here: https://github.com/p2p-ld/nwb-linkml/blob/4ee97263ed4338a0cf76c19c23553038f85a9eae/nwb_linkml/src/nwb_linkml/types/ndarray.py#L8

And another example from a previous life: I didn't manage to get this on its feet, but e.g. see how I was encoding extra metadata for how a given model field should behave with a particular serialization here:
https://github.com/auto-pi-lot/autopilot/blob/d7b3f7fa728ab7f10b4d3a69689687ef1994ec82/autopilot/tasks/nafc.py#L95

So currently all of these ORM tools do something that is unique to their underlying DB: SQLAlchemy and SQLModel use specific base classes, types, and fields, and need extra fields for e.g. relations. But if one were to make an abstract LinkML Field class (or, in pydantic 2, this would probably be a special annotated type) that held all that metadata, then we would be able to do that virtualization to the various DB backends at runtime.
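
A rough sketch of the "special annotated type" idea, using only the standard library's `typing.Annotated` rather than pydantic itself. `SlotInfo`, `slot_metadata`, `identifier_slot`, `to_predicates` and the `Person` class are all hypothetical names for illustration; the point is just that backend-neutral metadata can ride along on the type annotation and be read back at runtime.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, Optional, get_type_hints


@dataclass(frozen=True)
class SlotInfo:
    """Hypothetical carrier for slot metadata (not a LinkML or pydantic class)."""

    slot_uri: Optional[str] = None
    identifier: bool = False


@dataclass
class Person:
    id: Annotated[str, SlotInfo(slot_uri="schema:identifier", identifier=True)]
    name: Annotated[str, SlotInfo(slot_uri="schema:name")]
    age: Optional[int] = None  # no metadata attached


def slot_metadata(cls: type) -> Dict[str, SlotInfo]:
    """Collect the SlotInfo attached to each annotated field of cls."""
    result: Dict[str, SlotInfo] = {}
    for name, hint in get_type_hints(cls, include_extras=True).items():
        for extra in getattr(hint, "__metadata__", ()):
            if isinstance(extra, SlotInfo):
                result[name] = extra
    return result


def identifier_slot(cls: type) -> Optional[str]:
    """A relational backend could use this to pick the primary key column."""
    for name, info in slot_metadata(cls).items():
        if info.identifier:
            return name
    return None


def to_predicates(obj: Any) -> Dict[str, Any]:
    """A triple-store backend could use this to map fields to predicates."""
    return {
        info.slot_uri: getattr(obj, name)
        for name, info in slot_metadata(type(obj)).items()
        if info.slot_uri
    }


print(identifier_slot(Person))
print(to_predicates(Person(id="P1", name="Alice")))
```

The model class itself stays free of any database base class; each backend reads the same metadata and decides what to do with it.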

Another strategy: most ORM tools have some notion of an engine or a session, so we could put the logic for translating to different DB backends there.
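
As a sketch of that second strategy, assuming models are plain dataclasses: the model knows nothing about storage, and a session delegates to whichever engine it was given. `Engine`, `DictEngine`, `SqliteEngine` and `Session` are hypothetical names, and storing each object as a JSON blob in SQLite is only a stand-in for a real schema-aware mapping.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Protocol


class Engine(Protocol):
    """What a backend has to provide (hypothetical interface)."""

    def save(self, kind: str, row: Dict[str, Any]) -> None: ...

    def load_all(self, kind: str) -> List[Dict[str, Any]]: ...


class DictEngine:
    """In-memory backend: one list of rows per class name."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Dict[str, Any]]] = {}

    def save(self, kind: str, row: Dict[str, Any]) -> None:
        self.tables.setdefault(kind, []).append(dict(row))

    def load_all(self, kind: str) -> List[Dict[str, Any]]:
        return [dict(row) for row in self.tables.get(kind, [])]


class SqliteEngine:
    """SQLite backend that stores each object as a JSON document."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS objects (kind TEXT, body TEXT)"
        )

    def save(self, kind: str, row: Dict[str, Any]) -> None:
        self.conn.execute(
            "INSERT INTO objects (kind, body) VALUES (?, ?)",
            (kind, json.dumps(row)),
        )

    def load_all(self, kind: str) -> List[Dict[str, Any]]:
        cursor = self.conn.execute(
            "SELECT body FROM objects WHERE kind = ? ORDER BY rowid", (kind,)
        )
        return [json.loads(body) for (body,) in cursor]


class Session:
    """Translates between model instances and engine rows."""

    def __init__(self, engine: Engine) -> None:
        self.engine = engine

    def add(self, obj: Any) -> None:
        self.engine.save(type(obj).__name__, asdict(obj))

    def all(self, cls: type) -> List[Any]:
        return [cls(**row) for row in self.engine.load_all(cls.__name__)]


@dataclass
class Person:
    id: str
    name: str


session = Session(SqliteEngine())
session.add(Person(id="P1", name="Alice"))
print(session.all(Person))
```

Swapping `SqliteEngine()` for `DictEngine()` changes nothing in the calling code, which is roughly the property being asked for.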

I think it's sort of a question of "where" these models live. Currently the strategy with the generators is to have several different types of models for different applications, and that is nice, but it's a lot of maintenance to keep feature parity between them. This seems like a way to consolidate some of those generators and the dumper/loader functionality into a single package and maybe reduce the sprawl a bit. It could also help with documentation, because the current docs on how to actually use data with LinkML are sort of scattered (understandably! There's a lot there!), and that is a real hindrance to LinkML being seen as a useful runtime tool rather than just a schema modeling tool (which it is! Once I made perf adjustments to schemaview and pydanticgen, it became quite tractable to generate models on demand, see #1604).

This also lines up well with my proposed hackashop project - I'd be down to be on the team or take initial lead on this if you don't already have someone working on it.

WolfgangFahl · 2024-01-24T04:39:25Z

I am not too happy with the generated Python classes and declarations along technical dependencies. In https://github.com/WolfgangFahl/pyLoDStorage/blob/master/lodstorage/sample2.py I am experimenting with a more "pythonic" approach. It allows using standard Python declarations and annotations. I'd love to be able to add the RDF/SPARQL or Wikibase specific mapping information in a non-invasive way.

Also note the "specification by example" style

"""
Created on 2024-01-21

@author: wf
"""
from dataclasses import field
from datetime import date, datetime
from typing import List, Optional
import json
from lodstorage.yamlable import DateConvert, lod_storable

@lod_storable
class Royal:
    """
    Represents a member of the royal family, with various personal details.

    Attributes:
        name (str): The full name of the royal member.
        wikidata_id (str): The Wikidata identifier associated with the royal member.
        number_in_line (Optional[int]): The number in line to succession, if applicable.
        born_iso_date (Optional[str]): The ISO date of birth.
        died_iso_date (Optional[str]): The ISO date of death, if deceased.
        last_modified_iso (str): ISO timestamp of the last modification.
        age (Optional[int]): The age of the royal member.
        of_age (Optional[bool]): Indicates whether the member is of legal age.
        wikidata_url (Optional[str]): URL to the Wikidata page of the member.
    """

    name: str
    wikidata_id: str
    number_in_line: Optional[int] = None
    born_iso_date: Optional[str] = None
    died_iso_date: Optional[str] = None
    last_modified_iso: str = field(init=False)
    # calculated in __post_init__, so excluded from the generated __init__
    age: Optional[int] = field(init=False)
    of_age: Optional[bool] = field(init=False)
    wikidata_url: Optional[str] = field(init=False)

    def __post_init__(self):
        """
        init calculated fields
        """
        self.lastmodified = datetime.utcnow()
        self.last_modified_iso = self.lastmodified.strftime("%Y-%m-%dT%H:%M:%SZ")
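        end_date = self.died if self.died else date.today()
        # born_iso_date is optional: only derive the age when it is set
        self.age = int((end_date - self.born).days / 365.2425) if self.born else None
        self.of_age = self.age >= 18 if self.age is not None else None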
        if self.wikidata_id:
            self.wikidata_url = f"https://www.wikidata.org/wiki/{self.wikidata_id}"

    @property
    def born(self) -> date:
        """Return the date of birth from the ISO date string."""
        born_date = DateConvert.iso_date_to_datetime(self.born_iso_date)
        return born_date

    @property
    def died(self) -> Optional[date]:
        """Return the date of death from the ISO date string, if available."""
        died_date = DateConvert.iso_date_to_datetime(self.died_iso_date)
        return died_date


@lod_storable
class Royals:
    """
    Represents a collection of Royal family members.

    Attributes:
        members (List[Royal]): A list of Royal family members.
    """

    members: List[Royal] = field(default_factory=list)

    @classmethod
    def get_samples(cls) -> dict[str, "Royals"]:
        """
        Returns a dictionary of named samples
        for 'specification by example' style
        requirements management.

        Returns:
            dict: A dictionary with keys as sample names and values as `Royals` instances.
        """
        samples = {
            "QE2 heirs up to number in line 5": Royals(
                members=[
                    Royal(
                        name="Elizabeth Alexandra Mary Windsor",
                        born_iso_date="1926-04-21",
                        died_iso_date="2022-09-08",
                        wikidata_id="Q9682",
                    ),
                    Royal(
                        name="Charles III of the United Kingdom",
                        born_iso_date="1948-11-14",
                        number_in_line=0,
                        wikidata_id="Q43274",
                    ),
                    Royal(
                        name="William, Duke of Cambridge",
                        born_iso_date="1982-06-21",
                        number_in_line=1,
                        wikidata_id="Q36812",
                    ),
                    Royal(
                        name="Prince George of Wales",
                        born_iso_date="2013-07-22",
                        number_in_line=2,
                        wikidata_id="Q13590412",
                    ),
                    Royal(
                        name="Princess Charlotte of Wales",
                        born_iso_date="2015-05-02",
                        number_in_line=3,
                        wikidata_id="Q18002970",
                    ),
                    Royal(
                        name="Prince Louis of Wales",
                        born_iso_date="2018-04-23",
                        number_in_line=4,
                        wikidata_id="Q38668629",
                    ),
                    Royal(
                        name="Harry Duke of Sussex",
                        born_iso_date="1984-09-15",
                        number_in_line=5,
                        wikidata_id="Q152316",
                    ),
                ]
            )
        }
        return samples
    
@lod_storable
class Country:
    """
    Represents a country with its details.

    Attributes:
        name (str): The name of the country.
        country_code (str): The country code.
        capital (Optional[str]): The capital city of the country.
        timezones (List[str]): List of timezones in the country.
        latlng (List[float]): Latitude and longitude of the country.
    """
    name: str
    country_code: str
    capital: Optional[str] = None
    timezones: List[str] = field(default_factory=list)
    latlng: List[float] = field(default_factory=list)

@lod_storable
class Countries:
    """
    Represents a collection of country instances.

    Attributes:
        countries (List[Country]): A list of Country instances.
    """
    countries: List[Country]
    
    @classmethod
    def get_countries_erdem(cls) -> "Countries":
        """
        Get Erdem Ozkol's country list.
        """
        countries_json_url = "https://gist.githubusercontent.com/erdem/8c7d26765831d0f9a8c62f02782ae00d/raw/248037cd701af0a4957cce340dabb0fd04e38f4c/countries.json"
        json_str = cls.read_from_url(countries_json_url)
        countries_list = json.loads(json_str)
        countries_dict = {"countries": countries_list}
        instance = cls.from_dict(countries_dict)
        return instance
        
    @classmethod
    def get_samples(cls) -> dict[str, "Countries"]:
        """
        Returns a dictionary of named samples
        for 'specification by example' style
        requirements management.

        Returns:
            dict: A dictionary with keys as sample names 
            and values as `Countries` instances.
        """
        samples = {
            "country list provided by Erdem Ozkol":
            cls.get_countries_erdem()
        }
        return samples
        
class Sample:
    """
    Sample dataset provider
    """

    @staticmethod
    def get(dataset_name: str):
        """
        Get the given sample dataset name
        """
        if dataset_name == "royals":
            samples = Royals.get_samples()
        elif dataset_name == "countries":
            samples = Countries.get_samples()
        else:
            raise ValueError("Unknown dataset name")
        return samples

I am in the process of writing a LinkML generator based on this: see https://github.com/WolfgangFahl/pyLoDStorage/blob/master/lodstorage/linkml_gen.py. If you are interested in the details, give this comment positive feedback and I'll open a new discussion.

cmungall (Maintainer, Author) · 2024-05-27T02:35:09Z

See:

https://github.com/linkml/linkml-store

Discussed on the last community call (slides here: https://docs.google.com/presentation/d/e/2PACX-1vSgtWUNUW0qNO_ZhMAGQ6fYhlXZJjBNMYT0OiZz8DDx8oj7iG9KofRs6SeaMXBBOICGknoyMG2zaHnm/embed?start=false&loop=false&delayms=3000&slide=id.g1d28fec4213_0_150)
