Iterating on linkml-project-cookiecutter with usability feedback? #2203

sierra-moxon · 2024-07-11T18:36:12Z

sierra-moxon
Jul 11, 2024
Maintainer

During the ISMB 2024 tutorial, we noticed some teachability challenges with our linkml-project-cookiecutter. This discussion is meant to collect ideas on refactoring, replacing, or adding another cookiecutter for new users.

To get the discussion started, some feedback from the tutorial:

Our cookiecutter is excellent at demoing the capabilities of our framework. Out of the box, it helps users generate a Python development environment, provides some control over the generation of various LinkML serializations, gives the user an easy path to PyPI publishing and doc generation via GH Actions templates, and sets up a testing framework based on examples.

Depending on the level of expertise of the user, however, the resulting project does have a few gotchas:

1. We provide three ways to generate model serializations: gen-project, makefile targets, and poetry commands. We don't give users a clear idea of which kinds of work should be delegated to which of these generation strategies.
2. We assume that users will want to generate many serializations of their model when most of our flagship projects only use a few.
3. We assume that users will write Python code to test the model against test data.
4. Cruft is hard to explain.
5. We don't provide the documentation Jinja templates in the cookiecutter.
6. We provide an example schema that has to be edited. A user needs to know what to remove without removing necessary boilerplate (e.g. "classes", "slots", "imports").
7. We have cookiecutter prompts (e.g. questions about schemasheets, pypi keys, GH tokens, etc.) that are difficult for a new user to understand (and many of which they probably do not need).
8. The tests we provide tend to fail unless the user has additional understanding of the model being created.
9. It's hard to remember the list of possible generators and how to call them.

sierra-moxon · 2024-07-11T18:40:56Z

sierra-moxon
Jul 11, 2024
Maintainer Author

We started this conversation on slack, and @sneakers-the-rat had some good suggestions:

I think we should generally refactor the cli entrypoints so that they all stem off a single linkml command - we already have all the click stuff in place for this, would just need a refactor into something that could easily be backwards compatible too. so eg. the existing scripts might work like this:

list available generators

linkml generate --list

use a generator, eg. instead of gen-pydantic

linkml generate pydantic {args}

some other examples

linkml convert
linkml lint
linkml validate jsonschema

Then we could replace the cookiecutter as a separate repo with a command that does an interactive prompt by default (or else accepts args for noninteractive).

linkml init
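To make the single-entrypoint idea a bit more concrete, here is a rough sketch of how the subcommands could hang off one `linkml` command. It uses `argparse` from the standard library only to stay dependency-free; the actual refactor would presumably reuse the existing click commands. The `GENERATORS` registry and the messages returned are placeholders for illustration, not the real LinkML generator list or API.

```python
import argparse

# Hypothetical registry (name -> description); placeholder entries only.
GENERATORS = {
    "pydantic": "Pydantic models",
    "jsonschema": "JSON Schema",
    "owl": "OWL ontology",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where every tool is a subcommand of `linkml`."""
    parser = argparse.ArgumentParser(prog="linkml")
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("generate", help="run a generator")
    gen.add_argument("--list", action="store_true", dest="list_generators",
                     help="list available generators")
    gen.add_argument("generator", nargs="?", choices=sorted(GENERATORS))
    gen.add_argument("schema", nargs="?")

    sub.add_parser("init", help="set up a new project (interactive by default)")
    sub.add_parser("lint", help="lint a schema")
    return parser


def main(argv: list[str]) -> str:
    """Dispatch on the subcommand; returns a message instead of doing real work."""
    args = build_parser().parse_args(argv)
    if args.command == "generate":
        if args.list_generators:
            return "\n".join(sorted(GENERATORS))
        if args.generator is None:
            return "error: name a generator or pass --list"
        return f"would run the {args.generator} generator on {args.schema}"
    return f"would run: linkml {args.command}"
```

With this layout, `main(["generate", "--list"])` stands in for `linkml generate --list`, and the old `gen-*` scripts could stay around as thin aliases for backwards compatibility.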

I think the configuration should all be moved to a linkml.yaml file that lives at the root of a repository and corresponds to a pydantic/dataclass model.

One of the major problems with the cookiecutter is that it does too much - I think it’s probably more common that someone wants to use linkml in a relatively constrained context/within another project vs. create a totally independent project that uses all possible generators. As is, the cookiecutter basically tries to be an entire python packaging system, which is just a lot to take on.

So then we could have some project config like:

# build config for all generators
build:
  input: my_main_schema.yaml
  output: module/{generator}/{name}.{suffix}

# configure specific generators

generators:
  pydantic:
    enabled: true
    options:
      black: true
      # etc. options for all builds
    build:
    - input: my_schema.yaml
      output: module/my_models.py
      post_build: my_postbuild_script.py
      options:
        # options for this build
        meta: null
    - input: my_other_schema.yaml
      output: module/

  # eg to just use default settings
  jsonschema: 
  jsonld:
  owl:

lint:
  rules:
    recommended:
      level: error

and so on.
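As a sketch of what "corresponds to a pydantic/dataclass model" might look like, the config above could be loaded into something like the dataclasses below. The class names, the rule that per-build options override generator-wide ones, and the `py` suffix in the example are all assumptions made for illustration; none of this exists in LinkML today.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Build:
    """One input/output pair for a generator (hypothetical model)."""
    input: str
    output: str
    options: dict = field(default_factory=dict)


@dataclass
class GeneratorConfig:
    """Settings for a single generator (hypothetical model)."""
    enabled: bool = True
    options: dict = field(default_factory=dict)
    build: list[Build] = field(default_factory=list)


def resolve_output(template: str, generator: str, input_path: str, suffix: str) -> str:
    """Fill the {generator}, {name} and {suffix} placeholders of an output template."""
    name = PurePosixPath(input_path).stem  # "my_main_schema.yaml" -> "my_main_schema"
    return template.format(generator=generator, name=name, suffix=suffix)


def merged_options(gen: GeneratorConfig, build: Build) -> dict:
    """Assumed precedence: per-build options override generator-wide options."""
    return {**gen.options, **build.options}
```

For example, `resolve_output("module/{generator}/{name}.{suffix}", "pydantic", "my_main_schema.yaml", "py")` would give `module/pydantic/my_main_schema.py`.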

Then we can use more familiar idioms like linkml build / linkml build --config linkml_config.yaml to build all configured input/output pairs per generator. What I have really been wanting is linkml watch which detects when the input files have changed and regenerates the outputs.
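A `linkml watch` could be as simple as polling modification times, roughly like the sketch below. This is only one possible approach (a real implementation might prefer a filesystem-events library), and `rebuild` is a stand-in for whatever would regenerate the outputs for a changed schema.

```python
import os
import time
from typing import Callable, Iterable, Optional


def snapshot(paths: Iterable[str]) -> dict[str, int]:
    """Record the modification time (in ns) of each path that exists."""
    return {p: os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Paths that are new, or whose mtime differs, between two snapshots."""
    return sorted(p for p, mtime in after.items() if before.get(p) != mtime)


def watch(paths: list[str], rebuild: Callable[[str], None],
          interval: float = 1.0, max_polls: Optional[int] = None) -> None:
    """Poll the input files and call rebuild() for each one that changed."""
    seen = snapshot(paths)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(paths)
        for path in changed(seen, current):
            rebuild(path)
        seen = current
        polls += 1
```

The `max_polls` argument is only there so the loop can be stopped in tests; an actual command would run until interrupted.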

Having a declarative config like that would also let us do stuff like have pre-commit hooks to ensure that models are up to date, etc.
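One way a pre-commit check could work, sketched under heavy assumptions: regenerate each artifact in memory and compare it with the committed file. Here `generate` is a placeholder callable (schema text in, artifact text out), not a real LinkML function, and the input/output pairs would come from the declarative config.

```python
import sys
from pathlib import Path
from typing import Callable, Iterable


def out_of_date(schema: Path, output: Path, generate: Callable[[str], str]) -> bool:
    """True if the output is missing or differs from a fresh generation."""
    if not output.exists():
        return True
    return generate(schema.read_text()) != output.read_text()


def check(pairs: Iterable[tuple[Path, Path]], generate: Callable[[str], str]) -> int:
    """Return a hook-style exit code: 0 if everything is current, 1 otherwise."""
    stale = [str(out) for schema, out in pairs if out_of_date(schema, out, generate)]
    for path in stale:
        print(f"out of date: {path}", file=sys.stderr)
    return 1 if stale else 0
```

Comparing content rather than timestamps matters here, since file modification times are not meaningful after a git checkout.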
I think that might be helpful with onboarding new ppl - “do a pip install linkml and then a linkml --help to see what you can do!”

As we all know one of the major challenges with this whole linked data thing is that “getting and maintaining a URI is hard” - it looks like w3id is relatively easy to make a new namespace under, or if we wanted to do something similar with the linkml registry that would be cool. something like linkml register w3id {schema} {namespace} to make your schema actually have a real URI (we could make it just point to the github repo raw URI for the schema in question by default).

1 reply

sneakers-the-rat Jul 11, 2024
Collaborator

I can draft this btw, @pkalita-lbl also expressed interest in writing this so I can open up a draft on a branch and we can work on it together? I think first phase of integrating the CLI can be purely additive, and then the second phase of refactoring cookie cutter can come in at any time

sneakers-the-rat · 2024-07-11T18:51:57Z

sneakers-the-rat
Jul 11, 2024
Collaborator

@pkalita-lbl also raised the point that we might not be sure about my hunch raised above, and I think that's a good point:

I think it’s probably more common that someone wants to use linkml in a relatively constrained context/within another project vs. create a totally independent project that uses all possible generators.

Would this be a place to get a sense of that?

I wonder how many projects work in the cookiecutter mode of having schema(s) and all generated models as a relatively independent project, vs. using them in a more constrained capacity/integrating within another project? I also wonder, if the former, whether that is a desired choice (which is fine!) or a byproduct of the cookiecutter being a recommended entrypoint vs. a CLI tool for managing schemas not tied to a repo as I suggest above?

Welcome anyone to chime in here, if we've taken prior polls like this, or anyone who is currently using linkml feel free to chime in on how you use it now and what your ideal model would be re: tooling around schema and generator management :)

0 replies

sierra-moxon · 2024-07-11T20:58:50Z

sierra-moxon
Jul 11, 2024
Maintainer Author

Also - even if an organization is developing a LinkML schema as its own repo (the flagship-y schemas I contribute to are happy with an independent repo for LinkML schemas, so I would 👍 that design paradigm vs. one where the schema is incorporated into another application layer), it would be nice to know how many serializations are consumed in practice (and which ones). I feel like we could isolate generators for inclusion in gen-project and the cookiecutter vs. those that can be easily added later on as needed.

some examples:

| Project | Artifacts Used in Practice |
| --- | --- |
| Alliance | JSONSchema, Documentation |
| NMDC | YAML itself, JSONSchema, Documentation, gen-LinkML, Python, prefix-map |
| Biolink | YAML itself, Documentation, OWL, prefix-map |

1 reply

sneakers-the-rat Jul 12, 2024
Collaborator

good point :) and ideally we don't have to choose, in your build config you can just arbitrarily refer to any generator that exists and give it parameters and inputs/outputs, so adding/removing formats is just modifying the config file, as opposed to needing to keep a cookiecutter template in sync with main repo.

basically just having one of these: https://github.com/linkml/linkml-project-cookiecutter/blob/main/%7B%7Bcookiecutter.project_name%7D%7D/config.yaml

without needing one of these: https://github.com/linkml/linkml-project-cookiecutter/blob/main/%7B%7Bcookiecutter.project_name%7D%7D/Makefile

dalito · 2024-07-11T21:26:20Z

dalito
Jul 11, 2024
Collaborator

On point 4: I found copier nicer to work with than the combo of cookiecutter and cruft. At the time of the comparison cruft did not work at all on Windows (this is no longer a problem). But having one less tool to install/understand may be good.

1 reply

pkalita-lbl Jul 18, 2024
Maintainer

I think we should seriously consider this if for no other reason than it seems that cruft may not be actively maintained anymore (no commits for about a year as of writing this).

sneakers-the-rat · 2024-07-18T21:51:04Z

sneakers-the-rat
Jul 18, 2024
Collaborator

OK lemme take an accounting here of what all the cookiecutter does, and maybe that helps shape what form we want next version to look like:

Makes python package
- Declare package metadata (license, description, etc.)
- Create documentation from docs generator
- Poetry
  - dynamic versioning
- Deps
  - a very old version of linkml
  - mkdocs-material
- package structure: /project and somewhat nonstandard /src directory
Build management/Config
- bigass makefile as the main CLI
- .env.public file (expects there to be a single schema (LINKML_SCHEMA_NAME))
- config.yaml for projectgen
- hook to data-harmonizer
- build all formats with gen-project
- build/serve docs
- convert example data formats
Examples (things that will almost certainly be changed by package)
- Person schema
- Tests
- Person data
Integrations
- Google sheets/schemasheets
DevOps
- pypi deployment action
- deploy documentation
- run tests with makefile
Cruft Things
- Update project for new versions of the template

Thoughts on having a template...

So there are a number of things that are currently problems/don't work (eg. the generated python package doesn't have an __init__.py file and the pyproject.toml file doesn't configure packages so it imports but there's nothing in it), but i wonder if a cookiecutter/template repo is the kind of thing that we really want as an entrypoint/example for usage.

The template is relatively opinionated about the structure of a project - eg. using poetry, dynamic versioning, how docs are built and hosted, directory structure (eg. of projects, docs within src, etc.). That might be a useful starting point for people who have never made a python package at all before, but I also feel like it's biting off a lot, and may be mistargeted on both ends of the experience spectrum - making a python package might be a lot of complexity for new python devs to start, and experienced python devs will almost certainly have their own preferred style for all these things. It's also an intrinsically difficult thing to maintain since it has so much and it's probably not something the dev team is working with day to day, so it seems like it would always be at risk of getting out of date/forgotten about - it doesn't support a number of the generators, and the linkml-runtime version is way out of date - and would take a lot of work to get to a point where it's flexible enough to meet a reasonable array of use cases.

the thing that i find myself really liking about it and wanting more generally is the cli entrypoint of make gen-project to be able to keep my built objects in sync with my schema, but i don't necessarily see why i should need to use the whole cookiecutter template to get functionality like that - it seems like for new and experienced users of linkml and python alike i would want to be able to both a) start much more simply: assume i already have a schema (or have gotten one from the docs), make some default config.yaml file with maybe some chooser of formats and default output locations, and then be able to call linkml generate project from anywhere; and b) customize in a much more controllable way: my schemas are going to be named and located in a particular way, i will have some set of configuration options and customizations to the generators, etc. so I want to just be able to have a single config file and otherwise structure my code as i want to.

What else? I think there are a number of things that are like "i don't think we should try and do this," but others i'm not so sure and could go either way. Eg. I don't think we should handle python package generation, dependency management, docs building, CI/CD, include default schemas in the project, etc. I also think that the idea of being able to update a project with cruft or another template manager is interesting but i don't see how that would go in practice, one would have to keep everything in the same structure as the template, but it would be somewhat unpredictable what matters and what doesn't, and i can imagine that being impossible for most projects almost immediately. But what of the above functionality aside from the build cli stuff do we want to keep?

Pitch for config + cli project manager

I will try not to repeat what i said above, but lemme add some additional thoughts:

so in any case i think all the configuration should be unified into a single place (that has its own linkml schema) rather than being spread out across config.yaml, Makefile, .env, pyproject.toml, cruft config, etc. But also all the ways we use yaml are representable in TOML as well. We could do both, similar to how pytest and other tools will look for config files in some separate .ini file but also check pyproject.toml for config in some hierarchy of priority. So one could have a [tool.linkml] section and be able to run linkml generate.

Another thing that is a bigger-picture need is a) schema discoverability, b) schema identification (ie. having a correct and stable URL), and crucially c) relationship between generated artifacts and the source schema. I had alluded to this in the February workshop, but it would be nice to have something like a .well-known-like system so it would be easy to know if a repository or website or whatever had any linkml schemas in it and where they were. If we were to have a stereotyped set of places where configs could go (pyproject.toml, or a linkml.yaml file at the repo root) that would be one part of that need that would both tell us where the linkml schemas are and how they relate to the built artifacts. So you could imagine an import system (as i have also proposed previously: https://github.com/orgs/linkml/discussions/1739#discussioncomment-8600477 ) that imports from a repo (rather than from a single URL in classic linked data style) to say "import schema_xyz from <repo_url> at this version" and then the maintainers could move it around, change it, etc. and you have a much more useful way of importing from other linkml projects than simply by URL, which are much more likely to break, don't support versioning, and many schemas simply don't have them in a meaningful way. And that takes us one step closer towards a full python virtualization of linked data schema: pydantic generated models can already be losslessly inverted to a linkml schema (depending on your metadata mode ofc), and if it was possible to say "this module.submodule.ClassName is linked to schema/submodule.yaml according to the project configuration" and have that also be true for built artifacts in general, we could fill in any missing parts the generators can't handle and make the generated artifacts true interfaces to linked data as well as useful ways of manipulating it.
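As a toy illustration of the "this module.submodule.ClassName is linked to schema/submodule.yaml" idea: if the project config records input/output pairs, the mapping back from a generated class to its source schema is a simple lookup. The `BUILDS` list and `schema_for` function are hypothetical, and the sketch assumes one generated Python module per schema file.

```python
from typing import Optional

# Hypothetical build records taken from a project config:
# (source schema path, generated module path)
BUILDS = [
    ("schema/submodule.yaml", "module/submodule.py"),
]


def schema_for(dotted_name: str, builds: list[tuple[str, str]]) -> Optional[str]:
    """Map 'module.submodule.ClassName' to the schema that generated its module."""
    # Drop the class name, then turn the dotted module path into a file path.
    module_path = "/".join(dotted_name.split(".")[:-1]) + ".py"
    for schema, output in builds:
        if output == module_path:
            return schema
    return None
```

Here `schema_for("module.submodule.ClassName", BUILDS)` resolves to `schema/submodule.yaml`, and a class from a module that no build produced resolves to `None`.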

i'll leave this very long post there for now bc i need to head out the door for something but just some thoughts from this morning

0 replies

dalito · 2024-08-19T01:13:27Z

dalito
Aug 19, 2024
Collaborator

Mainly to have better support for Windows users, who typically don't have make, but also to make writing tasks easier, I explored some alternatives to make/makefiles. I submitted a PR that explores using duty, which can be pipx-installed just like cruft.

While doing this I also noticed that the commands are not that well named and it is quite unclear what they do in detail without looking into the code (=makefile). Also the use of a .env.public file for some but not all parameters is a bit confusing.

0 replies

StroemPhi · 2024-10-04T09:37:50Z

StroemPhi
Oct 4, 2024
Collaborator

Hi, I just wanted to provide some feedback on this discussion as a "normal/noobie" user of LinkML.

I'm not a developer. I only have rudimentary Python skills, that will allow me to write scripts and basic packages. I like to learn how to do things by looking at concrete working examples. So an example repo with commented code, a default schema, a default data example, and a default test.py helps me to learn how to use LinkML and basic Python testing/packaging. The cookie-cutter template served me as such an example repo a lot along with consulting the docs.

If I understand the "config + cli project manager" discussed here correctly, then I also think this would make certain things much simpler. So I'd assume the template/example repo that I want/need could then consist only of an example config (with comments that tell me what each parameter in the config does) and a Readme that explains how to use the linkml build command to build my repo from this config.

But how would the testing and doc generation be handled then?

What I like about the cookiecutter template and would miss if this functionality would be gone:

- There is an initial interactive prompt through which I can provide the basic details of my new schema upfront, like name, description, license, and root class.
- That I can simply call make setup to have a "standard" repo seeded from what I provided in the initial interactive prompt
  - or to re-initialize the repo by calling make setup when I feel like I messed up and want to start from scratch
- That I can:
  - call make test to check my schema against my example YAML continuously
  - see changes in the docs locally by calling make testdoc.
- That the docs will be updated and tests will be run automatically before I merge a PR. So having these GH actions/workflows predefined for me is crucial.

What I don't need/like:

- The fact that I have to provide a GitHub repo/org/name upfront is unnecessary IMHO; where I want to host my repo (GitHub/GitLab, which org etc) should be left to me.
- The fact that it is currently hard to understand which "config" file does what when calling Makefile commands.

I hope this feedback helps. I really like LinkML and its community, so thanks for all your help and work!

1 reply

sneakers-the-rat Oct 4, 2024
Collaborator

thanks for the feedback :):)
