
composition: Enable layering of pallets #253

Closed · ethanjli opened this issue Jun 14, 2024 · 3 comments · Fixed by #307
Labels: enhancement (New feature or request)

@ethanjli (Member) commented Jun 14, 2024

Currently, the only way to make a variant of a pallet is to fork it into a new repo and then synchronize changes between the upstream repo and the downstream fork; any changes from other upstream pallets must be manually copied from those upstreams, with no support in git for synchronizing future changes from those upstreams. Making variants of pallets would be easier if we could take a "layering" approach (like how container images can be composed by copying files from multiple other container images). Prototypical motivating use-cases include:

  • Moving generalizable host package deployments (e.g. for machine-name stuff, autohotspot stuff, etc.) from github.com/PlanktoScope/pallet-standard to github.com/PlanktoScope/device-pallet-base, and defining github.com/PlanktoScope/device-pallet-standard as a layer on top of it. github.com/PlanktoScope/device-pallet-base could be the pallet used by the PlanktoScope SD card images in the none hardware type, while github.com/PlanktoScope/device-pallet-segmenter-only could be the pallet used by the PlanktoScope SD card images in the segmenter-only hardware type.
  • Making github.com/PlanktoScope/pallet-standard (or github.com/PlanktoScope/device-pallet-standard) default to providing package deployments for the PlanktoScope HAT, and then making another pallet (e.g. github.com/PlanktoScope/device-pallet-adafruithat) layered on top of it which overrides PlanktoScope HAT-specific package deployments with Adafruit HAT-specific package deployments.
  • Making a custom pallet which tracks github.com/PlanktoScope/pallet-standard as its base but adds a few more package deployments, and maybe disables some unnecessary package deployments. In fact, in PlanktoScope SD card images we might want to just initialize a new pallet which merely declares github.com/PlanktoScope/pallet-standard as a required pallet, so that users can then add new package deployments (or override imported package deployments) without touching github.com/PlanktoScope/pallet-standard.

Just like how pallets have a requirements/repositories subdirectory which is manipulated with the [dev] plt add-repo and [dev] plt rm-repo subcommands, we can add a requirements/pallets subdirectory which also uses forklift-version-lock.yml files, and we can add [dev] plt add-plt and [dev] plt rm-plt subcommands. Note that we'd probably want to provide a way to easily (check for and) switch to newly-released versions of required pallets, as a parallel to #246. This could be implemented by manually adding /requirements/pallets/{pallet path}/forklift-updates.yml files, for example.
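
For concreteness, a required pallet's lock file might look something like this (the field names here are my assumption by analogy with the existing version locks for repository requirements, not a settled schema):

```yaml
# requirements/pallets/github.com/PlanktoScope/pallet-standard/forklift-version-lock.yml
# Hypothetical contents: field names are assumed by analogy with the version locks
# already used for repository requirements, not a settled schema.
type: version
tag: v2024.0.0                                    # assumed release tag
commit: 0123456789abcdef0123456789abcdef01234567  # placeholder commit hash
```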

Then we need to figure out how to organize the inclusion of files (e.g. files/directories under requirements and/or files/directories under deployments) from required pallets, e.g. with one or more of the following options:

  • Mirroring the filesystem structure, we could allow listing files in forklift-includes.yml files which can exist anywhere. Wherever such a file exists, it can specify one or more file imports; each file import consists of:

    • the path of the file (which may be a directory) to import, prefixed with the path of the pallet which provides the file to import (e.g. github.com/PlanktoScope/pallet-standard/deployments/infra)
    • (optionally) a target path to import a file into, relative to the parent directory of the forklift-includes.yml file, if the target subdirectory path is different from the subdirectory path of the file to import relative to its parent pallet (e.g. the /deployments/forklift-includes.yml could specify the target as base-infra; or the /forklift-includes.yml file could specify the target as deployments/base-infra)

    This option (see the first sketch after this list) would closely parallel the syntax for specifying file exports from Forklift packages, and placing forklift-includes.yml files next to the places where files will be imported provides some nice locality between declarations and their effects; it also resembles the way Go packages import other Go packages. On the other hand, allowing forklift-includes.yml files anywhere other than the root of the pallet would add complexity for people who need to understand which files get included where. We could start by only allowing forklift-includes.yml at the root of the pallet and then later experiment with the ramifications (for transitive layering and for composition of files from multiple upstream pallets) of allowing forklift-includes.yml files in subdirectories of the pallet.

    It's not clear to me how we could inherit everything in a required pallet's forklift-includes.yml file but narrow the files imported from it, or combine it with additional files in the same directory provided by some other required pallet. So I'm not sure the advantages of this approach outweigh its limitations and its complexity.

  • Adding a section to forklift-pallet.yml which maps each required pallet to an ordered list of objects, each declaring a list of files/patterns/globs to include or exclude (depending on the object's type) from the respective pallet. The lists are applied in order, so that exclusions declared after inclusions override any overlapping inclusions.

    This option would be simple and transparent, and it would make forklift-pallet.yml a single source of truth for all file imports from required pallets. For users to selectively narrow the inclusion of files from a required pallet, they could just include all the files from it and then exclude particular paths which might conflict with other required pallets. I think we should take this approach, at least for basic use-cases (see the second sketch after this list).

    However, we should think about whether/how this design could enable certain files imported from required pallets to be saved to other target locations (e.g. to rename a deployment). For example, maybe we have source-root and target-root fields so that files in source-root can be imported into target-root instead. Then the inclusion/exclusion globs would be given relative to source-root.

    As an alternative option, we could store these imports in requirements/pallets/{pallet path}/forklift-pallet-imports.yml files. However, I am not sure what (unintended) consequences might result from importing another pallet's requirements directory.

  • Enabling a pallet to declare feature flags, each of which declares a list of files which other pallets can include. This would parallel the way Forklift packages can declare feature flags. We would probably just declare feature flags in forklift-pallet.yml, and then we could add a YAML file at /requirements/pallets/{pallet path}/forklift-pallet-features.yml to select feature flags to enable from the respective required pallet. This makes it easier to declare a public interface from a pallet with pre-defined groups of files which we want to encourage people to reuse (instead of forcing them to reverse-engineer inclusion patterns from inspecting the pallet's file structure), so we might want to use this approach for cleaner composability anyway.

    However, this adds a layer of indirection which makes it harder to figure out which files come from which required pallets. We could mitigate this by automatically generating file import manifests for the required pallets...almost like a checksum, but human-readable. These could be stored as /requirements/pallets/{pallet path}/forklift-pallet-imports-lock.yml which are recomputed (alongside a checksum file to be added by composition: Authenticate pallet requirements #243) whenever we change the corresponding forklift-version-lock.yml or forklift-pallet-features.yml files, for easy inspection without the forklift tool. Or, since we may want to change imports using a text editor without forklift, these could be forklift-pallet-imports-manifest.yml files which are auto-generated for inspection but gitignore'd for version control.
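
For concreteness, here's a first sketch of what the first option's forklift-includes.yml could look like (field names are illustrative, based only on the semantics described above):

```yaml
# /deployments/forklift-includes.yml
# Hypothetical sketch of the first option: each import names a source file
# (prefixed with the providing pallet's path) and an optional target path
# relative to this file's parent directory. Field names are illustrative.
imports:
  - source: github.com/PlanktoScope/pallet-standard/deployments/infra
    target: base-infra  # imported as /deployments/base-infra
```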
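
And here's a second sketch, of what the second option's section in forklift-pallet.yml could look like (again, the field names are illustrative, not a settled schema):

```yaml
# forklift-pallet.yml (hypothetical imports section for the second option;
# field names are illustrative, not a settled schema)
imports:
  github.com/PlanktoScope/pallet-standard:
    - include:
        - /deployments/**
        - /requirements/repositories/**
    - exclude:
        - /deployments/infra/**  # applied after the inclusions, so it wins
```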

We would probably want to enable any files provided by the pallet to override files imported at the same paths from required pallets. This would make it easy to import all files from some other pallet, and then just override a particular package deployment declaration.

We also need to figure out how to ensure that we safely handle transitive requirements among pallets (especially if we are able to include forklift-includes.yml files from other pallets), how we can prevent circular dependencies, and how we can deal with conflicting files among different pallets. For example, the simplest way to prevent file import conflicts among required pallets is by prohibiting a pallet from importing a non-directory file to the same target path from multiple distinct pallets - instead, the pallet must select which pallet that file will be imported from.

To merge all the layered pallets together, we'd probably want to build a merged filesystem which we can use as the underlay, with the pallet itself as the overlay in an overlay filesystem. We might want to export the resulting overlay filesystem separately as merged-pallet in the staged pallet bundle.

@ethanjli (Member, Author) commented Jul 5, 2024

Test cases which must be handled include:

  • Linear transitive imports: pallet A imports files from pallet B, which imports files from pallet C
  • Branched transitive imports: pallet A imports files from pallets B and C, and pallet B also imports files from pallet C
  • Diamond-shaped transitive imports: pallet A imports files from pallets B and C, and pallets B and C import files from pallet D
  • Import cycle: we do not allow pallet A to import files from pallet B and pallet B to simultaneously import files from the same version of pallet A. In other words, a pallet must not be allowed to be in the transitive closure of its required pallets.

In each case, we need to handle overrides of particular files which might be in any pallet.

Thinking carefully about the tree structure of this file import problem may help make the design of the file import mechanism more rigorous with respect to combining transitive file imports from disparate pallets.

@ethanjli (Member, Author) commented Jul 6, 2024

Here's a formal mathematical description of the file import-based pallet layering mechanism I am considering; the process of constructing this description helped me to design this mechanism, but it might not be very useful beyond that purpose (because we can't run it through a logic checker, and documentation should be done in English):

  • Pallet declarations: A pallet has a set of intrinsic files, and (after imports are evaluated through a process called "merging") a set of imported files which are virtually overlaid by Forklift under the intrinsic files whenever the pallet is queried (e.g. for staging). For describing the pallet layering mechanism, the declaration of any pallet P can be formally expressed as a 5-tuple P = ( D(P), IT(P), IS(P), N(P), F(P) ), consisting of:
    • Pallet dependencies: let D(P) be the set of other pallets from which pallet P imports files.
      • Pallet P's dependency on pallet p is specified by a (path of pallet p)/forklift-version-lock.yml file in the /requirements/pallets/ subdirectory.
    • File imports from dependencies:
      • Target paths: let IT(P) be a function with domain D(P) which maps from every p ∈ D(P) to the set of paths in P (relative to P's root) of all files imported by pallet P from pallet p; in other words, IT(P) gives us target file/destination paths for file imports from each pallet dependency. Then ∪{IT(P)(p) | p ∈ D(P)} is the set of paths in P of all files imported by P from its pallet dependencies.
      • Source paths: let IS(P) be a function with domain D(P) which maps from every p ∈ D(P) to a function with domain IT(P)(p) which maps from every f ∈ IT(P)(p) to the path in p (relative to p's root) where file f will be imported from; in other words, IS(P) gives us source file paths for target paths for file imports from each pallet dependency, and IS(P)(p) is the transformation which pallet P makes to locate each file it imports from pallet p. Then {IS(P)(p)(f) | f ∈ IT(P)(p)} is the set of all paths in p of files imported by P from p.
      • These path mappings are specified by **/*.imports.yml files in the /requirements/pallets/(path of pallet p) subdirectory.
    • Intrinsic files:
      • File paths: let N(P) be the set of paths (relative to the pallet's root) of all intrinsic files in pallet P.
      • File contents: let F(P) be a function with domain N(P) which maps from every f ∈ N(P) to the contents of the intrinsic file with path f.
      • These files exist in the Git working tree of the pallet.
  • Behaviors & constraints:
    • Merging: Let N'(P) be the set of paths of all files used when querying pallet P after its imports are evaluated. Let F'(P) be a function with domain N'(P) which maps from every f ∈ N'(P) to the contents of the file used when querying the pallet after imports are evaluated. Then merging pallet P = ( D(P), IT(P), IS(P), N(P), F(P) ) is the process of computing P's "merged pallet" P' = ( N'(P), F'(P) ):
      • Every file is intrinsic or imported: we require that N'(P) = N(P) ∪ (∪{IT(P)(p) | p ∈ D(P)}).
      • Intrinsic files take precedence over imported files: we require ∀ f ∈ N(P) : F'(P)(f) = F(P)(f); in other words, the intrinsic files are treated as an overlay over the imported files.
      • At least one value exists for the contents of any imported file: we also require ∀ f ∈ N'(P) - N(P) : ∃p ∈ D(P) : ∃f' ∈ N'(p) : f' = IS(P)(p)(f) ∧ F'(P)(f) = F'(p)(f'); in other words, merging is a recursive process of querying pallet dependencies for the file contents of their own merged pallets, and every imported file must actually exist in the merged pallet of a pallet dependency.
      • At most one value exists for the contents of any imported file: we require ∀ f ∈ N'(P) - N(P) : ∀ p, p' ∈ D(P) : f ∈ IT(P)(p) ∧ f ∈ IT(P)(p') → F'(p)(IS(P)(p)(f)) = F'(p')(IS(P)(p')(f)); in other words, if a pallet does not provide an intrinsic file at a certain path and instead only imports files from multiple other pallets into that target path, those imported files in those other pallets must have identical contents. This is to prevent the need to automatically handle diamond dependency conflicts between file imports, while removing the need to prevent duplicate file imports from different pallet dependencies whose file contents don't conflict. When those pallet dependencies have different files which are imported to the same target path, the conflict among file imports can be resolved either by 1) including an intrinsic file in the pallet which is the result of manually merging the imported files or by 2) adjusting the file imports so that only one file is imported to the target path in question.
    • No circular dependencies: P is not allowed to be a member of the transitive closure of the dependencies of P. This is to prevent circular dependencies which would make the system's behavior more complicated and difficult to understand, even if they might still satisfy the constraints listed above.
    • No import-modifying imports: For every p ∈ D(P), for every f ∈ IT(P)(p), we require that f must not start with /requirements/pallets. This is to prevent the need to recompute D(P), IS(P), and IT(P) after importing each file, which would then require us to commit to a particular ordering for evaluating file imports because certain imported files could interact in complicated ways with the remaining file imports which would make system behavior difficult to understand.
    • No imports of regular files to the pallet root: For every p ∈ D(P), for every f ∈ IT(P)(p), we require that f must not be the path of a non-directory file in P's root directory. This is because some of those files, namely /forklift-pallet.yml and /forklift-repository.yml, contain identifiers which are supposed to be specific to the pallet and thus should not be imported from existing pallets.
    • Caching: for each pallet p ∈ D(P), we can cache N'(p) and {F'(p)(f) | f ∈ N'(p)}. These results could be stored for pallet p in the Forklift cache's /pallets-evaluated/(name of pallet p)@(version or pseudoversion string of pallet p) subdirectory. This removes the need to re-compute merged pallets for every command in the forklift tool which queries a pallet, but more importantly this makes it easy to inspect the contents of a merged pallet.
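
To make the notation concrete, here's a tiny worked example (the pallet names and paths are hypothetical):

```yaml
# A tiny worked example of the notation above, with hypothetical names/paths.
# Pallet P requires one pallet, p = github.com/example/base:
#   D(P)     = { p }
#   IT(P)(p) = { /deployments/infra.deploy.yml }
#   IS(P)(p)(/deployments/infra.deploy.yml) = /deployments/infra.deploy.yml
#   N(P)     = { /forklift-pallet.yml, /deployments/custom.deploy.yml }
# Merging then gives:
#   N'(P) = { /forklift-pallet.yml,            # intrinsic
#             /deployments/custom.deploy.yml,  # intrinsic
#             /deployments/infra.deploy.yml }  # imported from p's merged pallet p'
# If P also had an intrinsic file at /deployments/infra.deploy.yml, F'(P) would take
# its contents from P's intrinsic file (intrinsic files overlay imported files).
```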

The overall result of this layering system should be a declarative equivalent of the imperative FROM and COPY directives for layering container images in multi-stage Dockerfiles/Containerfiles (where we can copy certain files at certain paths from previous stages to certain paths in the current stage, and we can also override those files with different file contents). The difference is that the result of a Dockerfile/Containerfile depends on the order of COPY directives within each stage of the Dockerfile/Containerfile, while my proposed system is meant to make it possible to avoid considering the ordering of file imports (by making the result independent of such ordering, by prohibiting configurations where such ordering would matter).

We can decompose IT(P)(p) and IS(P)(p) as follows:

  • The file imports ( IT(P)(p), IS(P)(p) ) from pallet p into pallet P are collectively declared by a set of import declaration files ID(P)(p). Each import declaration file d ∈ ID(P)(p) is located at /requirements/pallets/(path of pallet p)/(name of file d).imports.yml and can be evaluated into a 2-tuple ( ITD(P)(p)(d), ISD(P)(p)(d) ), consisting of:
    • Target paths: let ITD(P)(p) be a function with domain ID(P)(p) which maps from every d ∈ ID(P)(p) to the set of paths in P (relative to P's root) of all files declared by import declaration file d for pallet P to import from pallet p; in other words, ITD(P)(p)(d) gives us a set of target file/destination paths for file imports. Then we construct IT(P)(p) as ∪{ITD(P)(p)(d) | d ∈ ID(P)(p)}.
    • Source paths: let ISD(P)(p) be a function with domain ID(P)(p) which maps from every d ∈ ID(P)(p) to a function with domain ITD(P)(p)(d) which maps from every f ∈ ITD(P)(p)(d) to the path (declared by import declaration file d) in p (relative to p's root) where file f will be imported from; in other words, ISD(P)(p) gives us the source file paths for some target paths for file imports, and ISD(P)(p)(d) is the transformation which import declaration file d makes to locate each file it imports from pallet p. Then we construct IS(P)(p) as the function which satisfies the following constraints (which, for simplicity/familiarity, are the same kind of constraints as for computing F'(P)):
      • At least one value exists for the source path of any imported file: we require ∀ f ∈ IT(P)(p) : ∃d ∈ ID(P)(p) : f ∈ ITD(P)(p)(d) ∧ IS(P)(p)(f) = ISD(P)(p)(d)(f); in other words, every imported file is declared by some import declaration file.
      • At most one value exists for the source path of any imported file: we require ∀ f ∈ IT(P)(p) : ∀ d, d' ∈ ID(P)(p) : f ∈ ITD(P)(p)(d) ∧ f ∈ ITD(P)(p)(d') → ISD(P)(p)(d)(f) = ISD(P)(p)(d')(f); in other words, if multiple import declaration files for the same pallet dependency declare the same target path, they must also declare the same source path for that target path.
  • For every d ∈ ID(P)(p), ( ITD(P)(p)(d), ISD(P)(p)(d) ) can be constructed as the result of an ordered sequence of imperative operations (adding files and removing files according to matching paths/globs) declared by d.
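
For example, an import declaration file might look something like this (a hypothetical sketch of the schema; the real field names may differ):

```yaml
# requirements/pallets/github.com/example/base/everything.imports.yml
# Hypothetical schema: an ordered sequence of imperative add/remove operations
# over paths/globs, evaluated top-to-bottom within this one file to produce
# ( ITD(P)(p)(d), ISD(P)(p)(d) ).
modifiers:
  - type: add
    source: /deployments             # path in the required pallet p
    target: /deployments             # path in the importing pallet P
  - type: remove
    target: /deployments/infra.deploy.yml  # later operations override earlier ones
```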

With this design, Forklift can evaluate the import declaration files d ∈ ∪{ID(P)(p) | p ∈ D(P)} in an arbitrary order without affecting the value of ( N'(P), F'(P) ) - just like how Forklift can arbitrarily reorder the loading of packages required by a pallet and arbitrarily reorder the loading of package deployments in a pallet without affecting the actual result of applying the pallet. And sensitivity to ordering continues to be allowed within - but limited to - the scope of a single file. For example, the sequence in which a package deployment's feature flags are applied (which determines the order in which certain Docker Compose files override other Docker Compose files, and which may determine the order in which certain file exports override other file exports) is encapsulated within the package deployment file, and the sequence of operations by which ITD(P)(p)(d) and ISD(P)(p)(d) are constructed for any given d ∈ ID(P)(p) is encapsulated within file d.

@ethanjli (Member, Author) commented Sep 7, 2024

Now that I've implemented file import groups declared as files within the /requirements/pallets directory of a pallet, I wonder if I can add a feature flags section (either to forklift-pallet.yml or as a /features directory) to declare named file import groups which can be used by name in other pallets. One nice thing about the /features directory is that it'd be easy to copy a file import group declaration (e.g. to make a modified version) just by copying the file. Additionally, it'd be easy to expose all feature flag declarations imported from other pallets (by just importing files in those other pallets' /features directories).
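
As a purely speculative sketch of the /features idea (the names and schema here are hypothetical), a feature flag could just be a named, copyable import-group file published by the providing pallet:

```yaml
# /features/adafruit-hat.imports.yml
# Purely speculative: a feature flag as a named, copyable file import group,
# reusing the same (hypothetical) schema as the groups under /requirements/pallets.
modifiers:
  - type: add
    source: /deployments/hats/adafruithat
    target: /deployments/hats/adafruithat
```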
