Introduction of "rpms.lock.yaml" file #2908

onosek · 2024-02-14T09:15:25Z

onosek
Feb 14, 2024

Introduction of rpms.lock.yaml file

Context

My team is currently working on the implementation of a hermetic build process for containers that use RPMs. The build process runs in a network-isolated build environment. To implement this, we need to prefetch all required RPMs and the full chain of their transitive dependencies so that they are available during the build (except for packages that are already installed in the parent container image). As part of this requirement, we also want to strive towards reproducibility. To prefetch all required RPMs, including dependencies, and to be able to prefetch the same set when we re-run the build with the same input parameters, we need a "lock" file similar to the one known from Python: a requirements.txt that is programmatically generated from an input file called requirements.in.

To be transparent and to give you, as RPM ecosystem SMEs, a chance to provide feedback, I would like to present the format of the lock file we designed.

For more details about our requirements for the container build process, you can see SLSA requirements, especially these:

rpms.lock.yaml

A file that contains a list of fully resolved dependencies (their URLs) that cachi2 (https://github.com/containerbuildsystem/cachi2/) will need to download for a hermetic build. This file contains a different list of RPMs per architecture. Only the RPMs listed in this file will be available during the build process as the build process has no access to the internet.

Note: This file contains only RPMs that will be installed on top of the parent image - i.e. RPMs that are required but are already installed in the parent container image are not included in this file.

This will be generated and maintained programmatically based on an input file (rpms.in.yaml) that is out of the scope of this doc.
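
As a rough illustration of the per-architecture layout described above, a consumer could pick out the URLs for one architecture like this. This is only a sketch: the helper name urls_for_arch and the example.com URLs are hypothetical, and the lockfile is assumed to have already been parsed into plain Python dicts and lists (for example by a YAML parser).

```python
# Hypothetical helper, not part of cachi2. `lock` is the lockfile content
# already parsed into Python dicts/lists.
def urls_for_arch(lock, arch):
    """Return the binary RPM URLs listed for one architecture."""
    for entry in lock.get("arches", []):
        if entry.get("arch") == arch:
            return [pkg["url"] for pkg in entry.get("packages", [])]
    # Architecture not present in the lockfile: nothing to prefetch.
    return []


# Made-up data that follows the shape of the lockfile.
lock = {
    "lockfileVersion": 1,
    "arches": [
        {"arch": "x86_64", "packages": [{"url": "https://example.com/foo.x86_64.rpm"}]},
        {"arch": "aarch64", "packages": [{"url": "https://example.com/foo.aarch64.rpm"}]},
    ],
}
print(urls_for_arch(lock, "aarch64"))
```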

📔 Notes about format design

The YAML format is extensible, allows writing a schema for simple validation, is platform independent, and is easily consumable by computers as well as humans.
"repoid" is optional. If defined, it will be propagated to the container build, so RPMs installed during the container build process will have this repoid listed as their origin repo ("From repo" when you run dnf info $PKG). This is beneficial, for example, for the container vulnerability scanning tool Clair.
"sources" are optional. If provided, they will be collected by cachi2 during the container build process to allow generation of source containers.

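To make the validation point a bit more concrete, here is a minimal hand-written check in Python, standing in for a real schema. It is a sketch under assumptions: the function name validate_lockfile is hypothetical, the input is assumed to be already parsed into dicts and lists, and treating "url" as the only required per-package key is my assumption (the notes above only state that "repoid" and "sources" are optional).

```python
def validate_lockfile(lock):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if not isinstance(lock.get("lockfileVersion"), int):
        problems.append("lockfileVersion must be an integer")
    arches = lock.get("arches")
    if not isinstance(arches, list) or not arches:
        problems.append("arches must be a non-empty list")
        return problems
    for i, entry in enumerate(arches):
        if not isinstance(entry.get("arch"), str):
            problems.append(f"arches[{i}].arch must be a string")
        # "sources" is optional, so a missing key is fine.
        for section in ("packages", "sources"):
            for j, pkg in enumerate(entry.get(section, [])):
                # Assumption: every entry needs at least a URL.
                if not isinstance(pkg.get("url"), str):
                    problems.append(f"arches[{i}].{section}[{j}].url must be a string")
                # "repoid" is optional, but must be a string when present.
                if "repoid" in pkg and not isinstance(pkg["repoid"], str):
                    problems.append(f"arches[{i}].{section}[{j}].repoid must be a string")
    return problems


good = {"lockfileVersion": 1,
        "arches": [{"arch": "x86_64",
                    "packages": [{"url": "https://example.com/foo.x86_64.rpm"}]}]}
bad = {"lockfileVersion": 1,
       "arches": [{"arch": "x86_64", "packages": [{"repoid": 42}]}]}
print(validate_lockfile(good))
print(validate_lockfile(bad))
```
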
⚙️ Example

lockfileVersion: 1
lockfileVendor: abcde
arches:
  - arch: x86_64
    packages:
      - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-enhanced-9.1.113-1.fc38.x86_64.rpm
        checksum: sha256:545d77bb579a8fb3e87ecd1d5acf616b4b837612f189206171edad73fd4864ab
        size: 1975462
        repoid: updates
      - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-common-9.1.113-1.fc38.x86_64.rpm
        checksum: sha256:cce5fcc8b6b0312caeca04a19494358888b00c125747f5c2d2bd8f006665c730
        size: 8207912
        repoid: updates
      - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-filesystem-9.1.113-1.fc38.noarch.rpm
        checksum: sha256:8743bcb074aed6aa20914b7d0258cd6938e3642fe3550279bb1c66c6300d936a
        size: 18006
        repoid: updates
      - url: https://mirrors.nic.cz/pub/fedora/linux/releases/38/Everything/x86_64/os/Packages/g/gpm-libs-1.20.7-42.fc38.x86_64.rpm
        checksum: sha256:ad16ec814c4423d007d218a3f45d2e39d3dab00fc8c0d75eef176041594e3970
        size: 20875
        repoid: fedora
    sources:
      - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/source/tree/Packages/v/vim-9.1.113-1.fc38.src.rpm
        checksum: sha256:feeff63354d1d23f6636a575ae0351a71174c60f7c5c01da295c756baa4a6352
        size: 14671827
        repoid: updates-source
      - url: https://mirrors.nic.cz/pub/fedora/linux/releases/38/Everything/source/tree/Packages/g/gpm-1.20.7-42.fc38.src.rpm
        checksum: sha256:ad004fdcb4a95a848fd00d57765476d8d576bf6824fd073f56290b6c52d7e5b2
        size: 250906
        repoid: fedora-source
  - arch: aarch64
    packages:
... SNIP ...
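
The checksum and size fields in the example could be checked after download roughly as follows. This is a sketch, not how cachi2 does it: verify_download is a hypothetical name, and it assumes the checksum value always has the form "<algorithm>:<hex digest>" as in the example, with an algorithm name that Python's hashlib knows.

```python
import hashlib


def verify_download(data, entry):
    """Check downloaded bytes against the size and checksum of a lockfile entry."""
    # Cheap check first: the byte count must match the recorded size.
    if len(data) != entry["size"]:
        return False
    # Assumed format: "<algorithm>:<hex digest>", e.g. "sha256:545d77...".
    algorithm, _, expected = entry["checksum"].partition(":")
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected


# Made-up payload standing in for a downloaded RPM.
data = b"not a real rpm"
entry = {
    "url": "https://example.com/foo.x86_64.rpm",
    "checksum": "sha256:" + hashlib.sha256(data).hexdigest(),
    "size": len(data),
}
print(verify_download(data, entry))
```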

We understand that managing such a lock file manually is going to be very cumbersome and difficult. The long-term plan is to have a tool that will be able to automate it. This however is not within the scope of our minimal viable product.

Possible future extensions

These are some possible extensions that we can envision becoming relevant at some point in the future and that can be easily added because of the YAML format, but they are not planned right now as our use case doesn't need them.

Support for metalink and mirrorlists.
Support for modules.
Checksums of the files.
Mapping of RPMs to the source RPMs.

eskultety · 2024-02-14T09:56:22Z

eskultety
Feb 14, 2024

I think that in the context of reproducibility and secure supply chain SW delivery, lockfiles as a concept make sense, and from my layman's perspective the format looks plausible. However, I don't feel competent enough to review it in depth. What I as a stakeholder am interested in knowing is the following:

whether the RPM/DNF community would officially endorse this format and adopt it natively in some way
whether (speaking of potential native adoption here) there's actually any intersection where the format would be useful to, let's say, DNF itself
whether anyone from the RPM/DNF community sees potential problems/pitfalls and would object to the format as proposed before this finds its way as the de facto standard into secure supply chain SW delivery pipelines where RPMs are involved

0 replies

eskultety · 2024-02-14T11:17:36Z

eskultety
Feb 14, 2024

Looking at the lockfile format I can see a repeating pattern of:

- url: https://someurl
  repoid: foo
- url: https://someotherurl
  repoid: foo
- url: https://anotherurl
  repoid: bar

IOW any tool which wants to process this has to analyze every single package to categorize it under some repository (optionally denoted by repoid), so if repoid isn't provided, such a tool is free to choose how to group uncategorized packages (assigning them to one or multiple repos). Wouldn't it be better if we simplified such a tool's job by explicitly stating which packages belong to which repo? In my mind that would match the current .repo file format better. Consider the following example of what I just wrote:

lockfileVersion: 1
arches:
  - arch: x86_64
    repos:
      - repoid: updates
        rpms:
          - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-enhanced-9.1.031-1.fc38.x86_64.rpm
          - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-common-9.1.031-1.fc38.x86_64.rpm
          - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-filesystem-9.1.031-1.fc38.noarch.rpm
      - repoid: fedora
        rpms:
          - url: https://mirrors.nic.cz/pub/fedora/linux/releases/38/Everything/x86_64/os/Packages/g/gpm-libs-1.20.7-42.fc38.x86_64.rpm
      - repoid: fedora-sources
        rpms:
          - url: https://mirrors.nic.cz/pub/fedora/linux/releases/38/Everything/source/tree/Packages/g/gpm-1.20.7-42.fc38.src.rpm
      # A SINGLE SINK REPO FOR THE REST PACKAGES NOT BELONGING TO A SPECIFIC REPO
      - rpms:
          - url: http://example.com/foo.x86_64.rpm
          - url: http://example.com/bar.x86_64.rpm
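
For comparison, the per-package categorization that a consumer of the flat format has to do could look roughly like this in Python. It is only a sketch: group_by_repoid is a hypothetical name, the input is assumed to be the already-parsed list of package entries for one architecture, and collecting entries without a repoid under the key None is my stand-in for the "sink repo" idea.

```python
from collections import defaultdict


def group_by_repoid(packages):
    """Group package URLs by repoid; entries without a repoid land under None."""
    grouped = defaultdict(list)
    for pkg in packages:
        # .get() returns None when the optional "repoid" key is absent.
        grouped[pkg.get("repoid")].append(pkg["url"])
    return dict(grouped)


packages = [
    {"url": "https://someurl", "repoid": "foo"},
    {"url": "https://someotherurl", "repoid": "foo"},
    {"url": "https://anotherurl", "repoid": "bar"},
    {"url": "http://example.com/foo.x86_64.rpm"},
]
print(group_by_repoid(packages))
```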

10 replies

lubomir Feb 14, 2024

I did not say it's a problem, I just wanted to point out an interesting edge case that happens to be solved by the format already.

eskultety Feb 16, 2024

@lubomir while the format makes it possible, I think having some semi-official guidance on the problem would be preferred, because I'm personally not a fan of redundancy, in this case specifying the same repo multiple times and assigning a different type to it each time. What it would inherently lead to IMO is a repoid conflict that the lockfile processor would have to solve. IOW it would need to be prepared to see the same repoid, which should otherwise be unique, and expect a different type. I'm not sure I like that kind of approach because to me it smells a bit like non-determinism, hence my point on having guidance.

lubomir Feb 16, 2024

You're the one proposing the format change. How do you think the possibility of having different kinds of packages under same repo should be handled? Should it be declared "unsupported"? (I'm fine with that.)

eskultety Feb 16, 2024

No, declaring such a thing unsupported is IMO too harsh. Like I suggested earlier, I would find it acceptable to assume the default type: rpm in this case (or maybe even type: any), in which case producers of such data must accept that mixing multiple types of packages under a single repo in the lockfile may lead to less accurate results, e.g. when producing an SBOM out of the prefetched artifacts. Otherwise, type information should IMO not be deemed of any significance when it comes to the fetched artifacts themselves. IOW the different types of RPMs would be part of the repo after a prefetch the same way as they were part of the original repo, or am I mistaken in my reasoning?

eskultety Feb 16, 2024

> @eskultety I have no problem with this, I would just suggest keeping sources as sources and not rpms, because otherwise you would need heuristic logic in the client consuming this format to detect sources (although it may be as simple as endswith((".src.rpm", ".srpm"))).
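
The suffix heuristic mentioned in the quote could be sketched like this. The function name is_source_rpm is hypothetical, and the check only looks at the file name in the URL path, so it is exactly the kind of heuristic the quote warns about rather than a reliable detection.

```python
from urllib.parse import urlsplit


def is_source_rpm(url):
    """Guess from the file name whether a URL points to a source RPM."""
    # Look only at the path so query strings or fragments don't interfere.
    path = urlsplit(url).path
    return path.endswith((".src.rpm", ".srpm"))


print(is_source_rpm("https://example.com/Packages/g/gpm-1.20.7-42.fc38.src.rpm"))
print(is_source_rpm("https://example.com/Packages/g/gpm-libs-1.20.7-42.fc38.x86_64.rpm"))
```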

In hindsight, 'rpms' wasn't the best choice of term; in reality the more common one truly is packages.

@Tojaj giving this format proposal some more thought, sources are arch independent, so shouldn't they be actually extracted out of that dictionary like this instead?

arches:
  - arch: foo
    repos:
      - repoid: barid
        packages:
          - ...
sources:
  repos:
    - repoid: bazid
      packages:
        - "*.src.rpm"

eskultety · 2024-02-16T13:15:22Z

eskultety
Feb 16, 2024

@Tojaj @lubomir @onosek I just read rpm-software-management/dnf5#833, which actually pitched the idea of standardizing on the format across the whole RPM ecosystem rather than letting other communities implement a similar idea from scratch over and over again. Wouldn't this proposal actually be better suited to that discussion, bringing it back to life again?

0 replies

eskultety · 2024-02-19T11:04:48Z

eskultety
Feb 19, 2024

Something that occurred to me just now: why does the lockfile contain multiple architectures? Isn't the purpose of a more-or-less generic lockfile to address a single use case? IOW the way I'm now looking at the format, it looks like whatever is consuming the lockfile is supposed to prefetch data for multiple parallel container build runs based on architecture, so it feels like a batch operation. I'm ambivalent about such a design, so I wonder, from a given pipeline's black box perspective, whether the nuance of architectures as a list isn't an implementation detail of a given pipeline rather than a generic thing that may be useful outside of the intended private use case.

2 replies

Tojaj Feb 19, 2024

Originally arches were split, but Liora Milbaum proposed that a single file would be better as it would guarantee consistency.
Liora has a background in both Red Hat In-Vehicle Operating System [1] and Bootc [2], which are both potential candidates for using the lock files, and [1] especially cares very much about consistency and safety, so we evaluated this as reasonable.

Building a container for multiple architectures is a single use case, actually a use case that will be more and more common because x86_64 arch domination is shifting (wide adoption of ARM, rise of RISC), so multi-arch builds are a must-have and consistency between arches is desired. I don't think this is (or ever was) a private use case.

If you want to have separate arches, it can still easily be done; the current format doesn't block this in any way. Frankly, it also feels more natural to have the architecture listed in the file directly rather than encoding it into the filename, where it may get lost if you share just the file content (e.g. via a pastebin or GitHub Gists).

[1] https://www.redhat.com/en/blog/new-standard-red-hat-vehicle-operating-system-modern-and-future-vehicles
[2] https://github.com/osbuild/bootc-image-builder
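
To illustrate the point that separate per-arch files remain possible with the current format, a multi-arch lockfile could be split mechanically, for example like this. It is a sketch only: split_by_arch is a hypothetical helper, and it assumes the lockfile has already been parsed into dicts and lists with the top-level layout from the example (header keys plus an "arches" list).

```python
import copy


def split_by_arch(lock):
    """Return {arch: single-arch lockfile} built from a multi-arch lockfile."""
    # Everything except "arches" is treated as header data and copied as-is.
    header = {key: value for key, value in lock.items() if key != "arches"}
    result = {}
    for entry in lock.get("arches", []):
        single = copy.deepcopy(header)
        single["arches"] = [copy.deepcopy(entry)]
        result[entry["arch"]] = single
    return result


multi = {
    "lockfileVersion": 1,
    "lockfileVendor": "abcde",
    "arches": [
        {"arch": "x86_64", "packages": [{"url": "https://example.com/foo.x86_64.rpm"}]},
        {"arch": "aarch64", "packages": [{"url": "https://example.com/foo.aarch64.rpm"}]},
    ],
}
for arch, single in split_by_arch(multi).items():
    print(arch, single)
```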

eskultety Feb 20, 2024

> Building a container for multiple architectures is a single use case, actually a use case that will be more and more common because x86_64 arch domination is shifting (wide adoption of ARM, rise of RISC), so multi-arch builds are a must-have and consistency between arches is desired. I don't think this is (or ever was) a private use case.

I never argued against multiple architectures...

> If you want to have separate arches, it can still easily be done; the current format doesn't block this in any way. Frankly, it also feels more natural to have the architecture listed in the file directly rather than encoding it into the filename, where it may get lost if you share just the file content (e.g. via a pastebin or GitHub Gists).

...nor did I suggest encoding the arch into the file name.
I only commented on the fact that we're treating the lockfile not as an input for a single operation but, from what I can see, for a batch operation. IOW a single lockfile is used for a single artifact prefetch which will then be re-distributed among multiple concurrent arch builds of the same container. My only argument revolves around, let's call it, "pipeline encapsulation", meaning that there should be a single architecture prefetch for a single container build (again, we're getting too specific in the overall use case of the lockfile as proposed, but okay). So, yes, the current format doesn't block single arch use cases; however, I wonder if more than a single arch (doesn't matter which one, the lockfile would tell you that with an attribute) in the lockfile is a desirable design.
Let me elaborate on this some more so that it's clear why I'm even poking it.
As you propose, sources would be tied to architecture, however, that's not always the case (e.g. Fedora) so from that perspective sources should be modeled outside of a given architecture, but that's a Fedora policy, not a DNF constraint, so some sources will be architecture-dependent and some won't which naturally leads to a split design of how source RPMs should be modeled in the YAML depending on what distro and arch you use. Conversely, if you only ever allow the lockfile to describe a single architecture instead of batching, it doesn't matter because you'll only ever process a single architecture, so you'd always model sources the exact same way whether they're tied to an architecture or not, it would be modeled as just another repo like all the others, it would just happen to be for srpms. Maybe I'm being naive but it would IMO simplify the design some more without sacrificing too much on the pipeline use case, you'd just need to have multiple lockfiles stored (one per arch) which to me sounds like a reasonable tradeoff.

ppisar · 2024-02-19T11:45:26Z

ppisar
Feb 19, 2024

Reading all the fuss with repoid, are you sure you are targeting the right project (rpm)? The rpm tool has no notion of repositories. It only works with locally accessible RPM packages. Even specifying remote RPM packages with a URL and letting the rpm tool download them is, although implemented, frowned upon by me: https://bugzilla.redhat.com/show_bug.cgi?id=2216754.

I feel you would rather want to engage with package managers like DNF, zypper, dragora etc. A relevant request for DNF5 is rpm-software-management/dnf5#833.

0 replies

pmatilai · 2024-02-19T11:54:49Z

pmatilai
Feb 19, 2024
Maintainer

Yeah this doesn't seem particularly relevant to rpm itself. But if people want to use this as a depsolver agnostic place to discuss it, you're welcome to do so.

0 replies

dmnks · 2024-02-19T13:17:42Z

dmnks
Feb 19, 2024
Maintainer

I agree with @ppisar and @pmatilai above, this proposal seems to be sitting one "floor" above us 😄 We don't deal with repositories or the distribution of packages in general.

That said, of course, if anything comes out of this discussion that impacts RPM itself, we're happy to help. Also, like Panu said above, feel free to keep the discussion here now that it's ongoing.

0 replies

j-mracek · 2024-02-19T13:30:18Z

j-mracek
Feb 19, 2024

I just want to point out that the structure might be vice versa. Instead of

  - arch: aarch64
    repos:
        repoid:
           packages:

use

packages:
    wget:
        repoid:
            baseos
        arch:
            x86_64
            s390

Then I discovered one problem when one file is used for all architectures. Repositories have the same problem as RPMs: not all repositories are available for all architectures.

I do not recommend combining repositories and URLs for packages. Please pick one option for any particular package, not a combination of them. Repositories use metalinks to distribute network traffic and to improve stability and performance.

0 replies

ffesti · 2024-02-19T13:58:51Z

ffesti
Feb 19, 2024
Maintainer

I wonder if you need to store the repo file. Just having a name is not of much use. Especially for yum/dnf where the name is in the local file and can be changed at will. You will also need to store $release (you obviously already have $arch) to be able to interpret the links in there.

0 replies

voxik · 2024-02-20T08:53:27Z

voxik
Feb 20, 2024

RPM checks dependencies, doesn't it? Then I don't see why RPM should not allow only the dependencies as specified in some lock file, maybe ignoring the full URL just focusing on NVR

7 replies

voxik Feb 20, 2024

Right, RPM does not have any concept of repos. As it (mostly does not care about URLs and generally) does not care about where the RPM comes from. But let me explain what I mean.

This lock file (based on your initial example, and sorry if I don't have the YAML syntax right) in the simplest form could be interesting for RPM IMHO:

lockfileVersion: 1
packages:
  - vim-enhanced-9.1.031-1.fc38.x86_64.rpm
  - vim-common-9.1.031-1.fc38.x86_64.rpm
  - vim-filesystem-9.1.031-1.fc38.noarch.rpm
  - gpm-libs-1.20.7-42.fc38.x86_64.rpm
  - ...

If I did rpm -i vim-enhanced-10.0.038-1.fc38.x86_64.rpm, the command would fail, because the NVR does not match the lock file. If I did rpm -i foo, this would also fail, because the foo package is not listed in the lock file. But doing rpm -i vim-enhanced-9.1.031-1.fc38.x86_64.rpm would work.

On top of that, if the command were rpm -i https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-enhanced-9.1.031-1.fc38.x86_64.rpm, RPM could also check the URL if it was part of the lock file as in your initial proposal.

IOW this would allow version pinning which could probably be useful RPM feature.
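
The pinning check described above could be approximated like this in Python. This is purely illustrative and not an existing rpm feature: is_allowed is a hypothetical name, and it compares file names only, whereas a real implementation would more likely compare the NVR(A) read from the package header.

```python
import posixpath


def is_allowed(package, locked_filenames):
    """Return True if the package's file name is pinned in the lock file."""
    # Works for both local paths and URLs that use "/" as the separator.
    return posixpath.basename(package) in locked_filenames


# File names taken from the simplified lock file above.
locked = {
    "vim-enhanced-9.1.031-1.fc38.x86_64.rpm",
    "vim-common-9.1.031-1.fc38.x86_64.rpm",
    "vim-filesystem-9.1.031-1.fc38.noarch.rpm",
    "gpm-libs-1.20.7-42.fc38.x86_64.rpm",
}
print(is_allowed("vim-enhanced-9.1.031-1.fc38.x86_64.rpm", locked))
print(is_allowed("vim-enhanced-10.0.038-1.fc38.x86_64.rpm", locked))
```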

pmatilai Feb 20, 2024
Maintainer

Just a random observation: Fedora is being used as the example here, but Fedora does NOT preserve the updates history in the repository, only the first (in the base repo) and the last are kept. So any of this basically requires maintaining a local mirror which never deletes packages. And with that, you can't use official repodata either, you need to create your own or use explicit URLs (as the examples here do).

voxik Feb 20, 2024

Fedora also uses Mirror Manager, therefore the explicit URLs (in the original example) are going against that.

eskultety Feb 21, 2024

> If I did rpm -i vim-enhanced-10.0.038-1.fc38.x86_64.rpm, the command would fail, because the NVR does not match the lock file. If I did rpm -i foo, this would also fail, because the foo package is not listed in the lock file. But doing rpm -i vim-enhanced-9.1.031-1.fc38.x86_64.rpm would work.

@voxik how would rpm know where to get that package from without the knowledge of repos or URLs?

voxik Feb 22, 2024

RPM knows that from the command-line parameter. Obviously rpm -i vim-enhanced-10.0.038-1.fc38.x86_64.rpm will look for the package in the current directory. IOW the information about where to get the package to install is always external to RPM.

eskultety · 2024-02-20T09:53:43Z

eskultety
Feb 20, 2024

Just to keep this in sync with what was discussed via a private channel: package checksums (in some form) will need to be introduced to the format.

3 replies

onosek Feb 20, 2024
Author

And here they are. I updated the lockfile format with items checksum and (file) size. There is also an expansion in the header (two new records: lockfileVendor and lockfileType - that distinguishes between rpms.in.yaml and rpms.lock.yaml).

eskultety Feb 21, 2024

> lockfileType - that distinguishes between rpms.in.yaml and rpms.lock.yaml)

I think this field is largely irrelevant, since rpms.in.yaml is definitely out of scope for this discussion (it is just the basis for the actual rpms.lock.yaml), and because it doesn't really distinguish anything. IMO you either have a full lockfile or you don't; rpms.in.yaml is just a template, not even a lockfile per se, and you need a dedicated tool that understands specifically and only this type of description file to get the final rpms.lock.yaml. So, let's not concern ourselves with the former here at all.

onosek Feb 22, 2024
Author

OK, fixed.

Tojaj · 2024-02-26T15:55:11Z

Tojaj
Feb 26, 2024

A thought on a possible (future?) extension of the format.

Based on a remark from the RPM team during a meeting some time ago, where they mentioned that the lock file could serve as a manifest, I was thinking about that and indeed, "manifest" may be a nice potential use case for the format.
(Just for context, right now, the main use-case for the format - if I simplify a lot - is to provide a list of RPM URLs that cachi2 needs to download for hermetic (container) build process and that's it.)

I think that if we add two more pieces of information for each RPM, then this format would be able to serve as a manifest quite well.

The two pieces of information would be:

name (package name) - mainly for convenience, so consumers don't have to parse the URL for that info.
evra (epoch:version-release.architecture) - epoch is usually not part of the filename so without listing it in the file, it would be unknown.

Example of the extended format (btw. the order of attributes doesn't matter as these are dictionaries):

lockfileVersion: 1
lockfileVendor: abcde
arches:
  - arch: x86_64
    packages:
      - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-enhanced-9.1.113-1.fc38.x86_64.rpm
        checksum: sha256:545d77bb579a8fb3e87ecd1d5acf616b4b837612f189206171edad73fd4864ab
        size: 1975462
        repoid: updates
        name: vim-enhanced
        evra: 2:9.1.113-1.fc38.x86_64
      - url: https://mirrors.nic.cz/pub/fedora/linux/updates/38/Everything/x86_64/Packages/v/vim-common-9.1.113-1.fc38.x86_64.rpm
        checksum: sha256:cce5fcc8b6b0312caeca04a19494358888b00c125747f5c2d2bd8f006665c730
        size: 8207912
        repoid: updates
        name: vim-common
        evra: 2:9.1.113-1.fc38.x86_64
... SNIP ...

These two new attributes would be necessary if someone wants to use the format as a manifest - because there is no guarantee that RPM filename in the URL would contain all the important information.
For example, a foo.rpm filename doesn't provide any info about epoch, version, release or architecture of the package, and there is also no guarantee that the actual name of the package is foo.
And because of the ephemeral nature of some repositories and the packages in them, it may not be possible to re-download the exact RPM later from the same URL to read this metadata. That's why I suggest this extension.
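
For consumers of the proposed evra attribute, splitting it into its parts could look roughly like this. It is a sketch: parse_evra is a hypothetical helper, and treating a missing epoch as "0" follows the usual RPM convention rather than anything specified for this format.

```python
def parse_evra(evra):
    """Split 'epoch:version-release.arch' into its four components."""
    head, sep, rest = evra.partition(":")
    if sep:
        epoch = head
    else:
        # No "epoch:" prefix present; assume epoch 0.
        epoch, rest = "0", head
    # The architecture is everything after the last dot.
    version_release, _, arch = rest.rpartition(".")
    # Version and release are separated by the last dash.
    version, _, release = version_release.rpartition("-")
    return {"epoch": epoch, "version": version, "release": release, "arch": arch}


print(parse_evra("2:9.1.113-1.fc38.x86_64"))
```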

Btw. To illustrate the need for these, see the manifests used by coreos: https://github.com/coreos/fedora-coreos-config/blob/testing-devel/manifest-lock.x86_64.json

0 replies

Replies: 12 comments · 22 replies
