Evolution of the RPM package format #3349

pmatilai · 2024-10-01T09:54:33Z

pmatilai
Oct 1, 2024
Maintainer

As we're headed towards RPM 6.0, it seems like a good time to look back at how we got here.

It's widely documented that RPM was originally written by Erik Troan and Marc Ewing, based on experiences with the pms, rpp and pm package managers. It's maybe less well known that the very first Red Hat Linux release did not use rpm but rpp. The first release to use rpm, then at version 1.x and written in Perl, was Red Hat Linux 2.0, released in early 1995. The rpm package format it produced was 1.0. I failed to locate an actual specimen of such a package, but the sources tell their story:

    if ($major > 1) {
        error("Package $packagepath is a version $major RPM package. This version of rpm can only handle
              version 1 RPM packages. Check ftp.redhat.com for information on obtaining a newer version of rpm.");
    }

The familiar RPM package lead and its magic 0xED, 0xAB, 0xEE, 0xDB are already there, and the payload was cpio, but beyond that it gets fuzzy, as Perl is not a language I'm fluent in. For RHL 3.0.3 I was able to obtain some source rpms; the lead there says version 2, and the rpm version shipped was 2.0.3. This was already a C implementation, and poking around those sources sheds some light on the early days. The following comment is from lib/oldheader.c of rpm 2.0.3 - what we now refer to as the rpm lead was initially known as the header:

  /* This *can't* read 1.0 headers -- it needs 1.1 (w/ group and icon fields)
   or better. I'd be surprised if any 1.0 headers are left anywhere anyway.
   Red Hat 2.0 shipped with 1.1 headers, but some old BETAs used 1.0. */

So they were struggling with the obvious limitations of statically sized C structs inside a file format and the inevitable incompatibilities introduced when changing them. It seems a bit amusing from today's perspective, but the world was a very different place 30 years ago. So the major revelation of the rpm v2 format was the introduction of the header - a container structure that can hold arbitrary data addressed through keys. It proved so flexible that the exact same format is still used in rpm today!
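To make "arbitrary data addressed through keys" concrete, here is a minimal sketch of such a container: a small intro, an index of fixed-size entries (tag, type, offset, count) and a data store that the offsets point into. The magic bytes, the type numbers and the tag numbers 1000, 1001 and 1009 are given as I understand the format, not taken from the text above, so treat them as assumptions. Only two data types are handled.

```python
import struct

# Assumed constants, not quoted from the post: header magic and two type codes.
HEADER_MAGIC = bytes([0x8E, 0xAD, 0xE8, 0x01])
RPM_INT32_TYPE = 4    # big-endian 32-bit integer
RPM_STRING_TYPE = 6   # NUL-terminated string


def build_header(entries):
    """Serialize (tag, type, value) triples into intro + index + data store."""
    index = b""
    store = b""
    for tag, typ, value in entries:
        if typ == RPM_INT32_TYPE:
            store += b"\0" * (-len(store) % 4)   # keep integers 4-byte aligned
            data = struct.pack(">I", value)
        else:
            data = value.encode() + b"\0"
        # Each index entry is 16 bytes: tag, type, offset into the store, count.
        index += struct.pack(">IIiI", tag, typ, len(store), 1)
        store += data
    intro = HEADER_MAGIC + b"\0" * 4 + struct.pack(">II", len(entries), len(store))
    return intro + index + store


def parse_header(blob):
    """Return {tag: value} for the int32 and string entries in a header blob."""
    if blob[:4] != HEADER_MAGIC:
        raise ValueError("not a header")
    nindex, hsize = struct.unpack(">II", blob[8:16])
    store_start = 16 + 16 * nindex
    store = blob[store_start:store_start + hsize]
    result = {}
    for i in range(nindex):
        entry = blob[16 + 16 * i:32 + 16 * i]
        tag, typ, offset, _count = struct.unpack(">IIiI", entry)
        if typ == RPM_INT32_TYPE:
            result[tag] = struct.unpack(">I", store[offset:offset + 4])[0]
        elif typ == RPM_STRING_TYPE:
            end = store.index(b"\0", offset)
            result[tag] = store[offset:end].decode()
    return result


blob = build_header([
    (1000, RPM_STRING_TYPE, "hello"),   # assumed: name tag
    (1001, RPM_STRING_TYPE, "1.0"),     # assumed: version tag
    (1009, RPM_INT32_TYPE, 4096),       # assumed: size tag
])
print(parse_header(blob))
```

Adding a new kind of data means adding a new tag number rather than changing a struct, which is why a reader that does not know a tag can simply skip it.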

The v2 format also had an optional signature between the lead and the header, but this was just a hardcoded 256 byte area deemed sufficient to hold a PGP signature (referring to the original Pretty Good Privacy program). Can you imagine what happened next?

That's right - the signature format that had proven inflexible in v2 was replaced in v3 by another header, a container capable of holding arbitrary data. And this again proved so flexible that the essential structure (lead, signature header, package header and payload) is still the same in v4, and will be in v6. But we're getting ahead of ourselves here. The signature header allowed storing all manner of extra data: the MD5 sum of the header + payload (MD5 being very much state of the art then), sizes and, of course, optionally an actual signature. The earliest v3 packages I was able to locate are from Red Hat Linux 4.0 from late 1996. These are still perfectly accessible with the rpm of today, nearly 30 years later. Don't let anybody tell you we don't care about backwards compatibility.
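The four-part structure can be walked with very little code. The sketch below splits a package image into lead, signature header, main header and payload. It assumes a fixed 96-byte lead, the header intro layout (magic, reserved bytes, entry count, store size) and padding of the signature header to an 8-byte boundary. None of those numbers come from the text above, so check them against the format documentation before relying on them.

```python
import struct

LEAD_MAGIC = bytes([0xED, 0xAB, 0xEE, 0xDB])     # from the post
LEAD_SIZE = 96                                   # assumption: fixed-size lead
HEADER_MAGIC = bytes([0x8E, 0xAD, 0xE8, 0x01])   # assumption: header magic


def header_length(blob, offset):
    """Total length in bytes of the header structure starting at offset."""
    if blob[offset:offset + 4] != HEADER_MAGIC:
        raise ValueError("no header at offset %d" % offset)
    nindex, hsize = struct.unpack(">II", blob[offset + 8:offset + 16])
    return 16 + 16 * nindex + hsize


def split_package(blob):
    """Split a package image into its four sections."""
    if blob[:4] != LEAD_MAGIC:
        raise ValueError("not an rpm package")
    sig_start = LEAD_SIZE
    sig_len = header_length(blob, sig_start)
    # Assumption: the signature header is padded to a multiple of 8 bytes.
    hdr_start = sig_start + sig_len + (-sig_len % 8)
    hdr_len = header_length(blob, hdr_start)
    payload_start = hdr_start + hdr_len
    return {
        "lead_major": blob[4],   # format major version stored in the lead
        "signature": blob[sig_start:sig_start + sig_len],
        "header": blob[hdr_start:payload_start],
        "payload": blob[payload_start:],
    }
```

Calling `split_package(open(path, "rb").read())` on a real package should return the compressed payload stream as the last element, provided the assumptions above hold.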

If v3 was so flexible, what's with v4 and now v6? Note how the signature and digest of the v3 package are on header + payload combined? While this is actually handy if you're just verifying an entire package, it means that you cannot verify one without the other. So the integrity of an installed package cannot be verified anymore, because the contiguous payload stream is no longer available: the contents are uncompressed and unpacked all over the place. So v4 introduced header-only signatures and digests. Seems simple enough, until you realize that rpm likes to store extra data in the header during installation: time of install, but also information about any relocated or omitted paths and so on. And once you modify the header, the signatures and digests no longer match. To overcome this, something called the immutable region was invented. Technically the region is just another tag in the header, and older rpm versions ignored it because they didn't know what it was. Rpm 4.x however could use the information stored in that tag to calculate (painfully, I would add) the exact contents of the header as it was at the time it was built and signed. And with this piece of software magic, you could verify the package metadata of installed software. The implementation had (and still has) its flaws, but conceptually this was a big thing that few people understood.
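The idea behind the immutable region can be shown with a toy model. This is not rpm's actual region encoding, which works on the binary index and is considerably more involved. The `serialize` format, the `region` list and the tag names are all hypothetical, and SHA-256 is only a convenient stand-in for whatever digest is used.

```python
import hashlib


def serialize(tags):
    """Toy serialization: one 'tag=value' line per entry, in the given order."""
    return "".join(f"{key}={value}\n" for key, value in tags).encode()


def build(tags):
    """Build a toy header that remembers which tags existed at build time."""
    header = {"tags": list(tags), "region": [key for key, _ in tags]}
    digest = hashlib.sha256(serialize(tags)).hexdigest()
    return header, digest


def install(header, install_time):
    """Installation adds data to the header, outside the recorded region."""
    header["tags"].insert(0, ("INSTALLTIME", install_time))


def original_bytes(header):
    """Reconstruct the header contents as they were when it was built."""
    wanted = set(header["region"])
    return serialize([(k, v) for k, v in header["tags"] if k in wanted])


header, digest = build([("NAME", "hello"), ("VERSION", "1.0")])
install(header, 1727776473)

# The modified header as a whole no longer matches the build-time digest...
print(hashlib.sha256(serialize(header["tags"])).hexdigest() == digest)
# ...but the reconstructed build-time contents still do.
print(hashlib.sha256(original_bytes(header)).hexdigest() == digest)
```

The point of the sketch is only that the header carries enough information about its own build-time shape for the original bytes to be recomputed after installation.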

Another landmark in v4 was so-called compressed filenames. Up to then, paths in the package had been stored as-is, but memory was a very scarce resource back then, and as packages and distros started to grow, this was becoming an issue. So v4 "compresses" the path info by storing directories and basenames separately, joined by an index. This provided considerable memory savings. For some reference, 32MB was a lot of memory in the early 2000s.
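A minimal sketch of the scheme described above: each distinct directory is stored once, each file keeps only its basename plus an index into the directory list. The function names are made up for illustration, and absolute paths are assumed.

```python
def compress_filenames(paths):
    """Split absolute paths into unique directories, basenames and indexes."""
    dirnames, basenames, dirindexes = [], [], []
    seen = {}   # directory -> position in dirnames
    for path in paths:
        head, base = path.rsplit("/", 1)
        dirname = head + "/"
        if dirname not in seen:
            seen[dirname] = len(dirnames)
            dirnames.append(dirname)
        basenames.append(base)
        dirindexes.append(seen[dirname])
    return dirnames, basenames, dirindexes


def expand_filenames(dirnames, basenames, dirindexes):
    """Rebuild the full paths from the three parallel arrays."""
    return [dirnames[ix] + base for base, ix in zip(basenames, dirindexes)]


paths = ["/usr/bin/ls", "/usr/bin/cat", "/usr/share/doc/README"]
dirnames, basenames, dirindexes = compress_filenames(paths)
print(dirnames)     # each directory string appears only once
print(dirindexes)   # one small integer per file instead of a repeated prefix
```

The saving comes from packages that put many files into few directories: the long directory prefix is stored once instead of once per file.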

V4 proved so good it's still in use 20 years later! Well, sort of. Computers got more powerful, distros got larger. Eventually people were banging into 2G package and file size limits, coming from the signed 32-bit integers used for various size fields. For some context, when rpm 4.x was introduced, entire distros still fit on one CD, circa 640 megabytes. My first recollection of a package hitting the gigabyte limits was somebody wanting to package a VMware image in rpm. The other major problem was that MD5 was starting to be considered weak, and then just obsolete.

One of the first things I encountered as a newly hired rpm developer was the demand for a stronger algorithm for the per-file digests rpm stores for each file contained in a package. I was nowhere near ready to tackle such a thing just then, but tackle we did, in rpm 4.6. It could've gone more smoothly but we all survived. On top of that, rpm 4.6 added support for 64-bit package sizes, but as it introduced 64-bit integer tags, older rpm versions would not be able to read such packages at all. Travesty! So what it did was only use these 64-bit tags if the situation actually called for them, and so most people never actually noticed a thing. For better or worse. The 64-bit integer type was reserved quite early in the header development, but as it wasn't implemented before 4.6, it became the single most backwards incompatible change since the v2 format. Should it have already been called v6 then? Maybe, but I sure wasn't ready for that then.
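The "only if the situation calls for it" tactic boils down to a size check at build time. The sketch below is illustrative only: the tag labels `SIZE` and `LONGSIZE` and the default limit are assumptions (the limit is taken from the signed 32-bit description above, and the exact threshold rpm applies is not stated here).

```python
INT32_MAX = 2**31 - 1   # largest signed 32-bit value, the "2G" limit


def size_entry(size, limit=INT32_MAX):
    """Pick the narrowest representation able to hold size.

    Returns a (label, integer type, value) triple. The labels are
    hypothetical stand-ins for a 32-bit tag and its 64-bit counterpart.
    """
    if size < 0:
        raise ValueError("size cannot be negative")
    if size <= limit:
        # Readable by every rpm version, old and new.
        return ("SIZE", "int32", size)
    # Only packages that really need it pay the compatibility price.
    return ("LONGSIZE", "int64", size)


print(size_entry(640 * 1024 * 1024))   # a CD-sized package stays 32-bit
print(size_entry(3 * 1024**3))         # a 3 GiB package needs the 64-bit tag
```

This is why most people never noticed: the incompatible encoding only appears in packages that could not have been built at all before.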

Another important milestone landed in rpm 4.12, where support was added for packaging files over 4GB, using similar compatibility tactics: only enable it if actually necessary. This was still incompatible of course, but by then most of the world already had versions capable of reading those 64-bit header tags, so on that front things were much smoother than with 4.6. However, the real reason for the 4GB file size limit by then was not the header but the payload itself: the cpio format the rpm payload uses could not cover larger files, and there was no newer cpio format that we could've switched to. Rpm doesn't need the metadata in the payload the way tar, cpio and the like have it, because rpm has it in the header. So for packages with large files, rpm switched to a new payload format that only carries an index number to the file in the header metadata. Of course, simple as it may be, now rpm was the only tool capable of reading this format, and so the venerable rpm2cpio tool was not able to handle these packages. A new tool, rpm2archive, was born for that purpose. Few people have run into the new format: ten years later, 4GB is still a hefty package, and rare. They do exist though.
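Conceptually, the payload stops describing files and merely points at them. The sketch below models that with plain Python structures and does not reproduce the on-disk encoding. The function names and the metadata keys `path`, `mode` and `size` are hypothetical.

```python
def pack_payload(files):
    """Separate per-file metadata (header side) from content (payload side).

    files is a list of (metadata dict, content bytes) pairs. Each payload
    entry carries only the file's position in the header metadata.
    """
    header_files = [meta for meta, _ in files]
    payload = [(index, data) for index, (_, data) in enumerate(files)]
    return header_files, payload


def unpack_payload(header_files, payload):
    """Resolve each payload entry through the header to recover metadata."""
    for index, data in payload:
        meta = header_files[index]
        yield meta["path"], meta["mode"], data


files = [
    ({"path": "/usr/bin/hello", "mode": 0o755, "size": 5}, b"hello"),
    ({"path": "/etc/hello.conf", "mode": 0o644, "size": 3}, b"a=1"),
]
header_files, payload = pack_payload(files)
for path, mode, data in unpack_payload(header_files, payload):
    print(path, oct(mode), len(data))
```

Since the size lives in the header, where a 64-bit tag can hold it, the payload no longer inherits the size field limits of the archive format. It also shows why a tool that only understands the archive stream cannot make sense of such a payload on its own.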

Fast forward a few more years and the crypto situation suddenly got dramatically worse: distros were looking for FIPS certification, and in FIPS mode MD5 was considered so weak that it was entirely disabled; you just couldn't calculate a simple MD5 anymore. And by then it was obvious that SHA1 would be next in line. So rpm 4.14.2 added some new things to fend off the crypto battle just a little bit more. We couldn't drop the obsolete MD5 header+payload digest because it's part of the documented rpm format. And while we could've just added another stronger algorithm there, there was still the issue of this requiring both the header and the payload to be useful. So 4.14 added a header-only SHA256 digest and a payload-only SHA256 digest - one that isn't in the vulnerable signature header but inside the main package header, where it is covered by the signature. Achieving that took some interesting gymnastics: as a result, an rpm needs to be assembled first forwards using placeholders for the data to come, and then walked backwards from the payload to the start to update the digests with the real values.
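The forwards-then-backwards assembly can be illustrated with a toy container. The layout here (a 32-byte slot for the header digest, then a header that starts with a 32-byte slot for the payload digest) is invented for the example and is not rpm's format. Only the ordering of the steps is the point.

```python
import hashlib

DIGEST_LEN = 32   # SHA-256 output size in bytes


def build_package(metadata, payload):
    """Assemble a toy package: [header digest][payload digest + metadata][payload]."""
    # Forward pass: lay out every section with zeroed digest slots,
    # because the values that belong there do not exist yet.
    signature_area = bytearray(DIGEST_LEN)
    header = bytearray(DIGEST_LEN) + bytearray(metadata)
    # Backward pass: the payload is final, so its digest goes into the header...
    header[:DIGEST_LEN] = hashlib.sha256(payload).digest()
    # ...and only now is the header final, so its digest can be computed.
    signature_area[:] = hashlib.sha256(header).digest()
    return bytes(signature_area) + bytes(header) + payload


def verify_package(blob, header_len):
    """Check the header and the payload independently of each other."""
    stored_header_digest = blob[:DIGEST_LEN]
    header = blob[DIGEST_LEN:DIGEST_LEN + header_len]
    payload = blob[DIGEST_LEN + header_len:]
    header_ok = hashlib.sha256(header).digest() == stored_header_digest
    payload_ok = hashlib.sha256(payload).digest() == header[:DIGEST_LEN]
    return header_ok, payload_ok


metadata = b"name=hello"
blob = build_package(metadata, b"file data")
print(verify_package(blob, DIGEST_LEN + len(metadata)))
```

Because the payload digest sits inside the header, anything that vouches for the header also vouches for the payload, while each part can still be checked without the other.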

Now we had contemporary crypto for the entire package, but the foundations were still crumbling: if you cannot calculate MD5 or SHA1 on a system, you cannot build legit packages on it! A new rpm format was long overdue, and was so already in 2018. So how come we're STILL stuck with v4? Because in the intervening time rpm has become so ubiquitous that it's no longer sufficient to update rpm itself and a Python program here and there; there's a vast ecosystem built on top of and around it. And introducing a new incompatible format is such a herculean task that you look at it for a minute and then go fishing instead.

People tend to get all hyped up at the mention of a new major version, in anticipation of all the fancy new goodies that it will bring, and that only makes the task look even more herculean. What finally got the v6 ball rolling was the realization that the only way to make it happen is to keep it as compatible as possible, and what more compatible is there than that which rpm can already handle? So in a nutshell, a v6 package is just what v4 with all bells and whistles enabled can do, with obsolete crypto dropped. Which makes it possible to actually deploy such packages in the wild - without all that backwards and forwards compatibility, we'd be stuck with the obsolete crypto of v4 packages for several more years still, because the huge machineries built around rpm, on top of which so much other stuff lives, don't get updated in a hurry. People seem disappointed by how boring this v6 format appears, but that's really the only way to deliver it. And besides, all the interesting stuff is IN the package. The cardboard box around it did its job if the contents got delivered safe!

v6 packages will have other differences of course, but these are more in the metadata department, where compatibility can be managed with rpmlib() tracker dependencies. For more details on the v6 format and compatibility info, head over to #2919.
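As a rough illustration of how tracker dependencies manage compatibility: a package lists the format features it relies on as `rpmlib(...)` requirements, and a reader refuses the package if it does not provide all of them. The sketch below only matches names, whereas the real mechanism also compares versions. The function name and the feature sets in the example are illustrative assumptions.

```python
def missing_rpmlib_features(requires, supported):
    """Return the rpmlib() tracker dependencies the reader cannot satisfy.

    requires: all dependency names a package declares
    supported: set of rpmlib() features this reader implements
    Ordinary dependencies are ignored; they are the solver's business.
    """
    return sorted(
        name for name in requires
        if name.startswith("rpmlib(") and name not in supported
    )


# Hypothetical old reader that predates large file support.
old_reader = {"rpmlib(CompressedFileNames)", "rpmlib(FileDigests)"}
package_requires = [
    "libc.so.6",
    "rpmlib(CompressedFileNames)",
    "rpmlib(LargeFiles)",
]

missing = missing_rpmlib_features(package_requires, old_reader)
if missing:
    print("cannot handle package, missing:", ", ".join(missing))
```

The failure is then an explicit unmet dependency instead of a misread package.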

And to those who made it this far, thanks for your interest!

Pointers for further reading:

P.S. Those wondering what happened to v5: that was somebody else's dream, and we want to avoid confusion.

jerome-diver · 2024-11-04T12:53:33Z

jerome-diver
Nov 4, 2024

@pmatilai It's interesting to know all this history. Thank you for the time dedicated to explaining this part of Red Hat culture.
Would it be humanly conceivable to dare to ask that the old online manual indicate which version of librpm it refers to, and perhaps at the same time that an updated version of the manual be written for the existing versions? That way it would become possible to learn how to use the C API without tearing one's hair out.
Thank you for all this humility and all these good intentions.
