This is a GitHub action that discovers 🗺️ container executables! It is used by the shpc-registry-cache.
You provide a listing of container resource identifiers (via a text file),
and the action stores namespaced (based on OCI or Docker registry)
cache entries in your location of choice (defaults to
your repository root). In addition to the cache of json files with container
executables that are discovered on the path, we save a counts.json
(essentially a summary of counts across the cache) and a skips.json
(a cache of containers whose filesystem we could not extract and that we should not try again).
You will need to provide a text file with container URIs to check. An example is provided in the repository: containers.txt. The idea is that you might dynamically generate this file from a resource (e.g., for the shpc registry cache we derive this list from the BioContainers depot). Once you have the list, the action does the following:
- We install shpc and the [guts software](https://github.com/singularityhub/guts)
- We run the update_biocontainers.py script that:
  - Parses the latest listing of containers from the BioContainers Depot
  - Generates a unique list of containers and the latest (first appearing) tag (see footnote 1)
  - Reads in the skips.json, a cached list of containers that we skip because their guts were not extractable (see footnote 2)
  - For every new identifier to add:
    - Prepares a directory to store the new cache entry (a json file)
    - Uses the pipelib software to sort tags and get the latest
    - Uses the guts ManifestGenerator to retrieve a listing of paths and associated files within
    - Filters out known patterns that are not executables of interest
    - Writes this output of aliases to the filesystem under the container identifier as a json file
- After new aliases are added, calculate_frequency.py is run to update global counts.json
The result is alias-level data for each container, along with a global set of counts.
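As a rough illustration of the final counting step, here is a minimal sketch of how global counts could be derived from the cache entries. It is not the actual calculate_frequency.py; the function name `count_executables` is hypothetical, and it assumes each entry is a json dictionary mapping a path to a list of file names, with each executable counted once per container.

```python
import json
from collections import Counter
from pathlib import Path


def count_executables(cache_root):
    """Count how many cache entries each executable name appears in.

    Assumption: every entry is a json dictionary of {path: [file names]}.
    """
    counts = Counter()
    for entry in Path(cache_root).rglob("*.json"):
        # Skip the summary files that live next to the cache entries
        if entry.name in ("counts.json", "skips.json"):
            continue
        paths = json.loads(entry.read_text())
        # Collapse to a set so a name found in two directories of the
        # same container is only counted once for that container
        names = {name for files in paths.values() for name in files}
        counts.update(names)
    return dict(counts)
```

Counting once per container (rather than once per path) is a design choice of this sketch, so that a frequency reads as "number of containers that provide this executable".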
As an example of the usage of this cache, we use these cache entries to populate the Singularity HPC Registry. On a high level, shpc-registry is providing install configuration files for containers. Docker or other OCI registry containers are installed to an HPC system via module software, and to make this work really well, we need to know their aliases. This is where data from the cache comes in! Specifically for this use case this means we:
- Identify a new container, C, not in the registry from the executable cache here
- Create a set of global executable counts, G
- Define a set of counts from G in C as S
- Rank order S from least to greatest
- Include any entries in S that have a frequency < 10
- Include any entries in S that have any portion of the name matching the container identifier
- Above that, add the next 10 executables with the lowest frequencies, and < 1,000
The frequencies are calculated across the cache here, included in counts.json. This produces a container configuration file with a likely good set of executables that represent the most unique to that container, based on data from the cache.
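The selection steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the shpc-registry implementation: the function name `select_aliases` and its parameters are hypothetical, and "any portion of the name matching the container identifier" is interpreted narrowly as the repository name appearing inside the executable name.

```python
def select_aliases(container_id, executables, global_counts,
                   rare_below=10, extra=10, extra_below=1000):
    """Pick likely aliases for container C given global counts G."""
    # Repository name without registry, organization, or tag
    # (assumed interpretation of "matching the container identifier")
    repo = container_id.split("/")[-1].split(":")[0].lower()

    # S: the executables of C ranked by global frequency, least first
    ranked = sorted(set(executables),
                    key=lambda name: (global_counts.get(name, 0), name))

    selected, rest = [], []
    for name in ranked:
        freq = global_counts.get(name, 0)
        if freq < rare_below or repo in name.lower():
            selected.append(name)
        else:
            rest.append(name)

    # Above that, add the next `extra` lowest-frequency executables,
    # as long as their frequency stays under `extra_below`
    candidates = [n for n in rest if global_counts.get(n, 0) < extra_below]
    return selected + candidates[:extra]
```

For a container such as quay.io/vanessa/salad:latest, a rare executable named `salad` would be selected by both rules, while something as common as `ls` would be dropped once its global frequency reaches 1,000.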
To learn more about Singularity Registry HPC you can:
- 📖️ Read the documentation 📖️
- ⭐️ Browse the container module collection ⭐️
You will minimally need a text file, with one container unique resource identifier (with or without a namespace) per line. See containers.txt and biocontainers.txt for examples. A table of variables for the action is shown below, along with example usage. The assumption is that you are running the action after having checked out the repository you want to store the cache in.
Name | Description | Required | Default |
---|---|---|---|
token | a ${{ secrets.GITHUB_TOKEN }} to open a pull request with updates | true | unset |
root | Path of the cache root (defaults to PWD) | false | pwd |
listing | text file with listing of containers, one per line. | true | unset |
namespace | namespace to add to each container in the listing | false | unset |
org-letter-prefix | set to true to add a letter directory before the organization name (e.g., docker.io/l/library/ubuntu:latest) | true | false |
repo-letter-prefix | set to true to add a letter directory before the repository name (e.g., docker.io/library/u/ubuntu:latest) | true | false |
registry-letter-prefix | set to true to add a letter directory before the registry name (e.g., d/docker.io/library/ubuntu:latest) | true | false |
dry_run | don't push changes (dry run only) | false | false |
branch | branch to push to | false | main |
As an example of namespace, see the biocontainers.txt file. We would want to define namespace as "quay.io/biocontainers" in the action, as the text file only has partial names. For pushing, make sure your repository allows pushes from actions.
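To make the role of the namespace concrete, here is a small sketch of how a listing with partial names could be expanded into full identifiers. The helper name `read_listing` is hypothetical and this is not the action's own code; it only assumes one identifier per line and skips blank lines.

```python
def read_listing(lines, namespace=None):
    """Return unique identifiers in order, optionally adding a namespace."""
    names = []
    for line in lines:
        name = line.strip()
        if not name:
            continue
        if namespace:
            # e.g. "quay.io/biocontainers" + "/" + partial name
            name = f"{namespace.strip('/')}/{name}"
        names.append(name)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(names))
```

With the namespace set to quay.io/biocontainers, a partial name on a line of the listing becomes a full identifier under that registry and organization.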
Here is a "vanilla" example updating a container executable cache in the checked out repository present working directory from the containers.txt file.
name: Update Container Cache
on:
  workflow_dispatch:
  schedule:
    - cron: 0 0 * * 3
jobs:
  default-run:
    name: Update Cache
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Update from Containers
        uses: singularityhub/container-executable-discovery@main
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          listing: containers.txt
The remaining recipes assume you have the "on" and "name" directives (these are just jobs).
Do the same, but for a dry run (no GitHub token required):
jobs:
  dry-run:
    name: Update Cache
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Update from Containers
        uses: singularityhub/container-executable-discovery@main
        with:
          listing: containers.txt
          dry_run: true
Set a namespace (e.g., as we'd need for biocontainers.txt):
jobs:
  namespace:
    name: Update Cache (Namespace)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Update from Containers
        uses: singularityhub/container-executable-discovery@main
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          listing: biocontainers.txt
          namespace: quay.io/biocontainers
Set an organization (the repository organization or username) prefix, e.g.,
quay.io/vanessa/salad:latest would be stored under quay.io/v/vanessa/salad:latest.json:
jobs:
  org-prefix:
    name: Update Cache (Org Prefix)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Update from Containers
        uses: singularityhub/container-executable-discovery@main
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          org-letter-prefix: true
          listing: containers.txt
Or set a repository prefix, e.g., quay.io/vanessa/salad:latest would be stored under quay.io/vanessa/s/salad:latest.json:
jobs:
  repo-prefix:
    name: Update Cache (Repo Prefix)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Update from Containers
        uses: singularityhub/container-executable-discovery@main
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          repo-letter-prefix: true
          listing: containers.txt
Finally, set a registry prefix (more unlikely since there are few, but available),
e.g., quay.io/vanessa/salad:latest would be stored under q/quay.io/vanessa/salad:latest.json:
jobs:
  registry-prefix:
    name: Update Cache (Registry Prefix)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Update from Containers
        uses: singularityhub/container-executable-discovery@main
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          registry-letter-prefix: true
          listing: containers.txt
And that's it! If you have a dynamic listing of containers, you'll likely want to write a step before using the action to generate the file.
The pull request will update or create (within the cache root):
- a counts.json file with total counts across the cache
- a skips.json to store as a cache of containers to skip
- a namespaced hierarchy (according to your preferences), e.g., quay.io/vanessa/salad:latest.json, each a lookup dictionary with paths as keys, and binaries / assets discovered there as values.
Note that we filter out patterns that are likely not executables. See the scripts folder to see this logic!
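The letter-prefix options change where each entry lands in the hierarchy. A minimal sketch of that mapping is below; the function name `cache_path` is hypothetical and this is not taken from the container_discovery library. It assumes identifiers of the form registry/organization/repository:tag.

```python
def cache_path(uri, org_letter_prefix=False, repo_letter_prefix=False,
               registry_letter_prefix=False):
    """Map a container identifier to a relative json path in the cache."""
    parts = uri.split("/")
    if len(parts) < 3:
        raise ValueError(f"expected registry/organization/repository, got {uri!r}")
    registry, org, repo = parts[0], parts[1:-1], parts[-1]

    out = []
    if registry_letter_prefix:
        out.append(registry[0])   # e.g. q/quay.io/...
    out.append(registry)
    if org_letter_prefix:
        out.append(org[0][0])     # e.g. quay.io/v/vanessa/...
    out.extend(org)
    if repo_letter_prefix:
        out.append(repo[0])       # e.g. quay.io/vanessa/s/salad...
    out.append(repo + ".json")
    return "/".join(out)


print(cache_path("quay.io/vanessa/salad:latest", org_letter_prefix=True))
# quay.io/v/vanessa/salad:latest.json
```

A letter directory keeps any single folder from holding thousands of children, which is mostly useful for large organizations such as BioContainers.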
The action is powered by a Python library, container_discovery, that is provided and installed alongside the action. Since it is primarily used here, we don't publish it to PyPI. If you want to install it for your own use:
$ git clone https://github.com/singularityhub/container-executable-discovery
$ cd container-executable-discovery/lib
$ pip install .
And then interact with the container_discovery module. You can look at examples under scripts - this is how the action runs!
This registry showcases a container executable cache, and specifically includes over 8K containers from BioContainers. If you would like to add another source of container identifiers, contributions are very much welcome!
This code is licensed under the MPL 2.0 LICENSE.
Footnotes
1. For the step that grabs the "latest" tag, since the container URI (without any tag) can be used to get a listing of all tags, it isn't important to be correct to get the latest tag - this can be easily obtained later in a workflow from the unique resource identifier without a tag.
2. There are several reasons for skipping a container. One is that the guts software is not able to extract every set of container guts to the filesystem. A container that attempts to extract particular locations, or that takes up too much space for the GitHub runner, will be skipped. Another reason is the pipelib software failing to filter a meaningful set of versioned tags and sort them (e.g., the listing comes back empty and there are no tags known to retrieve). In practice this is a small number as a percentage of the total.