perf: report unused inputs for the tar rule #951

Open
wants to merge 1 commit into
base: main

Commits on Sep 27, 2024

  1. perf: report unused inputs for the tar rule

    The `mtree` spec passed to the `tar` rule very often selects a subset of the
    inputs made available through the `srcs` attribute. In many cases, these
    subsets do not break down cleanly along dependency-tree lines and there
    is no simple way to just pass less content to the `tar` rule.
    
    One prominent example where this occurs is when constructing the tars
    for OCI image layers. For instance, when [building a Python-based
    container image](https://github.com/bazel-contrib/rules_oci/blob/main/docs/python.md),
    we might want to split the Python interpreter, third-party dependencies, and
    application code into their own layers. This is done by [filtering the
    `mtree_spec`](https://github.com/aspect-build/bazel-examples/blob/85cb2aaf8c6e51d5e9e086cc94b94ab896903fb0/oci_python_image/py_layer.bzl#L39).
    
    However, in the operation to construct a `tar` from a subsetted mtree,
    it is usually still an unsubsetted tree of `srcs` that gets passed. As
    a result, the subset tarball is considered dependent upon a larger set
    of sources than is strictly necessary.
    
    This over-scoping runs counter to a very common objective associated with
    breaking up an image into layers - isolating churn to a smaller slice of
    the application. Because of the spurious relationships established in
    Bazel's dependency graph, all tars get rebuilt anytime any content in
    the application gets changed. Tar rebuilds can even be triggered by
    changes to files that are filtered out of every layer of the container.
    
    Redundant creation of archive content is usually not too computationally
    intensive, but the archives can be quite large in some cases, and
    avoiding a rebuild might free up gigabytes of disk and/or network
    bandwidth for better use. In addition, eliminating the spurious
    dependency edges removes erroneous constraints on the build action
    schedule; these tend to push all tar-building operations towards the
    end of a build, even when some archive construction could be scheduled
    much earlier.
    
    ## Risk assessment and mitigation
    
    The `unused_inputs_list` mechanism used to report spurious dependency
    relationships is a bit difficult to use correctly. Reporting an
    actually-used input as unused can create difficult-to-diagnose problems
    down the line.
    
    However, the behaviour of the `mtree`-based `tar` rule is sufficiently
    simple and self-contained that I am fairly confident that this rule's
    used/unused set can be determined accurately in a maintainable fashion.
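
The used/unused determination described above can be sketched in plain Python. This is an illustrative sketch only, not the code in this change: the helper names (`referenced_contents`, `unused_inputs`), the sample paths, and the assumption that every used input appears as the value of a `content=` keyword on an mtree line are all hypothetical. The sample paths contain no characters that need escaping, so their encoded and raw forms coincide.

```python
def referenced_contents(mtree_text: str) -> set[str]:
    """Collect the value of every `content=` keyword in an mtree spec."""
    contents = set()
    for line in mtree_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as the "#mtree" header
        # The first token is the archive path; the rest are keyword=value pairs.
        for field in line.split()[1:]:
            if field.startswith("content="):
                contents.add(field[len("content="):])
    return contents


def unused_inputs(srcs: list[str], mtree_text: str) -> list[str]:
    """Return the inputs that no mtree entry refers to, in input order."""
    used = referenced_contents(mtree_text)
    return [src for src in srcs if src not in used]


# Hypothetical layer spec that keeps only the application code.
MTREE = """\
#mtree
app uid=0 gid=0 mode=0755 type=dir
app/main.py uid=0 gid=0 mode=0755 type=file content=bazel-out/bin/app/main.py
"""

SRCS = [
    "bazel-out/bin/app/main.py",
    "bazel-out/bin/site-packages/requests/api.py",
]

UNUSED = unused_inputs(SRCS, MTREE)
print(UNUSED)
```

In this sketch the third-party file is the only entry that would be reported as unused, so a change to it would no longer invalidate the application layer's tar.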
    
    Out of an abundance of caution, I have gated this feature behind a
    default-off flag. The `tar` rule will continue to operate as it did
    before - typically over-reporting dependencies - unless the
    `--@aspect_bazel_lib//lib:tar_compute_unused_inputs` flag is passed.
    
    ### Filter accuracy
    
    The `vis` encoding used by the `mtree` format to resiliently handle path
    names has a small amount of "play" to it - it is reversible, but the
    encoded representation of a string is not unique. Two unequal encoded
    strings might decode to the same value; this can happen when at least
    one of the encoded strings contains unnecessary escapes that are
    nevertheless honoured by the decoder.
    
    The unused-inputs set is determined using a filter that compares
    `vis`-encoded strings. In the presence of non-canonically-encoded
    paths, false mismatches can lead to an input being falsely reported as
    unused.
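
To make the failure mode concrete, here is a minimal Python sketch using a simplified octal-escape scheme in the spirit of `vis`. It is an assumption-laden illustration: `vis_encode` and `vis_decode` below are hypothetical stand-ins, not the `_vis_encode` function from `lib/private/tar.bzl`, and the real `vis` encoding supports more styles than the three-digit octal escapes handled here.

```python
import re


def vis_encode(path: str) -> str:
    """Canonical form for this sketch: escape only whitespace, control
    bytes, backslash, and non-ASCII bytes as three-digit octal."""
    out = []
    for byte in path.encode("utf-8"):
        if byte <= 0x20 or byte >= 0x7F or byte == ord("\\"):
            out.append("\\%03o" % byte)
        else:
            out.append(chr(byte))
    return "".join(out)


def vis_decode(encoded: str) -> str:
    """Honour every \\ooo escape, including ones the encoder never emits."""
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        encoded.encode("ascii"),
    )
    return raw.decode("utf-8")


# Canonical encoding of a path containing a space.
canonical = vis_encode("app/my file.py")

# A hand-written mtree entry that needlessly escapes the letter "a".
non_canonical = "\\141pp/my\\040file.py"

# Both strings name the same file once decoded...
same_file = vis_decode(canonical) == vis_decode(non_canonical)

# ...but a filter comparing encoded strings sees no match, so the input
# would be wrongly placed in the unused set.
used_in_mtree = {non_canonical}
falsely_unused = canonical not in used_in_mtree

print(canonical, non_canonical, same_file, falsely_unused)
```

Both encoded strings decode to `app/my file.py`, yet the string comparison fails, which is exactly the false mismatch described above.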
    
    The only `vis`-encoded path content that is under the control of callers
    is the `mtree` content itself. All other `vis`-encoded strings are
    constructed internally to this package, are not exposed publicly, and
    are derived using the `lib/private/tar.bzl%_vis_encode` function, so
    these paths are expected to compare exactly. Additionally, many, if not
    most, users are expected to use this package's helpers (e.g.
    `mtree_spec`) when crafting their mtree content; such content is also
    safe. A risk of inaccurate reporting arises only when the user crafts
    their own mtree, or modifies the encoding of an mtree spec's `content=`
    fields in some way. The chance of this is expected to be small, since it
    seems like an inconvenient and not-particularly-useful thing for a user
    to go out of their way to do.
    plobsing committed Sep 27, 2024
    6517629