Skip to content

Commit

Permalink
perf: report unused inputs for the tar rule
Browse files Browse the repository at this point in the history
The `mtree` spec passed to the `tar` rule very often selects a subset of the
inputs made available through the `srcs` attribute. In many cases, these
subsets do not break down cleanly along dependency-tree lines and there
is no simple way just pass less content to the `tar` rule.

One prominent example where this occurs is when constructing the tars
for OCI image layers. For instance when [building a Python-based
container image](https://github.com/bazel-contrib/rules_oci/blob/main/docs/python.md),
we might want to split the Python interpreter, third-party dependencies, and
application code into their own layers. This is done by [filtering the
`mtree_spec`](https://github.com/aspect-build/bazel-examples/blob/85cb2aaf8c6e51d5e9e086cc94b94ab896903fb0/oci_python_image/py_layer.bzl#L39).

However, in the operation to construct a `tar` from a subsetted mtree,
it is usually still an unsubsetted tree of `srcs` that gets passed. As
a result, the subset tarball is considered dependent upon a larger set
of sources than is strictly necessary.

This over-scoping runs counter to a very common objective associated with
breaking up an image into layers - isolating churn to a smaller slice of
the application. Because of the spurious relationships established in
Bazel's dependency graph, all tars get rebuilt anytime any content in
the application gets changed. Tar rebuilds can even be triggered by
changes to files that are completely filtered-out from all layers of the container.

Redundent creation of archive content is usually not too computationally
intensive, but the archives can be quite large in some cases, and
avoiding a rebuild might free up gigabytes of disk and/or network
bandwidth for
better use. In addition, eliminating the spurious dependency edges
removes erroneous constraints applied to the build action schedule;
these tend to push all Tar-building operations towards the end of a
build, even when some archive construction could be scheduled much earlier.

## Risk assessment and mitigation

The `unused_inputs_list` mechanism used to report spurious dependency
relationships is a bit difficult to use. Reporting an actually-used
input as unused can create difficult to diagnose problems down the line.

However, the behaviour of the `mtree`-based `tar` rule is sufficiently
simple and self-contained that I am fairly confident that this rule's
used/unused set can be determined accurately in a maintainable fashion.

Out of an abundance of caution I have gated this feature behind a
default-off flag. The `tar` rule will continue to operate as it had
before - typically over-reporting dependencies - unless the
`--@aspect_bazel_lib//lib:tar_compute_unused_inputs` flag is passed.

### Filter accuracy

The `vis` encoding used by the `mtree` format to resiliently handle path
names has a small amount of "play" to it - it is reversable but the
encoded representation of a string is not
unique. Two unequal encoded strings might decode to the same value; this
can happen when at least one of the encoded strings contains unnecessary
escapes that are nevertheless honoured by the decoder.

The unused-inputs set is determined using a filter that compares
`vis`-encoded strings. In the presence of non-canonically-encoded
paths, false-mismatches can lead to falsely reporting that an input is
unused.

The only `vis`-encoded path content that is under the control of callers
is the `mtree` content itself; all other `vis`-encoded strings are
constructed internally to this package, not exposed publicly, and are
all derived using the `lib/private/tar.bzl%_vis_encode` function; all of
these paths are expected to compare exactly. Additionally, it is expected that
many/most users will use this package's helpers (e.g. `mtree_spec`) when
crafting their mtree content; such content is also safe. It is only when
the user crafts their own mtree, or modifies an mtree spec's `content=`
fields' encoding in some way, that a risk of inaccurate reporting
arises. The chances for this are expected to be minor since this seems
like an inconvenient and not-particularly-useful thing for a user to go
out of their way to do.
  • Loading branch information
plobsing committed Sep 27, 2024
1 parent a75770b commit 6517629
Show file tree
Hide file tree
Showing 2 changed files with 109 additions and 5 deletions.
5 changes: 5 additions & 0 deletions lib/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,11 @@ bool_flag(
build_setting_default = True if is_bzlmod_enabled() else False,
)

bool_flag(
name = "tar_compute_unused_inputs",
build_setting_default = False,
)

config_setting(
name = "enable_runfiles",
values = {"enable_runfiles": "true"},
Expand Down
109 changes: 104 additions & 5 deletions lib/private/tar.bzl
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
"Implementation of tar rule"

load("@bazel_skylib//rules:common_settings.bzl", "BuildSettingInfo")
load("//lib:paths.bzl", "to_repository_relative_path")

TAR_TOOLCHAIN_TYPE = "@aspect_bazel_lib//lib:tar_toolchain_type"
Expand Down Expand Up @@ -84,6 +85,7 @@ _tar_attrs = {
doc = "Compress the archive file with a supported algorithm.",
values = _ACCEPTED_COMPRESSION_TYPES,
),
"_compute_unused_inputs": attr.label(default = Label("//lib:tar_compute_unused_inputs")),
}

_mtree_attrs = {
Expand Down Expand Up @@ -125,6 +127,89 @@ def _calculate_runfiles_dir(default_info):
return manifest.short_path[:-9]
fail("manifest path {} seems malformed".format(manifest.short_path))

def _fmt_all_inputs_line(file):
# The tar.all_inputs.txt file has a two columns:
# 1. vis-encoded paths of the files, used in comparison
# 2. un-vis-encoded paths of the files, used for reporting back to Bazel after filtering
path = file.path
return _vis_encode(path) + " " + path

def _fmt_keep_inputs_line(file):
# The tar.keep_inputs.txt file has a single column of vis-encoded paths of the files to keep.
return _vis_encode(file.path)

def _configured_unused_inputs_file(ctx, srcs, keep):
"""
Compute the unused_inputs_list, if configured.
Args:
ctx: `tar` rule context. Must provide `mtree` and `_compute_unused_inputs` attrs , and a `coreutils_toolchain_type` toolchain.
srcs: sequence or depset. The set of all input sources being provided to the `tar` rule.
keep: sequence or depset. A hardcoded set of sources to consider "used" regardless of whether or not they appear in the mtree.
Returns: file or None. List of inputs unused by the `Tar` action.
"""
if not ctx.attr._compute_unused_inputs[BuildSettingInfo].value:
return None

coreutils = ctx.toolchains["@aspect_bazel_lib//lib:coreutils_toolchain_type"].coreutils_info.bin

all_inputs = ctx.actions.declare_file(ctx.attr.name + ".all_inputs.txt")
keep_inputs = ctx.actions.declare_file(ctx.attr.name + ".keep_inputs.txt")
unused_inputs = ctx.actions.declare_file(ctx.attr.name + ".unused_inputs.txt")

ctx.actions.write(
output = all_inputs,
content = ctx.actions.args()
.set_param_file_format("multiline")
.add_all(
srcs,
map_each = _fmt_all_inputs_line,
),
)
ctx.actions.write(
output = keep_inputs,
content = ctx.actions.args()
.set_param_file_format("multiline")
.add_all(
keep,
map_each = _fmt_keep_inputs_line,
),
)

# Unused inputs are inputs that:
# * are in the set of ALL_INPUTS
# * are not found in any content= keyword in the MTREE
# * are not in the hardcoded KEEP_INPUTS set
#
# Comparison and filtering of ALL_INPUTS is performed in the vis-encoded representation, stored in field 1,
# before being written out in the un-vis-encoded form Bazel understands, from field 2.
ctx.actions.run_shell(
outputs = [unused_inputs],
inputs = [all_inputs, keep_inputs, ctx.file.mtree],
tools = [coreutils],
command = '''
"$COREUTILS" join -v 1 \\
<("$COREUTILS" sort -u "$ALL_INPUTS") \\
<("$COREUTILS" sort -u \\
<(grep -o '\\bcontent=\\S*' "$MTREE" | "$COREUTILS" cut -c 9-) \\
"$KEEP_INPUTS" \\
) \\
| "$COREUTILS" cut -d' ' -f 2- \\
> "$UNUSED_INPUTS"
''',
env = {
"COREUTILS": coreutils.path,
"ALL_INPUTS": all_inputs.path,
"KEEP_INPUTS": keep_inputs.path,
"MTREE": ctx.file.mtree.path,
"UNUSED_INPUTS": unused_inputs.path,
},
mnemonic = "UnusedTarInputs",
)

return unused_inputs

def _tar_impl(ctx):
bsdtar = ctx.toolchains[TAR_TOOLCHAIN_TYPE]
inputs = ctx.files.srcs[:]
Expand All @@ -147,15 +232,26 @@ def _tar_impl(ctx):
args.add(ctx.file.mtree, format = "@%s")
inputs.append(ctx.file.mtree)

srcs_runfiles = [
src[DefaultInfo].default_runfiles.files
for src in ctx.attr.srcs
]

unused_inputs_file = _configured_unused_inputs_file(
ctx,
srcs = depset(direct = ctx.files.srcs, transitive = srcs_runfiles),
keep = [ctx.file.mtree, bsdtar.tarinfo.binary],
)
if unused_inputs_file:
inputs.append(unused_inputs_file)

ctx.actions.run(
executable = bsdtar.tarinfo.binary,
inputs = depset(direct = inputs, transitive = [bsdtar.default.files] + [
src[DefaultInfo].default_runfiles.files
for src in ctx.attr.srcs
]),
inputs = depset(direct = inputs, transitive = [bsdtar.default.files] + srcs_runfiles),
outputs = [out],
arguments = [args],
mnemonic = "Tar",
unused_inputs_list = unused_inputs_file,
)

return DefaultInfo(files = depset([out]), runfiles = ctx.runfiles([out]))
Expand Down Expand Up @@ -273,5 +369,8 @@ tar = rule(
doc = "Rule that executes BSD `tar`. Most users should use the [`tar`](#tar) macro, rather than load this directly.",
implementation = tar_lib.implementation,
attrs = tar_lib.attrs,
toolchains = [tar_lib.toolchain_type],
toolchains = [
tar_lib.toolchain_type,
"@aspect_bazel_lib//lib:coreutils_toolchain_type",
],
)

0 comments on commit 6517629

Please sign in to comment.