Integration with Cluster Management
This recipe demonstrates basic usage of cluster management software from within SCons. It provides guidance on how to structure your sources and targets so that SCons natively understands the dependency tree and can run jobs through a cluster manager such as slurm or Grid Engine. An example would be using SCons to drive a numerical experiment where you need to build an executable, run it under a wide variety of inputs, and then generate figures or tables from the results.
This recipe makes the following assumptions:
- You are already familiar with SCons and the usage of `Command`
- You are familiar with the cluster management tool you are using and, in particular, understand how to dispatch jobs
- You have build steps that would benefit from running on a cluster
This recipe uses slurm, but the principles should be applicable to any other cluster management tool.
The first step is to ensure all of your source and target names are deterministic. It is common for cluster management tools to create files such as `<some deterministic tag>-<incrementing job id>.out`, where the first tag is easy to predict but the incrementing job ID is only known after a job is submitted to the cluster. This does not work well with SCons' dependency tracker. Instead, make liberal use of your cluster management tool's redirection flags to rename the file. This ensures SCons is able to determine when a dependency is out of date and relaunch jobs accordingly.
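For example, with slurm you can point the output of `srun` directly at the target SCons expects using its `--output` flag. A minimal sketch, assuming a hypothetical `task.py` and the `env` defined later in this recipe:

log = env.Command("example.log", "task.py",
                  action="srun --output=$TARGET python $SOURCE")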
Let's take the following steps as the high-level description of what the build needs to accomplish:
- Run some task locally on a single host
- Run a long running job on a single host of the cluster
- Distribute many jobs across an arbitrary number of hosts on the cluster
Each step is described in the following sections. The full working examples are provided at the end of this recipe.
Running a job on the local host uses the standard SCons utilities to define a target. This could be from a `Builder`, `Command`, `AddMethod`, or any other means to create the target from the sources.
table = env.Command("example-table.txt", "generate-table.py",
                    action="python $SOURCE -o $TARGET $N",
                    varlist=("N",), N=N)
The next step is to send a single job to the cluster. In principle, this would be a long running job or, perhaps, a job that needs additional resources available on a cluster host. The key detail is to use the single job submission command (`srun` for slurm, `qrsh` or `qlogin` for Grid Engine, etc.).
stem = File("table")
tables = env.Command([f"{stem.name}.{_:05d}.txt" for _ in range(N)],
                     ["split-tables.py"] + table,
                     action="srun python ${SOURCES[0]} -s $stem ${SOURCES[1]}",
                     varlist=("stem",), stem=stem)
Notice a few key details in this recipe. First, the file names are deterministic and are passed to `env.Command`. This is critical to allow SCons to know what files will be generated so it can track any downstream dependencies and what potentially needs to be cleaned. Second, the stem of the generated files is declared as a `File` node so it is available as the stem of the targets and resolves to the correct path when variable substitution occurs for the action string.
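If you want to verify how the action string will expand before running it, `env.subst` can be used with the same overrides. This is a minimal inspection sketch (assuming the nodes defined above), not part of the build itself:

check = env.Clone(stem=stem)
print(check.subst("srun python ${SOURCES[0]} -s $stem",
                  source=[File("split-tables.py")] + table))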
The last step is distributing a collection of jobs to the cluster. This primarily targets array jobs, where each job receives the same submission command and some logic handles the adjustments based on the job's specific ID in the array.
action = ["sbatch", "--wait", f"--array=0-{N-1}", "--output=/dev/null",
          f"--wrap='python ${{SOURCES[0]}} {stem.name}.`printf %05d $$SLURM_ARRAY_TASK_ID`.txt'"]
pngs = env.Command([str(Path(_.name).with_suffix(".png")) for _ in tables],
                   ["imshow.py"] + tables, action=" ".join(action))
Notice a few key details in this recipe. First, as before, the targets have predictable names that are passed to SCons. Second, the `--wait` flag tells slurm to wait for all of the jobs to complete before returning. This allows SCons to pause and wait for all of the targets to be generated (or a failure) before it tries to build any dependencies. Third, the output is specifically redirected to `/dev/null` to prevent the creation of unpredictable files. If you want outputs from the jobs, explicitly log them to a predictable file name.
Also note the use of `--wrap` in this action. In this context, it tells `sbatch` to wrap its argument in a simple shell script for submission. This allows the command to gain access to the job's task ID via `$SLURM_ARRAY_TASK_ID`. Alternatively, you could provide a script that handles obtaining the task ID and running the appropriate job, as sketched below. The key point is that each task in the array uses the same command line and arguments.
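For example, such a driver script might look like the following sketch. This `run-task.py` is hypothetical and not one of the recipe's files:

#!/usr/bin/env python
# run-task.py: hypothetical driver submitted to sbatch in place of --wrap.
# Every array task runs this same command line; the task ID picks the input.
import os
import subprocess
import sys

task = int(os.environ["SLURM_ARRAY_TASK_ID"])
subprocess.run([sys.executable, "imshow.py", f"table.{task:05d}.txt"],
               check=True)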
The following are the full files used in the snippets above. First, the SConstruct:
#!/usr/bin/env python
import os
from pathlib import Path
env = Environment(ENV=os.environ)
N = 100
"""A nominal size we will need multiple times"""
# Do the big long preprocessing step that has to be done locally before
# we can do other work.
table = env.Command("example-table.txt", "generate-table.py",
                    action="python $SOURCE -o $TARGET $N",
                    varlist=("N",), N=N,
                    )
# Now run some potentially long running task that depends on the
# previous step, but must be run on a single host. This can be
# dispatched using srun.
stem = File("table")
tables = env.Command([f"{stem.name}.{_:05d}.txt" for _ in range(N)],
                     ["split-tables.py"] + table,
                     action="srun python ${SOURCES[0]} -s $stem ${SOURCES[1]}",
                     varlist=("stem",), stem=stem,
                     )
# Now we do something with our results from above. This is the
# embarrassingly parallel portion that can be distributed across the
# hosts using sbatch. Note, this still requires the user to know a sane
# set of flags to sbatch to get the array jobs right. This includes
# using `--wrap` to gain access to `$SLURM_ARRAY_TASK_ID` across the
# jobs. The logic to grab the task ID could potentially be in the
# script submitted to `sbatch` instead.
action = ["sbatch", "--wait", f"--array=0-{N-1}", "--output=/dev/null",
          f"--wrap='python ${{SOURCES[0]}} {stem.name}.`printf %05d $$SLURM_ARRAY_TASK_ID`.txt'",
          ]
pngs = env.Command([str(Path(_.name).with_suffix(".png")) for _ in tables],
                   ["imshow.py"] + tables, action=" ".join(action)
                   )
Next, generate-table.py:
#!/usr/bin/env python
#
#SBATCH -o /dev/null
import argparse
import numpy
parser = argparse.ArgumentParser(description="""
Generate a table of values that can be used to demonstrate the
processing pipeline. This generates N tables with y rows and x columns
with a random variable in the range [vmin, vmax) for the colormap. The
tables will need to be split into individual tables by a later
processing stage.
""")
parser.add_argument("N", type=int, help="the number of tables to generate")
parser.add_argument("seed", nargs="?", type=int, default=42,
help="the seed for the random number generator")
parser.add_argument("-x", type=int, default=32,
help="the width (columns) of each table: %(default)s")
parser.add_argument("-y", type=int, default=16,
help="the height (rows) of each table: %(default)s")
parser.add_argument("-m", "--vmin", type=float, default=0.0,
help="the minimum value: %(default)s")
parser.add_argument("-M", "--vmax", type=float, default=1.0,
help="the maximum value: %(default)s")
parser.add_argument("-o", "--output", type=argparse.FileType("w"),
default="-", help="the output file")
args = parser.parse_args()
xx, yy = numpy.meshgrid(numpy.arange(1, args.x + 1),
                        numpy.arange(1, args.y + 1),
                        indexing="ij")
rng = numpy.random.default_rng(seed=args.seed)
zz = rng.random(numpy.prod(xx.shape) * args.N)
tt = numpy.hstack([numpy.ones(xx.flatten().shape) * _
                   for _ in range(args.N)])
numpy.savetxt(args.output, numpy.vstack((tt.flatten(),
                                         numpy.tile(xx.flatten(), args.N),
                                         numpy.tile(yy.flatten(), args.N),
                                         zz.flatten())).T)
Next, split-tables.py:
#!/usr/bin/env python
#
#SBATCH -o /dev/null
import argparse
import numpy
parser = argparse.ArgumentParser(description="""
Generate a single table for easier parsing later. This is a stand in
for files that could be generated by a process on a single host that
generates multiple outputs.
""")
parser.add_argument("table", help="the full input table")
parser.add_argument("-s", "--stem", default="table", help="output stem")
args = parser.parse_args()
table = numpy.loadtxt(args.table)
for count, frame in enumerate(numpy.unique(table[:, 0])):
    with open(f"{args.stem}.{count:05d}.txt", "w") as _:
        print(table[:, 3].min(), table[:, 3].max(), numpy.nan, numpy.nan,
              file=_)
        numpy.savetxt(_, table[table[:, 0] == frame])
Finally, imshow.py:
#!/usr/bin/env python
#
import argparse
import pathlib
import numpy
from matplotlib import pyplot
parser = argparse.ArgumentParser(description="""
Generate an image plot of a table.
""")
parser.add_argument("table", help="the individual table to plot")
parser.add_argument("-s", "--show", action="store_true",
help="show the figure instead of saving")
args = parser.parse_args()
table = numpy.loadtxt(args.table)
fig, axs = pyplot.subplots(1, 1, layout="constrained")
img = axs.imshow(table[1:, 3].reshape((len(numpy.unique(table[1:, 2])),
                                       len(numpy.unique(table[1:, 1])))),
                 vmin=table[0, 0], vmax=table[0, 1])
cbr = fig.colorbar(img, ax=axs)
axs.set_xlabel("$x$")
axs.set_ylabel("$y$")
cbr.set_label("$z$")
if args.show:
    pyplot.show()
else:
    print(args.table)
    fig.savefig(pathlib.Path(args.table).with_suffix(".png"))
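With these four files in place, running `scons` from the directory containing the SConstruct (assuming `scons`, slurm, numpy, and matplotlib are available on the submit host) drives the whole pipeline: the table is generated locally, split via `srun`, and plotted through the `sbatch` array job. `scons -n` previews the commands without submitting anything, and `scons -c` removes the predictable targets.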