Improve jobstats, progress #35
richelbilderbeek committed May 8, 2024
1 parent 4f1c680 commit 505cd4c
Showing 2 changed files with 60 additions and 38 deletions.
Binary file added docs/software/img/jobstats_two_nodes.png
98 changes: 60 additions & 38 deletions docs/software/jobstats.md

![jobstats plot](./img/jobstats_c_555912-l_1-k_bad_job_01.png)


`jobstats` is an UPPMAX tool to discover the resource usage
of jobs submitted to the [Slurm](../cluster_guides/slurm.md) job queue.

```
jobstats --plot [options] [ -M cluster ] [ jobid [ jobid ... ] | -A project | - ]
```

The most common use is `jobstats --plot`
to see resource use in a graphical plot.

## `jobstats --plot`

With the `--plot` (or `-p`) option,
a plot is produced showing the resource use per node
for a job that completed successfully and took longer than 5 minutes.
Cancelled jobs and jobs that ran for less than 5 minutes do not produce jobstats files.
If a job booked nodes that it did not use,
no jobstats data is available for those nodes,
and the plot shows a blank panel for each such node
together with the message 'node booked but unused'.

To be able to see the plots generated by `jobstats`,
either use [SSH with X-forwarding](../software/ssh_x_forwarding.md)
or [login to a remote desktop](../getting_started/login.md).
X-forwarding requires a local X server:
most Linux systems have one by default,
on a Mac you need to install XQuartz,
and on Windows you can use MobaXterm, an SSH client with a built-in X server.
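
As a minimal sketch, connecting with X-forwarding means adding `-Y` to your SSH command; the username and login host below are placeholders:

```
ssh -Y your_username@rackham.uppmax.uu.se
```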

There are many ways to use `--plot`; a minimal use could be:

```
jobstats --plot [job_id]
```

for example:

```
jobstats --plot 12345678
```
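
You can also plot the jobs of a whole project at once by passing `-A` with the project name; a minimal sketch, where `b2015999` stands in for your own project:

```
jobstats --plot -A b2015999
```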

The plot is saved in the current directory
with the name
`[cluster_name]-[project_name]-[user_name]-[jobid].png`,
for example `rackham-uppmax1234-sven-876543.png`.
Use any image viewer, e.g. [eog](eog.md), to see it.
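
For example, to open the plot named above with `eog`:

```
eog rackham-uppmax1234-sven-876543.png
```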

Each plot shows:

* detailed information in the title: the job number, cluster, end time and duration, user, project, job name, and usage flags
* CPU usage in blue
* current memory usage in solid black
* overall maximum memory usage in dotted black (if available)

For example, in this plot:

![jobstats showing a single-node job](./img/jobstats_c_555912-l_1-k_milou-b2010042-douglas-8769275.png)

* the title shows the detailed information. `milou` is the name of a former UPPMAX cluster
* the CPU usage (blue) is around 1000%, which is the equivalent of 10 cores being used at 100%
* the current memory usage (solid black) is around 20 GB
* the overall maximum memory usage (dotted black) is around 340 GB

For jobs running on multiple nodes, plots have multiple columns:

![jobstats showing a job that used two nodes](./img/jobstats_two_nodes.png)

## Interpretation guidelines

When you are looking through the plots you just created,
start thinking about how you could change your bookings
so that the jobs run more efficiently.
Usually it is just a matter of changing how many cores you book.

Here are some guidelines that you can follow when looking for inefficient jobs:

* Is the blue line (the job's CPU usage) at the top of the graph most of the time (>80%)?
  If so, there is no need to do anything and no need to check the rest of this list.
* Is the horizontal dotted black line (the job's maximum memory usage)
  close to the top of the graph (>80%)?
  If so, there is no need to do anything and no need to check the rest of this list.

If neither of these two is true, you should adjust the number of cores you book.
Look at where the horizontal dotted black line usually is in jobs of this type.
Check how many GiB of RAM that point represents.
Book enough cores to keep your jobs from exceeding the allowed RAM,
maybe with 1-2 cores extra to avoid being too close to the limit if the variance is high.
You get 8 GiB of RAM per core you book.
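
As a worked sketch of this arithmetic (the 70 GiB peak is only an example value, and the project name, partition, and script name in the `sbatch` line are illustrative):

```
# Suppose the dotted line of earlier jobs of this type peaks at about 70 GiB.
# At 8 GiB of RAM per booked core, that needs at least 70 / 8 = 8.75, i.e. 9 cores.
# Add 1 core as a safety margin and book 10 cores for the next job:
sbatch -A b2015999 -p core -n 10 my_job_script.sh
```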

If you follow these guidelines you will be using the resources efficiently.
If everyone did this there would probably not even be a queue
to run your jobs most of the time.
Of course, there are grey areas and jobs that have very unpredictable RAM requirements.
In these cases it is hard to get efficient usage, but they are few and far between.

Here are some examples of how inefficient jobs can look
and what you can do to make them more efficient.

### Inefficient job example 1

