diff --git a/docs/software/img/jobstats_two_nodes.png b/docs/software/img/jobstats_two_nodes.png
new file mode 100644
index 000000000..d10391b34
Binary files /dev/null and b/docs/software/img/jobstats_two_nodes.png differ
diff --git a/docs/software/jobstats.md b/docs/software/jobstats.md
index a42f63862..e862c0c48 100644
--- a/docs/software/jobstats.md
+++ b/docs/software/jobstats.md
@@ -2,66 +2,88 @@

 ![jobstats plot](./img/jobstats_c_555912-l_1-k_bad_job_01.png)

 `jobstats` is an UPPMAX tool to enable discovery of resource usage for jobs submitted to the [Slurm](../cluster_guides/slurm.md) job queue.

-```
-jobstats --plot [options] [ -M cluster ] [ jobid [ jobid ... ] | -A project | - ]
-```
-
-With the -p/--plot option, a plot is produced from the jobstats for each
-jobid. Plots contain one panel per booked node showing CPU (blue) and memory
-usage (black) traces and include text lines indicating the job number, cluster,
-end time and duration, user, project, job name, and usage flags (more on those
-below). For memory usage, one or two traces are shown: a solid black line
-shows instantaneous memory usage, and a dotted black line
-shows overall maximum memory usage if this information is available.
-
-Plots are saved to the current directory with the name
+The most common use is `jobstats --plot`,
+which shows the resource use of a job in a graphical plot.

-cluster-project-user-jobid.png
+## `jobstats --plot`

-To view the images you can either download them from UPPMAX, or use Xforwarding. The latter is the quickest way. To do this you will need to connect to UPPMAX with the -Y option
+With the `--plot` (or `-p`) option,
+a plot is produced that shows the resource use per node
+for a job that completed successfully and ran for longer than 5 minutes.

-# connect
-
-To be able to see the plots generated by `jobstats`,
-either use [SSH with X-forwarding](../software/ssh_x_forwarding.md)
-or [login to a remote desktop](../getting_started/login.md)
-
-# generate the plots
+There are many ways to use `--plot`; a minimal use is:

 ```
-$ jobstats --plot -A b2015999
+jobstats --plot [job_id]
 ```

-and then you can use the image viewer eog to view the files.
+for example:

 ```
-$ eog *.png
+jobstats --plot 12345678
 ```

-For this to work you will have to use a computer that has a X-server. Most linux systems have this by default, and Macs used to have it as default before they removed it during 2014. To get this feature back you have to install Xquartz. If you are using a Windows computer you can download the program MobaXterm which is a ssh client with a built-in X-server.
+The plot is saved in the current folder,
+with the name
+`[cluster_name]-[project_name]-[user_name]-[job_id].png`,
+for example `rackham-uppmax1234-sven-876543.png`.
+Use any image viewer, e.g. [eog](eog.md), to see it.
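+
+If you want to plot all jobs of a project and then view the results,
+a minimal sketch could be
+(the project name `uppmax1234` is a placeholder for your own project;
+`-A` selects a project instead of a single job id,
+and [eog](eog.md) is the image viewer mentioned above):
+
+```
+# plot the jobs of a whole project instead of a single job
+jobstats --plot -A uppmax1234
+
+# view the resulting plots
+eog *.png
+```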

-An example plot, this was named milou-b2010042-douglas-8769275.png:
+Each plot shows:

-![](./img/jobstats_c_555912-l_1-k_milou-b2010042-douglas-8769275.png)
+ * detailed information in the title:
+   job number, cluster, user, project, job name, end time, duration, and usage flags
+ * CPU usage in blue
+ * current memory usage in solid black
+ * overall maximum memory usage in dotted black (if available)

-For multiple-node jobs, plots have a two-column format.
+For example, in this plot:

-Note that not all jobs will produce jobstats files, particularly if the job was cancelled or ran for less than 5 minutes. Also, if a job booked nodes inefficiently by not using nodes it asked for, jobstats files will not be available for the booked but unused nodes. In this case the plot will contain a blank panel for each such node together with the message 'node booked but unused'.
+![jobstats showing a single-node job](./img/jobstats_c_555912-l_1-k_milou-b2010042-douglas-8769275.png)

-## Interpretation guidelines
+ * the title shows the detailed information; `milou` is the name of a former UPPMAX cluster
+ * the CPU usage (blue) is around 1000%,
+   the equivalent of 10 cores being used at 100%
+ * the current memory usage (solid black) is around 20 GB
+ * the overall maximum memory usage (dotted black) is around 340 GB

-When you are looking through the plots you just created you can start thinking of how you can change your bookings so that the jobs are more efficient. Usually it's just a matter of changing how many cores you book and the problem is solved. Here are some guidelines that you can follow when looking for inefficient jobs:
+For jobs running on multiple nodes, the plot has multiple columns,
+with one panel per booked node:

-Is the blue line (the jobs cpu usage) at the top of the graph most of the time (>80%)? If so, no need to do anything, no need to check the rest of this list.
-Is the horizontal dotted black line (the jobs max memory usage) close to the top of the graph (>80%)? If so, no need to do anything, no need to check the rest of this list.
-If neither of 1 or 2 is true, you should adjust the number of cores you book. Look at where the horizontal dotted black line usually is in the jobs of this type. Check how many GiB of RAM that point represents. Book enough cores, maybe 1-2 cores extra to avoid being too close to the limit if the variance is high, to keep your jobs from exceeding the allowed used RAM. You get 8GiB RAM per core you book.
-If you follow these guidelines you will be using the resources efficiently. If everyone did this there would probably not even be a queue to run your jobs most of the time. Of course there are grey areas and jobs that have a very random ram requirements. In these cases it is hard to get efficient usage, but they are few and far between.
+![jobstats showing a job that used two nodes](./img/jobstats_two_nodes.png)
+
+## Interpretation guidelines

-Here are some examples of how inefficient jobs can look and what you can do to make them more efficient.
+When you are looking through the plots you just created,
+you can start thinking about how to change your bookings
+so that your jobs become more efficient.
+Usually it is just a matter of changing how many cores you book
+and the problem is solved.
+
+Here are some guidelines that you can follow when looking for inefficient jobs:
+
+ * Is the blue line (the job's CPU usage) at the top of the graph most of the time (>80%)?
+   If so, no need to do anything, no need to check the rest of this list.
+ * Is the horizontal dotted black line (the job's maximum memory usage)
+   close to the top of the graph (>80%)?
+   If so, no need to do anything, no need to check the rest of this list.
+
+If the answer to both questions is no, you should adjust the number of cores you book.
+Look at where the horizontal dotted black line usually ends up for jobs of this type
+and check how many GiB of RAM that point represents.
+You get 8 GiB of RAM per core you book,
+so book enough cores to cover that amount,
+maybe 1-2 cores extra to avoid being too close to the limit if the memory use varies a lot,
+to keep your jobs from exceeding the allowed RAM.
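+
+As a worked example (a sketch with illustrative values only:
+the 36 GiB peak, the project `uppmax1234` and the `core` partition
+are not taken from a real job):
+if the dotted line peaks at around 36 GiB, then 36 / 8 = 4.5,
+so at least 5 cores are needed, and booking 6 cores (48 GiB) leaves some margin.
+In a Slurm batch script, that booking could look like this:
+
+```
+#!/bin/bash
+#SBATCH -A uppmax1234
+# 6 cores give 6 x 8 GiB = 48 GiB of RAM,
+# enough for a job that peaks at about 36 GiB
+#SBATCH -p core
+#SBATCH -n 6
+#SBATCH -t 1:00:00
+
+# ... the actual commands of the job ...
+```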
+
+If you follow these guidelines, you will be using the resources efficiently.
+If everyone did this, there would probably not even be a queue
+to run your jobs most of the time.
+Of course, there are grey areas and jobs with very unpredictable RAM requirements.
+In these cases it is hard to achieve efficient usage, but they are few and far between.
+
+Here are some examples of how inefficient jobs can look
+and what you can do to make them more efficient.

 ### Inefficient job example 1